Multicore platforms are becoming prevalent, and someone needs to
program them. Initial multicore experiences for most embedded
programmers will be with coherent shared memory systems. Compared to
single core systems, these shared memory systems are much more
challenging to program correctly.
Nevertheless, with an incremental development and test approach to
parallelism and a willingness to apply lessons learned by previous
parallel programmers, successful systems are being deployed today using
existing C/C++ environments.
It's Too Hot
The free lunch is over for programmers. [1] Though Moore's law marches
on, and the number of economically manufacturable transistors per chip
continues increasing, clock frequencies have hit a wall because of
power dissipation. It's gotten too hot to handle.
Instead of increasing the clock frequency, designers can use larger
transistor budgets to do more work per clock cycle. Within a single
processor pipeline, techniques such as instruction-level parallelism,
hardware threads, and data-parallel (SIMD) instructions have reached
the point of diminishing returns.
It now makes more hardware sense to add multiple processor cores on
chip and turn to task-level parallelism. It's left to software
engineers to properly exploit these multicore architectures.
Multicore systems (Figure 1, below) are typically characterized by the
number and type of cores, the memory organization, and the
interconnection network. From a programming model perspective, it is
useful to consider the memory architecture first.
Figure 1: Multicore Architectures
Memory architectures can be broadly classified as shared or
distributed. In a typical shared memory system, all cores uniformly
share the same memory. Cores share information by accessing the same
memory locations.
Lightweight threads, defined as multiple instruction streams sharing
the same memory space, are a natural abstraction for a shared memory
programming model. The programming model is familiar to multithreading
programmers of single core systems. Vendors in both desktop/server and
embedded markets offer coherent shared memory systems, so there is a
growing number of shared memory platforms available to programmers.
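As a minimal sketch of this model, assuming a C++11 toolchain with
std::thread (the article names no specific threading API), the
hypothetical function parallel_sum below splits an array across worker
threads. Every thread reads the same input array directly, and each
writes its partial result to its own slot of a shared results array, so
no data is copied between threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: sum an array using num_threads worker threads
// (num_threads must be at least 1). All threads share one address space,
// so each worker reads `data` in place and writes to its own element of
// `partial`. No two threads write the same location.
long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Round up so the slices cover the whole array.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long sum = 0;
            for (std::size_t i = begin; i < end; ++i) {
                sum += data[i];
            }
            partial[t] = sum;  // private slot in shared memory
        });
    }

    // Joining guarantees the workers' writes are visible to this thread.
    for (auto& w : workers) {
        w.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Calling parallel_sum(samples, 4) on a vector of samples would use four
threads. Giving each worker a private output slot is a design choice
that avoids any locking in this sketch.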
In a typical distributed memory system, memory units are closely
coupled to their cores. Each core manages its own memory, and cores
communicate information by sending and receiving data between them.
Processes that run on different cores and share data through message
passing are a common abstraction for a distributed memory programming
model.
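The message-passing style can be sketched on an ordinary desktop
compiler. In the example below, two threads stand in for two cores with
private memories, and the Channel class and sum_via_messages function
are hypothetical names invented for illustration. A real distributed
memory platform would more likely supply its own message-passing
library or hardware mailboxes. The point is only that the two sides
never touch each other's variables; all data moves through explicit
send and receive calls.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical blocking channel: the only path by which the two
// "cores" in this sketch exchange data.
template <typename T>
class Channel {
public:
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // Blocks until a message is available.
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};

// The producer sends 1..n followed by a 0 sentinel; the consumer adds up
// whatever it receives. Each side keeps its working data private.
long sum_via_messages(int n) {
    Channel<int> channel;
    long total = 0;  // written only by the consumer, read after join

    std::thread producer([&channel, n] {
        for (int i = 1; i <= n; ++i) {
            channel.send(i);
        }
        channel.send(0);  // sentinel: no more messages
    });
    std::thread consumer([&channel, &total] {
        for (int v = channel.receive(); v != 0; v = channel.receive()) {
            total += v;
        }
    });

    producer.join();
    consumer.join();
    return total;
}
```

Compared with the shared memory style, the communication here is
explicit in the source, which is the property the distributed memory
model trades for hardware scalability.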
In shared memory systems, data communication is implicit; data is
shared between threads simply by accessing the same memory location. If
the cores use cache memories, their view of main memory must be kept
coherent between them.
As the number of cores increases, the cost of maintaining coherence
between caches rises quickly, so it is unlikely this architecture will
scale effectively to hundreds of cores.
However, with distributed memory architectures, the hardware design
scales relatively easily. Since memory is not shared, the programmer
must explicitly describe inter-core communication, and interconnection
network performance becomes important.
Driven by the advantages of matching multiple execution pipelines to
shared memory, it's probable that a hybrid on-chip architecture
(Figure 2, below) will emerge as the number of cores per chip
increases. This architecture is already in use at the board level to
connect clusters of shared memory chips.
Figure 2: Hybrid Distributed Shared Memory Architecture
It is likely that most programmers' initial multicore experience
will involve some type of shared memory platform. Though the
programming model appears straightforward, these systems are
notoriously difficult to program correctly.
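One common source of that difficulty is unsynchronized updates to a
shared variable. The sketch below, again assuming C++11 and using the
hypothetical name count_events, has several threads increment one
shared counter. With a plain long, the read-modify-write steps of
different threads could interleave and lose updates, and C++ treats
that as a data race with undefined behavior. Declaring the counter as
std::atomic is one of several possible remedies; a mutex would be
another.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical example: num_threads workers each add to one shared
// counter. std::atomic makes every increment indivisible, so none are
// lost. Replacing it with a plain `long` would introduce a data race.
long count_events(unsigned num_threads, long increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                // Relaxed ordering suffices here: only the final total
                // matters, and join() below orders it before the load.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

The racy version looks almost identical in source form, which is part
of why such bugs are easy to write and hard to spot in review.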