Getting started with multicore programming: Part 1
What makes parallelizing C-code so hard
By Skip Hovsmith, CriticalBlue
Embedded.com
(07/07/08, 11:55:00 PM EDT)
Multicore platforms are becoming prevalent, and someone needs to program them. Initial multicore experiences for most embedded programmers will be with coherent shared memory systems. Compared to single core systems, these shared memory systems are much more challenging to program correctly.

Nevertheless, with an incremental development and test approach to parallelism and a willingness to apply lessons learned by previous parallel programmers, successful systems are being deployed today using existing C/C++ environments.

It's too Hot
The free lunch is over for programmers. [1] Though Moore's law marches on, and the number of economically manufacturable transistors per chip continues increasing, clock frequencies have hit a wall because of power dissipation. It's gotten too hot to handle.

Instead of increasing the clock frequency, designers can use larger transistor budgets to do more work per clock cycle. Within a single processor pipeline, techniques such as instruction-level parallelism, hardware threads, and data-parallel (SIMD) instructions have reached the point of diminishing returns.

It now makes more hardware sense to add multiple processor cores on chip and turn to task level parallelism. It's left to software engineers to properly exploit these multicore architectures.

Multicore systems (Figure 1, below) are typically characterized by number and type of cores, memory organization, and interconnection network. From a programming model perspective, it is useful to consider the memory architecture first.

Figure 1: Multicore Architectures

Memory architectures can be broadly classified as shared or distributed.  In a typical shared memory all cores uniformly share the same memory. Cores share information by accessing the same memory locations.

<>Lightweight threads, defined as multiple instruction streams sharing the same memory space, are a natural abstraction for a shared memory programming model. The programming model is familiar to multithreading programmers of single core systems. Vendors in both desktop/server and embedded markets offer coherent shared memory systems, so there are a growing number of shared memory platforms available to programmers.

In a typical distributed memory system, memory units are closely coupled to their cores. Each core manages its own memory, and cores communicate information by sending and receiving data between them. Processes running on different cores and sharing data through message passing, are a common abstraction for a distributed memory programming model.

In shared memory systems, data communication is implicit; data is shared between threads simply by accessing the same memory location. If the cores use cache memories, their view of main memory must be kept coherent between them.

As the number of cores increases, the cost of maintaining coherence between caches rises quickly, so it is unlikely this architecture will scale effectively to hundreds of cores.

However, with distributed memory architectures, the hardware design scales relatively easily. Since memory is not shared, the programmer must explicitly describe inter-core communication, and interconnection network performance becomes important.

Driven by the advantages of matching multiple execution pipelines to shared memory, it's probable that a hybrid on-chip architecture (Figure 2 below) will emerge as the number of cores per chip increases. This architecture is already in use at the board level to connect clusters of shared memory chips.

Figure 2: Hybrid Distributed Shared Memory Architecture.

It is likely that most programmers' initial multicore experience will involve some type of shared memory platform. Though the programming model appears straightforward, these systems are notoriously difficult to program correctly.

Why so Hard?
For a fixed number of processors, Amdahl's law states that the maximum speedup is limited by the percentage of a program which cannot be parallelized:

Where N is the number of processors and Ts represents the percentage of work which cannot be parallelized.

Consider a system with N = 4 and Ts = 20%. The maximum achievable speedup is 2.5x. Even if the number of processors goes to infinity, the maximum speedup is only 5x, limited entirely by the amount of work which must remain serialized.

So there is significant pressure to maximize the proportion of activity running in parallel. This may involve significant rework of existing code into parallel form or switching to alternative, less familiar algorithms which offer more inherent parallelism.

Expressing parallelism requires using new and unfamiliar APIs, leading to programmers introducing bugs while mastering the interfaces. More significantly, parallelism means multiple execution flows.

In a shared memory model, any execution flow can access any memory location, so there are many possible orderings of memory access. It's left to the programmer to determine that all orderings which can occur will result in correct operation.

When multiple threads access the same memory location, and at least one thread is writing to that location, the resultant data will depend on the interleaved execution order of the threads.

This race condition leads to unpredictable results which may be masked by external effects such as the thread scheduling policy. Proper operation requires forcing partial execution order by using mutual exclusion locks or critical sections.

These devices ensure that a thread has exclusive access to a memory location to guarantee the proper ordering of reads and writes. Unfortunately, every lock must be explicitly specified by the programmer.

Sequential code is often filled with many order dependencies, which are implicitly satisfied by the serialized execution of the program. Miss any dependency in the parallel scenario and a latent bug has been created.

The improper use of locks can also cause program execution to deadlock. If two threads attempt to lock two memory locations, in opposite order, each thread will obtain one lock but stall forever waiting for the opposite one. A chain of data-dependent locking relationships can make this difficult to catch.

To be more conservative, a programmer can restrict the order of execution by the liberal use of locks. Besides increasing the chance of deadlock, heavy-handed locking raises the amount of serialization which, when considering the locking overhead, quickly reduces the performance of the parallel program to less than the original sequential version.

From a testing perspective, the quality of tests for sequential code can be measured by path coverage. There are a significant number of possible paths, and stimulating paths deep in the code can be difficult.

The number of composite paths for parallel threads increases exponentially, and explicitly controlling those paths may be impossible. Experience from the world of hardware verification suggests that concurrent execution requires extensive time and resources dedicated to testing.

Nevertheless, shared memory systems are being successfully deployed.

Parallelize incrementally, Test frequently
With all these potential pitfalls, it seems appropriate to adopt a "first do no harm" mentality. The greatest challenge is usually not in finding the parallelism, but in exploiting it in a correct and efficient manner.

An initial methodology starts with sequential code and incrementally adds parallelism, testing the changes at each step of the way. The original code base should be functionally correct with a good test suite. The test suite should contain thorough black-box system tests which exercise the code without knowledge of how the functionality is implemented.

It should also include a profiling suite which pinpoints which portions of the code require the most execution time over a range of operating conditions. Considering Amdahl's law; it is crucial to focus first on areas in the code base which account for most of the execution time.

It's important to understand the original implementation. Leverage profiling and analysis tools to understand the ordering dependencies in the code that restrict how parallelism can be applied. Two types of dependencies to consider are read-after-write (RAW) and write-after-read (WAR).

RAW represents true data dependencies. In this case, a value cannot be read before it is written. This is satisfied in original code by the sequential statement ordering, but in parallel code, the read must come after the write, and no other write to the same location can occur between them. A write lock can be used to guarantee this ordering.

WAR dependencies are also known as anti-dependencies. In this case, a memory location cannot be written until after the previous value has been read. These dependencies often occur in sequential programs when memory is reused in different parts of a program to save space. The dependencies can be overcome by duplicating or buffering storage resources to account for overlapped execution.

On the practical side, it is important to understand the target platform. It makes sense to match the number of runnable threads to the number of execution pipelines.

For example, if running on a quad core platform where each core supports two hardware threads, eight runnable threads is a good target. Increasing the number of threads way beyond that will require overhead for thread startup, switching, and shutdown, eventually degrading the overall performance instead of accelerating it.

Data layout which is critical in single core systems is even more critical in shared memory architectures. Separate threads should access non-overlapping memory locations as much as possible to minimize contention between individual core caches or expect to pay a high price to maintain coherency.

Fortunately, there has been significant work performed in identifying best practices for decomposing programs into parallel regions. These patterns involve decomposing a problem in ways which maximize parallelism while satisfying data dependencies.

(See [2] for a comprehensive collection of parallel programming patterns applied at various levels of abstraction. A lot of insight can be gained by leveraging the experiences of others in similar situations.)

Understanding the original program, the target platform, and best practices all influence the refactoring of code from sequential to a proper parallel implementation. Each incremental change that increases parallelism also has the potential to introduce bugs, so take small steps and test frequently.

Example: Sobel Edge Detection Example
An implementation of the Sobel edge detection algorithm is used as a simple example to demonstrate different approaches for sequential to parallel code transformation.

The Sobel algorithm approximates the two dimensional gradient of the image at each point. The magnitude of the gradient represents how quickly the image is changing. A large value indicates a likely edge transition and shows up as white on the transformed image as shown for the standard grayscale Lena.  (Figure 3 below).

Figure 3: Grayscale Lena

The gradient is approximated by first estimating the horizontal and vertical derivatives using two 3x3 matrices:

The magnitude is further approximated by summing the absolute value of the gradients:

Listing 1 below is a C language implementation of the Sobel function.


Listing 1: Original Sobel Function

The Sobel function is sensitive to noise, so it is typical to run the image through a smoothing filter first. A linear 5x5 filter is shown in Listing 2 below.

Listing 2: Original Filter Function

The results of running the smoothing filter and Sobel function sequentially on the Lena image are shown in Figure 3 above. Profiling the two functions shows that the smoothing function takes approximately twice as long as the Sobel function.

Next in Part 2: Mulththreading in C

Skip Hovsmith is the Director of Application Engineering in the US for CriticalBlue, working on accelerating embedded software through synthesis of programmable multicore processors for structured ASICs, SoCs, and FPGAs. Prior to CriticalBlue, Skip worked for several start ups in formal verification, FPGA design, and enterprise software services, and at National Semiconductor working on virtual prototyping, hw/sw co-design, reconfigurable systems, and standard cell product design.

References
[1] Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, 30(3), March 2005.
[2] Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill, Patterns for Parallel Programming. Boston: Addison-Wesley, 2005.
[3] More about multicores and multiprocessors