CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Getting started with multicore programming: Part 1
What makes parallelizing C-code so hard



Embedded.com

Parallelize incrementally, Test frequently
With all these potential pitfalls, it seems appropriate to adopt a "first do no harm" mentality. The greatest challenge is usually not in finding the parallelism, but in exploiting it in a correct and efficient manner.

An initial methodology starts with sequential code and incrementally adds parallelism, testing the changes at each step of the way. The original code base should be functionally correct with a good test suite. The test suite should contain thorough black-box system tests which exercise the code without knowledge of how the functionality is implemented.

It should also include a profiling suite which pinpoints which portions of the code require the most execution time over a range of operating conditions. Considering Amdahl's law; it is crucial to focus first on areas in the code base which account for most of the execution time.

It's important to understand the original implementation. Leverage profiling and analysis tools to understand the ordering dependencies in the code that restrict how parallelism can be applied. Two types of dependencies to consider are read-after-write (RAW) and write-after-read (WAR).

RAW represents true data dependencies. In this case, a value cannot be read before it is written. This is satisfied in original code by the sequential statement ordering, but in parallel code, the read must come after the write, and no other write to the same location can occur between them. A write lock can be used to guarantee this ordering.

WAR dependencies are also known as anti-dependencies. In this case, a memory location cannot be written until after the previous value has been read. These dependencies often occur in sequential programs when memory is reused in different parts of a program to save space. The dependencies can be overcome by duplicating or buffering storage resources to account for overlapped execution.

On the practical side, it is important to understand the target platform. It makes sense to match the number of runnable threads to the number of execution pipelines.

For example, if running on a quad core platform where each core supports two hardware threads, eight runnable threads is a good target. Increasing the number of threads way beyond that will require overhead for thread startup, switching, and shutdown, eventually degrading the overall performance instead of accelerating it.

Data layout which is critical in single core systems is even more critical in shared memory architectures. Separate threads should access non-overlapping memory locations as much as possible to minimize contention between individual core caches or expect to pay a high price to maintain coherency.

Fortunately, there has been significant work performed in identifying best practices for decomposing programs into parallel regions. These patterns involve decomposing a problem in ways which maximize parallelism while satisfying data dependencies.

(See [2] for a comprehensive collection of parallel programming patterns applied at various levels of abstraction. A lot of insight can be gained by leveraging the experiences of others in similar situations.)

Understanding the original program, the target platform, and best practices all influence the refactoring of code from sequential to a proper parallel implementation. Each incremental change that increases parallelism also has the potential to introduce bugs, so take small steps and test frequently.

Example: Sobel Edge Detection Example
An implementation of the Sobel edge detection algorithm is used as a simple example to demonstrate different approaches for sequential to parallel code transformation.

The Sobel algorithm approximates the two dimensional gradient of the image at each point. The magnitude of the gradient represents how quickly the image is changing. A large value indicates a likely edge transition and shows up as white on the transformed image as shown for the standard grayscale Lena.  (Figure 3 below).

Figure 3: Grayscale Lena

The gradient is approximated by first estimating the horizontal and vertical derivatives using two 3x3 matrices:

The magnitude is further approximated by summing the absolute value of the gradients:

Listing 1 below is a C language implementation of the Sobel function.


Listing 1: Original Sobel Function

The Sobel function is sensitive to noise, so it is typical to run the image through a smoothing filter first. A linear 5x5 filter is shown in Listing 2 below.

Listing 2: Original Filter Function

The results of running the smoothing filter and Sobel function sequentially on the Lena image are shown in Figure 3 above. Profiling the two functions shows that the smoothing function takes approximately twice as long as the Sobel function.

Next in Part 2: Mulththreading in C

Skip Hovsmith is the Director of Application Engineering in the US for CriticalBlue, working on accelerating embedded software through synthesis of programmable multicore processors for structured ASICs, SoCs, and FPGAs. Prior to CriticalBlue, Skip worked for several start ups in formal verification, FPGA design, and enterprise software services, and at National Semiconductor working on virtual prototyping, hw/sw co-design, reconfigurable systems, and standard cell product design.

References
[1] Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, 30(3), March 2005.
[2] Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill, Patterns for Parallel Programming. Boston: Addison-Wesley, 2005.
[3] More about multicores and multiprocessors

1 | 2 | 3

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :