Parallelize incrementally, Test
frequently
With all these potential pitfalls, it seems appropriate to adopt a
"first do no harm" mentality. The greatest challenge is usually not in
finding the parallelism, but in exploiting it in a correct and
efficient manner.
An initial methodology starts with sequential code and
incrementally
adds parallelism, testing the changes at each step of the way. The
original code base should be functionally correct with a good
test suite. The test suite should contain thorough black-box system
tests which exercise the code without knowledge of how the
functionality is implemented.
It should also include a profiling suite which pinpoints which
portions of the code require the most execution time over a range of
operating conditions. Considering Amdahl's law; it is crucial to focus
first on areas in the code base which account for most of the execution
time.
It's important to understand the original implementation. Leverage
profiling and analysis tools to understand the ordering dependencies in
the code that restrict how parallelism can be applied. Two types of
dependencies
to consider are read-after-write
(RAW) and
write-after-read (WAR).
RAW represents true data dependencies. In this case, a value cannot
be read before it is written. This is satisfied in original code by the
sequential statement ordering, but in parallel code, the read must come
after the write, and no other write to the same location can occur
between them. A write lock can be used to guarantee this ordering.
WAR dependencies are also known as anti-dependencies. In this case,
a memory location cannot be written until after the previous value has
been read. These dependencies often occur in sequential programs when
memory is reused in different parts of a program to save space. The
dependencies can be overcome by duplicating or buffering storage
resources to account for overlapped execution.
On the practical side, it is important to understand the target
platform. It makes sense to match the number of runnable threads to the
number of execution pipelines.
For example, if running on a quad core platform where each core
supports two hardware threads, eight runnable threads is a good target.
Increasing the number of threads way beyond that will require overhead
for thread startup, switching, and shutdown, eventually degrading the
overall performance instead of accelerating it.
Data layout which is critical in single core systems is even more
critical in shared memory architectures. Separate threads should access
non-overlapping memory locations as much as possible to minimize
contention between individual core caches or expect to pay a high price
to maintain coherency.
Fortunately, there has been significant work performed in
identifying best practices for decomposing programs into parallel
regions. These patterns involve decomposing a problem in ways which
maximize parallelism while satisfying data dependencies.
(See [2] for a comprehensive collection
of parallel programming
patterns applied at various levels of abstraction. A lot of insight can
be gained by leveraging the experiences of others in similar
situations.)
Understanding the original program, the target platform, and best
practices all influence the refactoring of code from sequential to a
proper parallel implementation. Each incremental change that increases
parallelism also has the potential to introduce bugs, so take small
steps and test frequently.
Example: Sobel Edge Detection
Example
An implementation of the Sobel edge
detection algorithm is used as a
simple example to demonstrate different approaches for sequential to
parallel code transformation.
The Sobel algorithm approximates the two dimensional gradient of the
image at each point. The magnitude of the gradient represents how
quickly the image is changing. A large value indicates a likely edge
transition and shows up as white on the transformed image as shown for
the standard grayscale
Lena.
(Figure 3 below).
 |
| Figure
3: Grayscale Lena |
The gradient is approximated by first estimating the horizontal and
vertical derivatives using two 3x3 matrices:

The magnitude is further approximated by summing the absolute value
of the gradients:

Listing 1 below is a C
language implementation of the Sobel
function.

 |
| Listing
1: Original Sobel Function |
The Sobel function is sensitive to noise, so it is typical to run
the image through a smoothing filter first. A linear 5x5 filter is
shown in Listing 2 below.
 |
| Listing
2: Original Filter Function |
The results of running the smoothing filter and Sobel function
sequentially on the Lena image are shown in Figure 3 above. Profiling
the two functions shows that the smoothing function takes approximately
twice as long as the Sobel function.
Next in Part 2: Mulththreading in C
Skip Hovsmith is the Director of
Application Engineering in the US
for CriticalBlue, working on
accelerating embedded software through
synthesis of programmable multicore processors for structured ASICs,
SoCs, and FPGAs. Prior to CriticalBlue, Skip worked for several start
ups in formal verification, FPGA design, and enterprise software
services, and at National Semiconductor working on virtual prototyping,
hw/sw co-design, reconfigurable systems, and standard cell product
design.
References
[1] Sutter, Herb. "
The Free Lunch Is Over: A Fundamental Turn
Toward
Concurrency in Software." Dr. Dobb's Journal, 30(3), March
2005.
[2] Mattson, Timothy G.,
Beverly A. Sanders, and Berna L. Massingill,
Patterns
for Parallel Programming. Boston: Addison-Wesley, 2005.
[3] More
about multicores and multiprocessors