Why so Hard?
For a fixed number of processors, Amdahl's
law states that the maximum
speedup is limited by the percentage of a program which cannot be
parallelized:

Where N is the number of processors and Ts represents the percentage
of work which cannot be parallelized.
Consider a system with N = 4 and Ts = 20%. The maximum achievable
speedup is 2.5x. Even if the number of processors goes to infinity, the
maximum speedup is only 5x, limited entirely by the amount of work
which must remain serialized.
So there is significant pressure to maximize the proportion of
activity running in parallel. This may involve significant rework of
existing code into parallel form or switching to alternative, less
familiar algorithms which offer more inherent parallelism.
Expressing parallelism requires using new and unfamiliar APIs,
leading to programmers introducing bugs while mastering the interfaces.
More significantly, parallelism means multiple execution flows.
In a shared memory model, any execution flow can access any memory
location, so there are many possible orderings of memory access. It's
left to the programmer to determine that all orderings which can occur
will result in correct operation.
When multiple threads access the same memory location, and at least
one thread is writing to that location, the resultant data will depend
on the interleaved execution order of the threads.
This race condition leads to unpredictable results which may be
masked by external effects such as the thread scheduling policy. Proper
operation requires forcing partial execution order by using mutual
exclusion locks or critical sections.
These devices ensure that a thread has exclusive access to a memory
location to guarantee the proper ordering of reads and writes.
Unfortunately, every lock must be explicitly specified by the
programmer.
Sequential code is often filled with many order dependencies, which
are implicitly satisfied by the serialized execution of the program.
Miss any dependency in the parallel scenario and a latent bug has been
created.
The improper use of locks can also cause program execution to
deadlock. If two threads attempt to lock two memory locations, in
opposite order, each thread will obtain one lock but stall forever
waiting for the opposite one. A chain of data-dependent locking
relationships can make this difficult to catch.
To be more conservative, a programmer can restrict the order of
execution by the liberal use of locks. Besides increasing the chance of
deadlock, heavy-handed locking raises the amount of serialization
which, when considering the locking overhead, quickly reduces the
performance of the parallel program to less than the original
sequential version.
From a testing perspective, the quality of tests for sequential code
can be measured by path coverage. There are a significant number of
possible paths, and stimulating paths deep in the code can be
difficult.
The number of composite paths for parallel threads increases
exponentially, and explicitly controlling those paths may be
impossible. Experience from the world of hardware verification suggests
that concurrent execution requires extensive time and resources
dedicated to testing.
Nevertheless, shared memory systems are being successfully deployed.