For Kevin Kissell of MIPS, decisions about multi-core or multithreading are more than just either/or
By Kevin D. Kissell, MIPS Technologies Inc.
Is multithreading better than multi-core? Is multi-core better than
multithreading? One might as well ask whether a diesel engine is better
than four-wheel drive! The best vehicle for a given application might
have one, the other, or both. Or neither. They are independent, but
complementary, design decisions.
With multithreaded processors and multi-core chips becoming the
norm, architects and designers of digital systems need to understand
their respective attributes, advantages, and disadvantages.
Tapping Concurrency as a Resource
What multithreading and multi-core have in common is that they both
exploit the concurrency in a computational workload. The cost, in
silicon, energy, and
complexity, of making a CPU run a single instruction stream ever faster
goes up non-linearly, and eventually hits a wall imposed by the
physical limitations of circuit technology.
That wall keeps moving out a little further every year, but cost and
power-sensitive designs are constrained to follow the bleeding edge at
a safe distance. Fortunately, virtually all computer applications have
some degree of concurrency: At least some of the time, there are two or
more independent tasks that need to be performed. Taking advantage of
concurrency to improve computing performance and efficiency isn't
always trivial, but it's certainly easier than violating the laws of
physics.
Multi-processor, or multi-core, systems exploit concurrency to
spread work around a system. A system can run as many software tasks
at the same time as it has processors. This can be used to improve
absolute performance, cost, or power/performance. Clearly, once one has
built the fastest single processor possible in a given technology, the
only way to get even more compute power is to use more than one of
them.
More subtly, if a load that would saturate a 1GHz processor could be
evenly spread across four processors, those processors could be run at
roughly 250MHz each. If each 250MHz processor is less than one-fourth
the size of the 1GHz processor, or consumes less than one-fourth the
power, either of which may be the case because of the non-linear cost
of higher operating frequencies, the multi-core system might be more
economical.
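The arithmetic behind this argument can be sketched with the common first-order model of CMOS dynamic power, P = C * V^2 * f. This is a rough illustration only: the capacitance and voltage values below are hypothetical, chosen just to show the shape of the trade-off, and the model assumes that the slower cores have the same switched capacitance as the fast one and can run at a lower supply voltage.

```python
def dynamic_power(capacitance, voltage, freq_hz):
    """First-order CMOS dynamic power estimate: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * freq_hz

# Hypothetical, illustrative values; not measurements of any real core.
C_EFF = 1.0e-9   # effective switched capacitance per core, farads (assumed)
V_FAST = 1.2     # supply voltage assumed necessary to reach 1 GHz
V_SLOW = 0.9     # lower supply voltage assumed sufficient at 250 MHz

single = dynamic_power(C_EFF, V_FAST, 1.0e9)      # one core at 1 GHz
quad = 4 * dynamic_power(C_EFF, V_SLOW, 250.0e6)  # four cores at 250 MHz

print(f"one core at 1 GHz:     {single:.2f} W")
print(f"four cores at 250 MHz: {quad:.2f} W")
print(f"quad/single ratio:     {quad / single:.2f}")
```

Under these assumptions the aggregate clock rate is the same in both cases, so the entire saving comes from the voltage term. If the slower cores could not run at a lower voltage, this model would show no power advantage at all.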
Many designers of embedded SoCs are already exploiting concurrency
with multiple cores. Unlike the workloads of general-purpose
workstations and servers, which are variable and unknowable to system
designers, the fixed set of functions in an embedded device can often
be analyzed and decomposed into specialized tasks. These tasks can then
be assigned across multiple processors, each of which has a specific
responsibility and can be specified and configured optimally for that
job.
Multithreaded processors also exploit the concurrency of multiple
tasks, but in a different way, and for a different reason. Instead of
a system-level technique to spread CPU load, multithreading is a
processor-level optimization to improve area and energy efficiency.
Multithreaded architecture is driven to a large degree by the
observation that single-threaded high-performance processors spend a
surprising amount of time doing nothing. When the results of a memory
access are required for a program to advance, and that access must
reference RAM whose cycle time is tens of times slower than that of the
processor, a single-threaded processor is condemned to stall until the
data is returned.
The multithreading hypothesis can be stated as: If latencies prevent
a single task from keeping a processor pipeline busy, then a single
pipeline should be able to complete more than one concurrent task in
less time than it would take to run the tasks serially. This means
running more than one task's instruction stream, or thread, at a time,
which in turn means that the processor has to have more than one
program counter, and more than one set of programmable registers.
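The hypothesis can be illustrated with a toy cycle-count model. This is a minimal sketch, not a model of any real pipeline: it assumes each task is a list of (compute cycles, stall cycles) bursts, that the pipeline issues from at most one ready thread per cycle, and that a thread waiting on memory leaves the pipeline free. The function names and the workload are hypothetical.

```python
def run_serial(tasks):
    """Cycles needed to run the tasks one after another on one thread context."""
    return sum(compute + stall for task in tasks for compute, stall in task)

def run_interleaved(tasks):
    """Cycles needed with one hardware thread context per task.

    Each cycle, the first ready thread (fixed priority) issues one cycle of
    compute; threads stalled on memory do not occupy the pipeline.
    Assumes every burst has at least one compute cycle.
    """
    threads = [{"bursts": list(task), "ready_at": 0} for task in tasks]
    cycle = 0
    while any(th["bursts"] for th in threads):
        for th in threads:
            if th["bursts"] and th["ready_at"] <= cycle:
                compute, stall = th["bursts"][0]
                if compute > 1:
                    th["bursts"][0] = (compute - 1, stall)
                else:
                    # Burst finished: thread waits out its memory stall.
                    th["bursts"].pop(0)
                    th["ready_at"] = cycle + 1 + stall
                break
        cycle += 1
    # A task is complete only when its final stall has also elapsed.
    return max((th["ready_at"] for th in threads), default=0)

# Hypothetical workload: 10 bursts of 4 compute cycles, each followed by a 6-cycle stall.
task = [(4, 6)] * 10
serial_cycles = run_serial([task, task])
interleaved_cycles = run_interleaved([task, task])
print(serial_cycles, interleaved_cycles)
```

In this model the second thread's compute bursts fit almost entirely inside the first thread's stalls, so two tasks finish in little more time than one. With a single task, the two functions return the same cycle count, since there is nothing to overlap.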
Replicating those resources is far less costly than replicating an
entire processor. In the MIPS32 34K processor, for example, which
implements the MIPS MT multithreading architecture, an additional 14%
of area can buy
an additional 60% of throughput, relative to a comparable
single-threaded core. (Measured using the EEMBC PKFLOW and OSPF
benchmarks, run sequentially on a MIPS32 24KE core versus concurrently
on a dual-threaded MIPS32 34K core.)
Multi-processor architectures are infinitely scalable, in theory.
However many processors one has, one can always imagine adding another,
though only a limited class of problems can make practical use of
thousands of CPUs. Each additional processor core on an SoC adds to the
area of the chip at least as much as it adds to the performance.
Multithreading a single processor can only improve performance up to
the level where the execution units are saturated. However, up to that
limit, it can provide a "superlinear" payback for the investment in die
size.
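The "superlinear" claim can be checked with simple arithmetic on the 34K figures quoted above (14% more area for 60% more throughput). The dual-core line is an assumption for comparison, not a measurement: it takes the idealized case in which a second identical core doubles both area and throughput.

```python
def throughput_per_area(relative_throughput, relative_area):
    """Throughput gained per unit of die area, relative to a baseline of 1.0."""
    return relative_throughput / relative_area

baseline = throughput_per_area(1.00, 1.00)       # single-threaded core
multithreaded = throughput_per_area(1.60, 1.14)  # figures quoted for the 34K
dual_core = throughput_per_area(2.00, 2.00)      # assumed ideal scaling

print(f"single-threaded core: {baseline:.2f}")
print(f"multithreaded core:   {multithreaded:.2f}")
print(f"ideal dual core:      {dual_core:.2f}")
```

Even in its best case, the dual-core option only matches the baseline's throughput per unit area, while the multithreaded core improves on it, up to the point where the execution units saturate.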
While the means and the motives are different, multi-core systems
and multithreaded cores have a common requirement that concurrency in
the workload be expressed explicitly by software. If the system has
already been coded in terms of multiple tasks running on a
multi-tasking OS, there may be no more work to be done.
Monolithic, single-threaded applications need to be reworked and
decomposed either into sub-programs or explicit software threads. This
work must be done for both multithreaded and multi-core systems, and
once completed, either can exploit the exposed concurrency - another
reason why the two techniques are often confused, and something that
makes them highly complementary.
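As a rough illustration of what "expressing concurrency explicitly" means, the sketch below shows the same work written first as a monolithic loop and then as independent tasks handed to a pool of software threads. It shows program structure only: the `checksum` function and the packet data are hypothetical, and an embedded system would more likely express this with C and an RTOS or POSIX threads than with Python.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(packet: bytes) -> int:
    """Hypothetical per-packet work that does not depend on any other packet."""
    return sum(packet) % 65536

def process_monolithic(packets):
    """One instruction stream: packets are handled strictly one after another."""
    return [checksum(p) for p in packets]

def process_threaded(packets, workers=4):
    """The same work exposed as independent tasks a scheduler may run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results match the monolithic version.
        return list(pool.map(checksum, packets))

packets = [bytes([i] * 64) for i in range(8)]
print(process_monolithic(packets) == process_threaded(packets))
```

Once the work is in the second form, the same decomposition can be scheduled onto several cores, onto several hardware thread contexts of one core, or onto both.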
When is Multi-core a Good Idea?
For embedded SoC designs, a multi-core design
makes the most sense when the functions of the SoC decompose cleanly
into subsystems with a limited need for communication and coordination
between them.
Instead of running all code on a single, large, high-frequency core
connected to a single, large, high-bandwidth memory, a designer can
assign tasks to several simpler, slower cores. Code and data can then
be stored in per-processor memories, each of which has lower
requirements for both capacity and bandwidth. That normally translates
into power savings, and potentially into area savings as well, if the
lower bandwidth requirement allows physically smaller RAM cells to be
used.
If the concurrent functions of an SoC cannot be statically
decomposed at system design time, an alternative approach is to emulate
general-purpose computers and build a coherent SMP cluster of processor
cores. Within such a cluster, multiple processors are available as a
pool to run the available tasks, which are assigned to processors "on
the fly".
The price to be paid for this flexibility is that it requires a
sophisticated interconnect between the cores and a shared main memory,
and the shared main memory needs to be relatively large and
high-bandwidth. This negates the area and power advantages alluded to
above for functionally partitioned multi-core systems, but can still be
a good trade-off.
Every core represents additional die area, and even in a "powered
down" standby state, each core in a multi-core configuration consumes
some amount of leakage current, so the number of cores in an SoC design
should in general be kept to the minimum necessary to run the target
application. There is no point in building a multi-core design if the
problem can be handled by a single core within the system's design
constraints.
When is Multithreading a Good Idea?
Multithreading makes sense whenever an application with some degree of
concurrency is to be run on a processor that would otherwise find
itself stalled a significant portion of the time waiting for
instructions and operands. This is a function of core frequency, memory
technology, and program memory reference behavior.
Well-behaved real-world programs in a typical single-threaded SoC
processor/memory environment might be stalled as little as 30% of the
time at 500MHz, but less cache-friendly code may be stalled a whopping
75% of the time in the same environment. Systems where the speeds of
processor and memory are so well matched that there is no loss of
efficiency due to latency will not get any significant bandwidth
improvement from multithreading.
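The stall figures above suggest a simple upper bound on what multithreading can recover. The sketch below is an idealized model and an assumption, not a result from the article: it supposes that one thread's stalls can be perfectly overlapped with another thread's work, and it ignores cache interference between threads.

```python
def ideal_mt_speedup(stall_fraction, threads):
    """Upper bound on throughput gain from multithreading in an idealized model.

    A single thread keeps the pipeline busy for (1 - stall_fraction) of the
    time. N threads can raise utilization to at most N times that, capped at
    100%, which is the point where the execution units are saturated.
    """
    busy = 1.0 - stall_fraction
    if busy <= 0.0:
        raise ValueError("stall_fraction must be less than 1.0")
    return min(threads * busy, 1.0) / busy

for stall in (0.0, 0.30, 0.75):
    for n in (2, 4):
        print(f"stalled {stall:.0%}, {n} threads: "
              f"at most {ideal_mt_speedup(stall, n):.2f}x")
```

With no stalls the bound is 1.0, matching the observation that a well-matched processor and memory leave nothing for multithreading to recover. As the stall fraction grows, more thread contexts are needed before the pipeline saturates.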
Going Beyond Multi-Core
The additional resources of a multithreaded processor can be used for
things other than simply recovering lost bandwidth, if the
multithreading architecture provides for it. A multithreaded processor
can thus have capabilities that have no equivalent in a multi-core
system based on conventional processors.
For example, in a conventional processor, when an external interrupt
event needs to be serviced, the processor takes an interrupt exception,
in which instruction fetch and execution abruptly restart at an
exception vector. Interrupt vector code must save the current program state
before invoking the interrupt service code, and must restore the
program context before returning from the exception.
A multithreaded processor, by definition, can switch between two
program contexts in hardware, without the need for decoding an
exception or saving/restoring state in software. A multithreaded
architecture targeted for real-time applications can potentially
exploit this and allow for threads of execution to be suspended, then
unblocked directly by external signals to the core, providing for
zero-latency handling of interrupt events.
Multithreaded, Multi-core: The Best of Both Worlds
Arguably, from the standpoint of area and energy efficiency, the
optimal SoC processor solution would be to use multithreaded cores as
basic processing elements, and to replicate them in a multi-core
configuration if the application demands more performance than a single
core can provide.
Kevin D. Kissell is Principal Architect, MIPS Technologies Inc.