Controllers in today's network-connected embedded systems often are
overwhelmed by the data streaming to and from the various
I/O sources;
it can be difficult for the system's root complex to absorb high-speed
bursty traffic such as 10Gig Ethernet when it competes with very fast
streaming data from sources such as InfiniBand and
Fibre Channel (FC)
storage elements.
For example, when a few bytes of Ethernet data get stuck behind
large packets of FC data in the root complex, the latency that is
introduced by this congestion will severely impact system response time
and create bandwidth limitations (see
Table 1 below).
Table 1. Ethernet latency bandwidth tradeoffs
The next generation of PCI Express (PCIe) switches adds many
new features to mitigate the effects of having to process competing
data protocols, thereby improving overall system performance.
Advanced new features such as Read Pacing, enhanced port
configuration flexibility, dynamic buffer memory allocation, and the
deployment of PCIe Gen2 signaling are reducing I/O bottlenecks,
providing dramatic improvements in system performance in server and
storage controllers.
Performance Limited by "Endpoint Starvation"
When two or more endpoints are connected to a root complex through a
PCIe switch whose upstream and downstream link widths (and hence
bandwidths) are unbalanced, and the endpoints issue uneven numbers of
read requests, one endpoint inevitably dominates the bandwidth of the
root complex queue. The other endpoints suffer reduced performance as a
result. This is known as "endpoint starvation," and it can make the
system appear congested and not performing optimally.
Figure 1 below shows a
typical root complex connected to two endpoints through a PCIe switch.
In this example, there is an x8 upstream port and two x4 downstream
ports. The FC HBA is a good example of an endpoint that could dominate
the bandwidth of the root complex queues.
Here, the FC HBA makes several 2KB read requests, which are then
queued by the root complex until its queues are full.
Figure 1. Endpoint starvation
While the queues are full, the Ethernet NIC makes two 1KB read
requests. The NIC must wait for the root complex to service all of the
FC HBA's read requests before its own requests are serviced. Thus
the NIC is "starved."
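The scenario above can be sketched as a small first-in, first-out model. This is only an illustration: the count of four FC reads stands in for the article's "several," and the function name `bytes_ahead_of` is hypothetical. It counts how many bytes the root complex must return before it reaches the NIC's first request.

```python
from collections import deque

def bytes_ahead_of(queue, owner):
    """Bytes a FIFO root complex must serve before the first request from `owner`."""
    waited = 0
    for source, size in queue:
        if source == owner:
            return waited
        waited += size
    raise ValueError(f"no request from {owner!r} in the queue")

FC_READS = 4  # assumption: the article only says "several" 2KB reads

# The FC HBA's reads arrive first, then the NIC's two 1KB reads.
queue = deque([("FC HBA", 2048)] * FC_READS + [("Ethernet NIC", 1024)] * 2)

nic_wait = bytes_ahead_of(queue, "Ethernet NIC")
print(nic_wait)  # 8192 bytes of FC data are served before the NIC's first read
```

The NIC's wait grows linearly with the number of FC reads already queued, which is why a single busy endpoint can starve a lightly loaded one.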
Read Pacing "Feeds" the Starving Endpoint
Endpoint starvation is solved (and the endpoint is "fed") with a new
PCIe switch feature called Read Pacing, which is available on the
latest Gen 2 PCIe switches.
Read Pacing provides increased system performance with a more
balanced allocation of bandwidth to the downstream ports of the switch.
With Read Pacing, the switch can apply rules to prevent one port from
overwhelming the completion bandwidth or buffering in the system.
Figure 2 below shows the
same example, with an FC HBA and an Ethernet NIC on the downstream ports
of a switch that aggregates traffic into a root complex. The FC HBA
makes several 2KB read requests.
Figure 2. Read pacing eliminates endpoint starvation
With Read Pacing, the switch limits how many of the FC HBA's read
requests it forwards to the root complex at a time. Programmable
registers in the switch set this limit.
As the Ethernet NIC makes its two 1KB read requests, the switch
allows both read requests through, thus balancing the flow of data from
both endpoints. As shown in Figure 2,
a 2KB read for the FC HBA through the root complex is immediately
followed by two 1KB reads for the Ethernet NIC, resulting in balanced
traffic for each endpoint.
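The ordering described above can be reproduced with a simple round-based model. This is a sketch under stated assumptions, not the switch's actual arbitration logic: the per-port limits (one FC read and two NIC reads per round), the count of four FC reads, and the function name `paced_order` are all hypothetical stand-ins for whatever the programmable registers would hold.

```python
from collections import deque

def paced_order(pending, limits):
    """Forward reads round by round, at most limits[port] reads per port per round."""
    if any(limit < 1 for limit in limits.values()):
        raise ValueError("each port needs a limit of at least 1")
    queues = {port: deque(reads) for port, reads in pending.items()}
    forwarded = []
    while any(queues.values()):  # an empty deque is falsy
        for port, q in queues.items():
            for _ in range(min(limits[port], len(q))):
                forwarded.append((port, q.popleft()))
    return forwarded

# Assumed workload: four 2KB FC reads and two 1KB NIC reads.
pending = {"FC HBA": [2048] * 4, "Ethernet NIC": [1024] * 2}
# Assumed register settings: pace the FC HBA to one read per round.
limits = {"FC HBA": 1, "Ethernet NIC": 2}

order = paced_order(pending, limits)
for port, size in order:
    print(port, size)
```

Under these assumptions the first three forwarded requests are one 2KB FC read followed by the two 1KB NIC reads, so the NIC waits behind 2KB of FC data rather than all of it.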
Read Pacing allows the Ethernet NIC to be serviced more frequently
without impacting the bandwidth of the FC HBA. Hence, endpoint
starvation is eliminated. The chart below compares performance with and
without Read Pacing in a real-world system in which the FC HBA issues
sixteen 4KB read requests ahead of a single 1KB read request from the
Ethernet NIC.
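For a rough sense of scale, the following back-of-the-envelope estimate computes how long sixteen 4KB completions would take to cross the upstream link before the NIC's 1KB read is reached. It is not the measured data from the chart. It assumes an x8 PCIe Gen2 upstream link (5 GT/s per lane with 8b/10b encoding, or 500 MB/s per lane per direction) and ignores packet overhead and memory latency, so it is a lower bound; the function name `drain_time_ns` is hypothetical.

```python
# PCIe Gen2: 5 GT/s per lane with 8b/10b encoding = 500 MB/s per lane, per direction.
GEN2_BYTES_PER_SEC_PER_LANE = 500_000_000

def drain_time_ns(queued_bytes, lanes):
    """Lower-bound time in nanoseconds to move queued_bytes over a Gen2 link."""
    rate = lanes * GEN2_BYTES_PER_SEC_PER_LANE
    return queued_bytes * 1_000_000_000 // rate

fc_bytes = 16 * 4096                        # sixteen 4KB FC reads queued first
wait_ns = drain_time_ns(fc_bytes, lanes=8)  # assumed x8 upstream port
print(wait_ns)  # 16384 ns, about 16.4 microseconds before the NIC's read is served
```

If pacing let the NIC's read follow a single 4KB FC read instead, the same arithmetic gives 1024 ns, a sixteenfold reduction in this simplified model.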
Increase Performance by Optimizing Buffer Size Dynamically
Early PCIe switch architectures provided each port with a fixed amount
of buffer RAM. Figure 3 below compares
the fixed buffer allocation typical of older switch designs
with the new Dynamic Allocation scheme found in the latest Gen2
switches.
Figure 3. Dynamic allocation leads to more buffers
In this example, a six-port switch is designed with a total of 30
packet buffers, with five buffer segments available on each port. If
only four ports are used, then the buffers allocated to the two unused
ports are wasted.
Since a larger buffer will translate into better performance, it
would be nice if that unused memory could be used to increase the size
of the buffers on the four ports that are being used.
In the latest Gen2 switches, it is possible to do just that. This
feature is known as Dynamic Buffer Allocation, where a shared memory
pool is available to any port, and the size of the buffer is allocated
dynamically depending on the number of ports in use.
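The arithmetic behind the six-port example can be sketched as follows. The even split of the shared pool among active ports is an assumption made for illustration; a real switch may divide the pool differently, and both function names are hypothetical.

```python
def static_buffers_per_port(total_buffers, total_ports):
    """Fixed scheme: every port gets the same share whether or not it is used."""
    return total_buffers // total_ports

def dynamic_buffers_per_port(total_buffers, active_ports):
    """Shared pool split evenly among active ports; returns (per_port, spare)."""
    return divmod(total_buffers, active_ports)

static = static_buffers_per_port(30, 6)    # 5 buffers per port, 10 wasted on 2 idle ports
dynamic = dynamic_buffers_per_port(30, 4)  # (7, 2): 7 per active port, 2 left in the pool
print(static, dynamic)
```

With four of six ports in use, each active port goes from five buffers to seven under this assumed even split, and the two remaining buffers stay available in the shared pool.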