CMP EMBEDDED.COM


Using nextgen PCI Express switches to eliminate network I/O bottlenecks



Embedded.com
Controllers in today's network-connected embedded systems are often overwhelmed by the data streaming to and from their various I/O sources. It can be difficult for the system's root complex to absorb high-speed, bursty traffic such as 10 Gigabit Ethernet when it competes with very fast streaming data from sources such as InfiniBand and Fibre Channel (FC) storage elements.

For example, when a few bytes of Ethernet data get stuck behind large packets of FC data in the root complex, the latency introduced by this congestion severely degrades system response time and limits bandwidth (see Table 1 below).

Table 1. Ethernet latency bandwidth tradeoffs

The next generation of PCI Express (PCIe) switches has added many new features to mitigate the effects of having to process competing data protocols, thereby improving overall system performance.

Advanced new features such as Read Pacing, enhanced port configuration flexibility, dynamic buffer memory allocation, and the deployment of PCIe Gen2 signaling are reducing I/O bottlenecks, providing dramatic improvements in system performance in server and storage controllers.

Performance Limited by "Endpoint Starvation"
When two or more endpoints are connected to a root complex through a PCIe switch, the upstream and downstream link widths (and hence bandwidths) are unbalanced, and the endpoints issue uneven numbers of read requests, one endpoint inevitably dominates the bandwidth of the root complex queue. The other endpoints suffer reduced performance as a result. This is known as "endpoint starvation," and it can make the system appear congested and underperforming.

Figure 1 below shows a typical root complex connected to two endpoints through a PCIe switch. In this example, there is a x8 upstream port and two x4 downstream ports. The FC HBA is a good example of an endpoint that could dominate the bandwidth of the root complex queues.

In this example, the FC HBA makes several 2KB read requests, which are then queued by the root complex, filling up the queues in the root complex.

Figure 1. Endpoint starvation

While the queues are full, the Ethernet NIC makes two 1KB read requests. The Ethernet NIC must wait for the root complex to service all of the FC HBA's read requests before its own requests are serviced. Thus the NIC is "starved."
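The starvation scenario above can be sketched as a simple first-in, first-out queue. This is only an illustrative model, not a description of any particular root complex: the function name `fifo_service_order` is hypothetical, and the burst of eight FC reads is an assumption, since the article says only "several."

```python
from collections import deque

KB = 1024

def fifo_service_order(requests):
    """Service read requests strictly in arrival order, as one shared queue would.

    `requests` is a list of (source, size_bytes) tuples in arrival order.
    Returns (source, size_bytes, bytes_queued_ahead) for each request.
    """
    queue = deque(requests)
    served = []
    bytes_served = 0
    while queue:
        source, size = queue.popleft()
        served.append((source, size, bytes_served))
        bytes_served += size
    return served

# Assumption: "several" FC reads is modeled as a burst of 8.
FC_READS = 8
arrivals = [("FC HBA", 2 * KB)] * FC_READS + [("Ethernet NIC", 1 * KB)] * 2

for source, size, ahead in fifo_service_order(arrivals):
    if source == "Ethernet NIC":
        print(f"{source}: {size // KB}KB read waits behind {ahead // KB}KB of data")
```

In this model, the NIC's small reads are delayed by the full size of the FC burst, no matter how little data the NIC itself requested.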

Read Pacing "Feeds" the Starving Endpoint
Endpoint starvation is solved (and the endpoint is "fed") with a new PCIe switch feature called Read Pacing, which is available on the latest Gen2 PCIe switches.

Read Pacing provides increased system performance with a more balanced allocation of bandwidth to the downstream ports of the switch. With Read Pacing, the switch can apply rules to prevent one port from overwhelming the completion bandwidth or buffering in the system.

Figure 2 below shows the same example, with an FC HBA and an Ethernet NIC on the downstream ports of a switch that aggregates traffic into a root complex. The FC HBA makes several 2KB read requests.

Figure 2. Read pacing eliminates endpoint starvation

With Read Pacing, the switch limits how many of the FC HBA's read requests are forwarded to the root complex at a time. The limit is set through programmable registers in the switch.

As the Ethernet NIC makes its two 1KB read requests, the switch allows both read requests through, thus balancing the flow of data from both endpoints. As shown in Figure 2, a 2KB read for the FC HBA through the root complex is immediately followed by two 1KB reads for the Ethernet NIC, resulting in balanced traffic for each endpoint.
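The forwarding pattern described here can be approximated with per-port queues and a per-port forwarding limit. This is a simplified sketch rather than the switch's actual algorithm: a real switch would pace against outstanding completions, whereas this model works in discrete rounds. The function `paced_forwarding`, the `limits` values (standing in for the programmable registers), and the burst of eight FC reads are all assumptions made for illustration.

```python
from collections import deque

KB = 1024

def paced_forwarding(pending, limits):
    """Forward reads from per-port queues to the root complex in rounds.

    `pending` maps port name -> list of read sizes in bytes (arrival order).
    `limits` maps port name -> max reads forwarded from that port per round.
    Returns the order in which reads reach the root complex.
    """
    queues = {port: deque(sizes) for port, sizes in pending.items()}
    forwarded = []
    while any(queues.values()):          # an empty deque is falsy
        for port, queue in queues.items():
            for _ in range(limits[port]):
                if not queue:
                    break
                forwarded.append((port, queue.popleft()))
    return forwarded

pending = {
    "FC HBA": [2 * KB] * 8,              # assumed burst size
    "Ethernet NIC": [1 * KB] * 2,
}
limits = {"FC HBA": 1, "Ethernet NIC": 2}  # assumed register settings

order = paced_forwarding(pending, limits)
for port, size in order[:4]:
    print(f"{port}: {size // KB}KB read")
```

With these assumed limits, the first 2KB FC read is followed immediately by both 1KB NIC reads, after which the remaining FC reads proceed, matching the ordering described for Figure 2.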

Read Pacing allows the Ethernet NIC to be serviced more frequently without impacting the bandwidth of the FC HBA. Hence, endpoint starvation is eliminated with Read Pacing. The chart below compares performance with and without Read Pacing in a real-world system, where the FC HBA issues sixteen 4KB read requests ahead of a single 1KB read request from the Ethernet NIC.

Increase Performance by Optimizing Buffer Size Dynamically
Early PCIe switch architectures provided each port with a fixed amount of buffer RAM. Figure 3 below compares the fixed buffer allocation typical of older switch designs with the new Dynamic Allocation scheme found in the latest Gen2 switches.

Figure 3. Dynamic allocation leads to more buffers

In this example, a six-port switch is designed with a total of 30 packet buffers, with five buffer segments available on each port. If only four ports are used, then the buffers allocated to the two unused ports are wasted.

Since a larger buffer translates into better performance, it would be preferable to use that idle memory to enlarge the buffers on the four ports that are in use.

In the latest Gen2 switches, it is possible to do just that. This feature is known as Dynamic Buffer Allocation, where a shared memory pool is available to any port, and the size of the buffer is allocated dynamically depending on the number of ports in use.
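The arithmetic behind the six-port example can be worked through in a few lines. The even split across active ports is an assumption made for illustration; the article says only that buffer size depends on the number of ports in use, and an actual switch may divide the pool differently. Both function names are hypothetical.

```python
def fixed_allocation(total_buffers, total_ports, active_ports):
    """Each port owns an equal, fixed share whether or not it is used."""
    per_port = total_buffers // total_ports
    wasted = per_port * (total_ports - active_ports)
    return per_port, wasted

def dynamic_allocation(total_buffers, active_ports):
    """Assumed policy: split the shared pool evenly across active ports."""
    per_port, spare = divmod(total_buffers, active_ports)
    return per_port, spare

TOTAL_BUFFERS, TOTAL_PORTS, ACTIVE_PORTS = 30, 6, 4  # figures from the example

fixed_per_port, wasted = fixed_allocation(TOTAL_BUFFERS, TOTAL_PORTS, ACTIVE_PORTS)
dyn_per_port, spare = dynamic_allocation(TOTAL_BUFFERS, ACTIVE_PORTS)

print(f"Fixed:   {fixed_per_port} buffers per port, {wasted} wasted")
print(f"Dynamic: {dyn_per_port} buffers per port, {spare} left in the pool")
```

Under this even-split assumption, each active port grows from five buffers to seven, and the ten buffers that sat idle on the unused ports are put to work.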
