AN 456: PCI Express High Performance Reference Design

ID 683541
Date 12/12/2018
Public

1.1.3. Throughput for Reads

PCI Express uses a split-transaction for reads. A requester first sends a memory read request. The completer then sends an ACK DLLP to acknowledge the memory read request. It subsequently returns a completion data that can be split into multiple completion packets.
Read throughput is somewhat lower than write throughput because the data for the read completions may be split into multiple packets rather than being returned in a single packet. The following example illustrates this point. This example uses a read request for 512 bytes and a completion packet size of 256 bytes. The maximum possible throughput is calculated as follows:

Number of completion packets = 512/256 = 2

Overhead for a 3 dword TLP Header with no ECRC = 2*20 = 40 bytes

Maximum Throughput % = 512/(512 + 40) = 92%.

These calculations do not take into account any DLLPs and PLPs. The PCI Express Base Specification defines a read completion boundary (RCB) parameter. The RCB parameter determines the naturally aligned address boundaries on which a read request may be serviced with multiple completions. For a root complex, the RCB is either 64 bytes or 128 bytes. For all other PCI Express devices, the RCB is 128 bytes.

Note: A non-aligned read request may experience a further throughput reduction.

Read throughput depends on the round-trip delay between the following two times:

  • The time when the application logic issues a read request
  • The time when all of the completion data has been returned.

To maximize throughput, the application must issue enough read requests and process enough read completions. Or, the application must issue enough non-posted header credits to cover this delay.

The following figure shows timing diagram for memory read requests (MRd) and completions (CplD). The requester waits for a completion before making a subsequent read request, resulting in lower throughput.

Figure 3. Low Performance Reads Request Timing Diagram

The following timing diagram eliminates the delay for completions with the exception of the first read. This strategy maintains a high throughput.

Figure 4. High Performance Read Request TimingDiagram

The requester must maintain maximum throughput for the completion data packets by selecting appropriate settings for completions in the RX buffer. All versions of Altera’s PCIe IP cores offer five settings for the RX Buffer credit allocation performance for requests parameter. This parameter specifies the distribution of flow control header, data, and completion credits in the RX buffer. You should use this parameter to allocate credits to optimize for the anticipated workload.

A final constraint on the throughput is the number of outstanding read requests supported. The outstanding requests are limited by the number of header tags and the maximum read request size. The maximum read request size is controlled by the device control register (bits 14:12) in the PCIe Configuration Space. The Application Layer assign header tags to non-posted requests to identify completions data. The Number of tags supported parameter specifies number of tags available. A minimum number of tags are required to maintain sustained read throughput. This number is system dependent. On a Windows system, eight tags are usually enough to ensure continuous read completion with no gap for a 4 KByte read request. The High Performance Request Timing Diagram uses 4 tags. The first tag is reused for the fifth read.