1.7.2. Throughput for Reads
PCI Express uses a split transaction model for reads. The read transaction includes the following steps:
- The requester sends a Memory Read Request.
- The completer sends an ACK DLLP to acknowledge the Memory Read Request.
- The completer returns a Completion with Data. The completer can split the Completion into multiple completion packets.
Read throughput is typically lower than write throughput because a read requires two transactions, the request and the completion, whereas a write moves the same amount of data in a single transaction. Read throughput also depends on the round-trip delay between the time the Application Layer issues a Memory Read Request and the time the requested data returns. To maximize throughput, the application must issue enough outstanding read requests to cover this delay.
The figures below show the timing for Memory Read Requests (MRd) and Completions with Data (CplD). The first figure shows the requester waiting for the completion before issuing the subsequent requests. Waiting results in lower throughput. The second figure shows the requester making multiple outstanding read requests to eliminate the delay after the first data returns. Eliminating delays results in higher throughput.
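The effect of waiting versus issuing multiple outstanding requests can be estimated with a simple model. The following is a minimal sketch, not a characterization of any device: it assumes a fixed round-trip delay from request to first completion data and a fixed usable link rate. The function names and the example numbers (512-byte requests, 1 us round trip, 4 GB/s link) are illustrative assumptions.

```python
import math


def serialized_throughput(request_bytes, round_trip_s, link_bytes_per_s):
    """Bytes/s when each read waits for the previous completion to finish."""
    # Time the completion data occupies the link.
    transfer_s = request_bytes / link_bytes_per_s
    # Each request costs one full round trip plus its own data transfer.
    return request_bytes / (round_trip_s + transfer_s)


def reads_to_cover_delay(request_bytes, round_trip_s, link_bytes_per_s):
    """Smallest number of outstanding reads that keeps completions back to back."""
    transfer_s = request_bytes / link_bytes_per_s
    # N transfers must span one round trip plus one transfer.
    return math.ceil((round_trip_s + transfer_s) / transfer_s)


if __name__ == "__main__":
    # Assumed values for illustration only.
    size, delay, rate = 512, 1e-6, 4e9
    print(serialized_throughput(size, delay, rate))
    print(reads_to_cover_delay(size, delay, rate))
```

Under these assumed numbers, a requester that waits for each completion uses only a fraction of the link rate, while nine outstanding reads are enough to hide the round-trip delay.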
To maintain maximum throughput for the completion data packets, the requester must optimize the following settings:
- The number of completions in the RX buffer
- The rate at which the Application Layer issues read requests and processes the completion data
Read Request Size
Another factor that affects throughput is the read request size. If a requester requires 4 KB of data, it can issue four 1 KB read requests or a single 4 KB read request. The single 4 KB request results in higher throughput than the four 1 KB reads. The Maximum Read Request Size field in the Device Control register, bits [14:12], specifies the largest read request that the requester can issue.
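As an illustration, the Maximum Read Request Size field can be decoded from a Device Control register value as shown in the sketch below. The PCI Express encoding is 128 bytes shifted left by the field value, with encodings 6 and 7 reserved. The function names are hypothetical, and the request count ignores address alignment (a real requester must also avoid crossing 4 KB address boundaries).

```python
def max_read_request_bytes(device_control):
    """Decode Max Read Request Size from Device Control bits [14:12]."""
    field = (device_control >> 12) & 0x7
    if field > 5:
        # Encodings 110b and 111b are reserved.
        raise ValueError("reserved Max Read Request Size encoding")
    # 000b = 128 B, 001b = 256 B, ... 101b = 4096 B.
    return 128 << field


def read_requests_needed(total_bytes, device_control):
    """Number of maximum-size reads needed to fetch total_bytes (alignment ignored)."""
    size = max_read_request_bytes(device_control)
    return -(-total_bytes // size)  # ceiling division


if __name__ == "__main__":
    # Field = 011b (1 KB): four requests are needed for 4 KB.
    print(read_requests_needed(4096, 0x3000))
    # Field = 101b (4 KB): a single request is enough.
    print(read_requests_needed(4096, 0x5000))
```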
Outstanding Read Requests
A final factor that can affect the throughput is the number of outstanding read requests. If the requester sends multiple read requests to improve throughput, the number of available header tags limits the number of outstanding read requests. To achieve higher performance, Intel® Arria® 10 and Intel® Cyclone® 10 GX read DMA can use up to 16 header tags. The Intel® Stratix® 10 read DMA can use up to 32 header tags.
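The tag limit puts a ceiling on the amount of read data that can be in flight, and therefore on throughput. The sketch below estimates that ceiling for the 16-tag and 32-tag cases mentioned above. The round-trip delay, request size, and link rate are assumed values for illustration only, and the function name is hypothetical.

```python
def tag_limited_throughput(tags, request_bytes, round_trip_s, link_bytes_per_s):
    """Upper bound on read bytes/s when at most `tags` reads are outstanding."""
    # Time the completion data for one request occupies the link.
    transfer_s = request_bytes / link_bytes_per_s
    # At most tags * request_bytes can be delivered per round trip plus transfer.
    in_flight_rate = tags * request_bytes / (round_trip_s + transfer_s)
    # The link rate remains the hard limit.
    return min(link_bytes_per_s, in_flight_rate)


if __name__ == "__main__":
    # Assumed: 256-byte requests, 2 us round trip, 8 GB/s usable link rate.
    for tags in (16, 32):
        print(tags, tag_limited_throughput(tags, 256, 2e-6, 8e9))
```

In this model, doubling the number of tags doubles the achievable read throughput until the link rate itself becomes the limit. A larger read request size raises the ceiling in the same way.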