Visible to Intel only — GUID: vua1506229611776
Ixiasoft
1.2.3. Memory and Cache Hierarchy
The CCI-P protocol provides a cache hint mechanism that advanced AFU developers can use to tune for performance. This section describes the memory and cache hierarchy for both the Intel® FPGA PAC and the Integrated FPGA Platform. The control mechanisms that CCI-P provides are discussed in the "Intel® FPGA PAC" and "Integrated FPGA Platform" sections below.
Intel® FPGA PAC
The Intel® FPGA PAC includes two types of memory:
- Processor Synchronous Dynamic Random Access Memory (SDRAM), referred to as host memory
- FPGA-attached SDRAM, referred to as local memory
AFU requests that target CPU memory over PCIe can be serviced by the processor-side cache, as shown in Figure 4.
- A read request that hits the processor-side cache (A.2) has lower latency than reading from the SDRAM.
- A write request hint can instruct the Last Level Cache how to treat the written data (for example, whether it is cacheable or non-cacheable, and its locality).
If a request misses the Last Level Cache, it can be serviced by the SDRAM.
For more information, refer to the WrPush_I request in the CCI-P protocol definition.
Integrated FPGA Platform
- FPGA Cache (A.1)—The Intel UPI coherent link extends the Intel® Xeon® processor’s coherency domain to the FPGA cache. Requests that hit in the FPGA cache have the lowest latency and the highest bandwidth. AFU requests that use the VL0 virtual channel, and VA requests that are steered to the UPI path, look up the FPGA cache first; only upon a miss are they sent off the chip to the processor.
- Processor-side cache (A.2)—A read request that hits the processor-side cache has higher latency than the FPGA cache, but lower latency than reading from the processor SDRAM. A write request hint can be used to direct the write to the processor-side cache. For more information, refer to the WrPush_I request in the CCI-P protocol definition.
- Processor SDRAM (A.3)—A request that misses the processor-side cache is serviced by the SDRAM.
The data access latencies increase from (A.1) to (A.3).
One limitation of the VC steering logic is that it does not factor cache locality into the steering decision. The VC steering decision is made before the cache lookup. This means a request can be steered to VH0 or VH1 even though the cache line resides in the FPGA cache. Such a request may incur an additional latency penalty, because the processor may have to snoop the FPGA cache to complete the request. If the AFU knows the locality of its accesses, it may be beneficial to use the VL0 virtual channel to exploit the cache locality.