1.3.4. UMsg
UMsg provides the same functionality as a spin loop in the AFU, without consuming CCI-P read bandwidth. Think of it as a spin-loop optimization in which a monitoring agent inside the FPGA cache controller watches for snoops to cache lines allocated by the driver. When it sees a snoop to one of these cache lines, it reads the data back and sends a UMsg to the AFU.
The UMsg flow uses the cache coherency protocol to implement a high-speed, unordered messaging path from the CPU to the AFU. The process consists of two stages, as shown in Figure 8.
The first stage is initialization, in which software pins the UMsg Address Space (UMAS) and shares the UMAS start address with the FPGA cache controller. The FPGA cache controller then reads each cache line in the UMAS and holds it in the FPGA cache in the shared state.
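The initialization stage can be pictured with a small software model. This is only a sketch under stated assumptions: `FpgaCacheModel`, `init_umas`, the base address `0x1000`, and the 64-byte line size are hypothetical names and values chosen for illustration, not part of CCI-P or any driver API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CL_BYTES = 64   # assumed cache line size in bytes
SHARED = "S"    # label for the shared coherency state


@dataclass
class FpgaCacheModel:
    """Hypothetical stand-in for the FPGA cache controller, not a real API."""

    umas_base: Optional[int] = None
    lines: Dict[int, str] = field(default_factory=dict)

    def init_umas(self, base: int, num_lines: int) -> None:
        # Software shares the start address of the pinned UMAS.
        self.umas_base = base
        # The controller reads each UMAS cache line and holds it as shared.
        for i in range(num_lines):
            self.lines[base + i * CL_BYTES] = SHARED


cache = FpgaCacheModel()
cache.init_umas(base=0x1000, num_lines=4)
print([hex(addr) for addr in sorted(cache.lines)])
```

After `init_umas` returns, every line of the modeled UMAS is present in the cache in the shared state, which is the precondition for a later CPU write to generate a snoop that the monitoring agent can observe.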
Functionally, UMsg is equivalent to a spin loop or to a MONITOR and MWAIT instruction pair on an Intel Xeon processor. UMsgs have the following key characteristics:
- Just as spin loops to different addresses in a multi-threaded application have no relative ordering guarantee, UMsgs to different addresses have no ordering guarantee between them.
- Not every CPU write to a UMAS cache line (CL) results in a corresponding UMsg. The AFU may miss an intermediate change in the value of a CL, but it is guaranteed to see the newest data in the CL. Again, it helps to think of this as a spin loop: if the producer thread updates the flag CL multiple times, the polling thread may miss an intermediate value, but it is guaranteed to see the newest value.
- UMsg uses the FPGA cache. As a result, it can cause cache pollution, a situation in which a program unnecessarily loads data into the cache and evicts other needed data, degrading performance.
- Because the CPU may exhibit false snooping, treat UMsgH as a hint only. You can start speculative execution or a prefetch based on UMsgH, but wait for the UMsg before committing the results.
- UMsg provides the same latency as AFU read polling using RdLine_S, but it saves CCI-P channel bandwidth, which can then be used for read traffic.
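Two of the characteristics above, coalesced writes and hint-then-commit handling, can be illustrated with a toy software model. This is a minimal sketch, not the hardware behavior or a real API: `UmsgLineModel`, `agent_poll`, `afu_handle`, and the `("hint", ...)` and `("umsg", ...)` event tuples are hypothetical names, and the model assumes the monitoring agent reads the line only after all three writes have landed.

```python
class UmsgLineModel:
    """Toy model of one UMAS cache line and its monitoring agent."""

    def __init__(self):
        self.value = 0        # current contents of the cache line
        self.snooped = False  # set when a CPU write snoops the FPGA copy

    def cpu_write(self, value):
        # Each CPU write updates the line and snoops the FPGA cache.
        self.value = value
        self.snooped = True

    def agent_poll(self):
        """Return the newest value as a UMsg if a snoop was seen, else None."""
        if not self.snooped:
            return None
        self.snooped = False
        return self.value


def afu_handle(events):
    """Start speculative work on 'hint' events, commit only on 'umsg' events."""
    committed = []
    speculating = False
    for kind, data in events:
        if kind == "hint":
            speculating = True      # prefetch or speculate, nothing is final
        elif kind == "umsg":
            committed.append(data)  # data has arrived, safe to commit
            speculating = False
    return committed


line = UmsgLineModel()
for v in (1, 2, 3):
    line.cpu_write(v)

first = line.agent_poll()    # newest value only; 1 and 2 were never delivered
second = line.agent_poll()   # no write since the last UMsg
committed = afu_handle([("hint", None), ("umsg", first), ("hint", None)])
print(first, second, committed)
```

In this model the three writes produce a single UMsg carrying the newest value, and the trailing hint with no following UMsg commits nothing, which mirrors the advice to treat UMsgH as a hint only.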