12.1. Using OPAE
$ sudo fpgainfo bmc

The Intel® FPGA PAC N3000-N/2 provides protective circuitry that automatically shuts down key board power supplies when a critical board sensor crosses its fatal threshold. The critical board sensors are listed below:
Sensor ID | Sensor | Upper Fatal Threshold | Upper Warning Threshold | Lower Fatal Threshold | Lower Warning Threshold |
---|---|---|---|---|---|
12 | FPGA Core Temperature | 100°C | 90°C | X | X |
13 | Board Temperature | 100°C | 85°C | X | X |
25 | 12 V Aux Voltage | X | X | 10.56 V | 11.40 V |
3 | 12 V Backplane Voltage | X | X | 10.56 V | 11.40 V |
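
As an illustration of how the table reads, the following minimal Python sketch classifies a single reading as normal, warning, or fatal. The `THRESHOLDS` dictionary and the `classify` function are hypothetical names introduced here and are not part of OPAE. The sketch assumes that an "X" cell means no threshold applies and that a reading equal to a threshold counts as having reached it.

```python
# Thresholds copied from the table above; None stands for the "X" cells.
# Field names are illustrative only and do not mirror any OPAE structure.
THRESHOLDS = {
    12: {"name": "FPGA Core Temperature", "upper_fatal": 100.0,
         "upper_warn": 90.0, "lower_fatal": None, "lower_warn": None},
    13: {"name": "Board Temperature", "upper_fatal": 100.0,
         "upper_warn": 85.0, "lower_fatal": None, "lower_warn": None},
    25: {"name": "12 V Aux Voltage", "upper_fatal": None,
         "upper_warn": None, "lower_fatal": 10.56, "lower_warn": 11.40},
    3: {"name": "12 V Backplane Voltage", "upper_fatal": None,
        "upper_warn": None, "lower_fatal": 10.56, "lower_warn": 11.40},
}


def classify(sensor_id: int, value: float) -> str:
    """Return "fatal", "warning", or "normal" for one sensor reading."""
    t = THRESHOLDS[sensor_id]
    # Assumption: reaching a threshold exactly is treated like crossing it.
    if t["upper_fatal"] is not None and value >= t["upper_fatal"]:
        return "fatal"
    if t["lower_fatal"] is not None and value <= t["lower_fatal"]:
        return "fatal"
    if t["upper_warn"] is not None and value >= t["upper_warn"]:
        return "warning"
    if t["lower_warn"] is not None and value <= t["lower_warn"]:
        return "warning"
    return "normal"
```

For example, `classify(12, 92.5)` returns `"warning"` and `classify(3, 10.5)` returns `"fatal"`.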
The fpgad daemon periodically reads the sensor values. If a value crosses the warning threshold set in fpgad.conf or the hardware-defined warning threshold, fpgad masks the PCIe Advanced Error Reporting (AER) registers for the Intel® FPGA PAC to avoid a system reset.
$ sudo systemctl start fpgad
The configuration file only includes threshold settings for the critical sensors 12 V Aux Voltage (sensor 25) and 12 V Backplane Voltage (sensor 3). These sensors do not have a hardware-defined warning threshold, so fpgad relies on the configuration file. The other two critical sensors, FPGA Core Temperature (sensor 12) and Board Temperature (sensor 13), have hardware-defined warning and fatal thresholds set to the values listed in the table above. The fpgad daemon uses this information to mask the PCIe AER registers when a sensor reaches its warning threshold.
"fpgad-vc": {
  "configuration": {
    "cool-down": 30,
    "config-sensors-enabled": true,
    "sensors": [
      { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
    ]
  },
  "enabled": true,
  "plugin": "libfpgad-vc.so",
  "devices": [
    [ "0x8086", "0x0b30" ],
    [ "0x8086", "0x0b31" ]
  ]
}
"fpgad-vc": {
  "configuration": {
    "cool-down": 30,
    "config-sensors-enabled": true,
    "sensors": [
      { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
    ]
  },
  "enabled": true,
  "plugin": "libfpgad-vc.so",
  "devices": [
    [ "0x8086", "0x0b30" ],
    [ "0x8086", "0x0b31" ]
  ]
},
"fpgad-vc": {
  "configuration": {
    "cool-down": 30,
    "config-sensors-enabled": true,
    "sensors": [
      { "id": 3, "low-warn": 11.40, "low-fatal": 10.56 }
    ]
  },
  "enabled": true,
  "plugin": "libfpgad-vc.so",
  "devices": [
    [ "0x8086", "0x0b30" ],
    [ "0x8086", "0x0b31" ]
  ]
}
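
Before restarting the daemon, it can help to sanity-check an edited sensor entry. The following Python sketch is one possible way to do that and is not an OPAE tool: `FRAGMENT` and `sensor_thresholds` are hypothetical names, and the sketch assumes the fragment has the shape shown above and that a valid entry has its `low-fatal` value below its `low-warn` value.

```python
import json

# A fragment in the shape shown above. The real configuration file wraps
# this section in a larger JSON document that is not reproduced here.
FRAGMENT = """
"fpgad-vc": {
  "configuration": {
    "cool-down": 30,
    "config-sensors-enabled": true,
    "sensors": [
      { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
    ]
  },
  "enabled": true,
  "plugin": "libfpgad-vc.so",
  "devices": [ [ "0x8086", "0x0b30" ], [ "0x8086", "0x0b31" ] ]
}
"""


def sensor_thresholds(fragment: str) -> dict:
    """Map sensor id -> (low-warn, low-fatal) for one fpgad-vc fragment."""
    # Wrap the fragment in braces so it parses as a standalone JSON object.
    doc = json.loads("{" + fragment + "}")
    result = {}
    for entry in doc["fpgad-vc"]["configuration"]["sensors"]:
        warn, fatal = entry["low-warn"], entry["low-fatal"]
        # Assumption: the fatal limit must lie below the warning limit.
        if fatal >= warn:
            raise ValueError(f"sensor {entry['id']}: low-fatal must be below low-warn")
        result[entry["id"]] = (warn, fatal)
    return result


print(sensor_thresholds(FRAGMENT))
```

Python's `json` module is strict: a trailing comma after the last element of the `sensors` array raises `json.JSONDecodeError`, so this check also catches that kind of editing slip.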
$ tail -f /var/lib/opae/fpgad.log | grep "sensor.*warning"
fpgad-vc: sensor 'FPGA Die Temperature' warning
When a sensor reaches its warning threshold, the daemon masks the AER registers and the log file indicates that the sensor is tripped. You must take appropriate action to recover from the warning before the sensor value reaches the upper or lower fatal limit.
Sample output: Warning message when the FPGA Core Temperature exceeds the upper warning threshold limit.
$ tail -f /var/lib/opae/fpgad.log
fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0
fpgad-vc: sensor 'FPGA Die Temperature' still tripped.
fpgad-vc: sensor '12V AUX Voltage' warning.
fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0
fpgad-vc: sensor '12V AUX Voltage' still tripped.
fpgad-vc: sensor '12V AUX Voltage' still tripped.
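
To summarize which sensors are reporting trouble in a longer log, the sample lines above can be scanned with a small script. The Python sketch below is illustrative only: `SENSOR_LINE` and `tripped_sensors` are hypothetical names, and the pattern assumes the log uses exactly the `warning` and `still tripped` wording seen in the sample output.

```python
import re

# Matches lines such as: fpgad-vc: sensor '12V AUX Voltage' still tripped.
SENSOR_LINE = re.compile(
    r"fpgad-vc: sensor '(?P<name>[^']+)' (?P<state>warning|still tripped)"
)


def tripped_sensors(lines):
    """Count, per sensor name, the lines reporting a warning or a trip."""
    counts = {}
    for line in lines:
        match = SENSOR_LINE.search(line)
        if match:
            name = match.group("name")
            counts[name] = counts.get(name, 0) + 1
    return counts


sample = [
    "fpgad-vc: sensor '12V AUX Voltage' warning.",
    "fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0",
    "fpgad-vc: sensor '12V AUX Voltage' still tripped.",
    "fpgad-vc: sensor 'FPGA Die Temperature' still tripped.",
]
print(tripped_sensors(sample))
```

Lines that only record saved AER register values do not match the pattern and are skipped.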
If the upper or lower fatal threshold limit is reached, a power cycle of the server is required to recover the Intel® FPGA PAC N3000-N/2.
The fpgad daemon unmasks AER once the sensor values return to the normal range, that is, above the lower warning threshold or below the upper warning threshold, whichever applies to the sensor.
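
The mask and unmask behavior described in this section amounts to a two-state transition. The Python sketch below models only that transition and is not fpgad's implementation: `AerMaskTracker` is a hypothetical name, and the sketch ignores the `cool-down` setting, whose exact effect is not described here.

```python
from typing import Optional


class AerMaskTracker:
    """Track whether AER should be masked, based on the warning state."""

    def __init__(self) -> None:
        self.masked = False

    def update(self, in_warning: bool) -> Optional[str]:
        """Return "mask" or "unmask" on a state change, otherwise None."""
        if in_warning and not self.masked:
            self.masked = True
            return "mask"
        if not in_warning and self.masked:
            self.masked = False
            return "unmask"
        # No change: still normal, or still tripped.
        return None


tracker = AerMaskTracker()
actions = [tracker.update(state) for state in (False, True, True, False)]
print(actions)
```

A reading that stays in the warning range after the first trip produces no new action, which corresponds to the repeated "still tripped" lines in the log.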
fpgad-vc: failed to read sensor xx
$ sudo systemctl stop fpgad.service
$ sudo systemctl status fpgad.service
$ sudo systemctl enable fpgad.service
$ systemctl -h