Intel® Acceleration Stack User Guide: Intel® FPGA Programmable Acceleration Card N3000-N/2

ID 683362
Date 11/01/2021
Public

12.1. Using OPAE

The fpgad service helps protect the server from crashing when the hardware reaches an upper or lower non-recoverable sensor threshold (also called a fatal threshold). The fpgad can monitor each of the 20 sensors reported by the Board Management Controller. To display the sensors reported by the Board Management Controller, run:
$ sudo fpgainfo bmc
The Intel® FPGA PAC N3000-N/2 provides protective circuitry that automatically shuts down key board power supplies when a critical board sensor passes its fatal threshold. The critical board sensors are listed below:
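Table 10.  Critical Sensors

| Sensor ID | Sensor                 | Upper Fatal Threshold | Upper Warning Threshold | Lower Fatal Threshold | Lower Warning Threshold |
|-----------|------------------------|-----------------------|-------------------------|-----------------------|-------------------------|
| 12        | FPGA Core Temperature  | 100°C                 | 90°C                    | X                     | X                       |
| 13        | Board Temperature      | 100°C                 | 85°C                    | X                     | X                       |
| 25        | 12 V Aux Voltage       | X                     | X                       | 10.56 V               | 11.40 V                 |
| 3         | 12 V Backplane Voltage | X                     | X                       | 10.56 V               | 11.40 V                 |
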
For more information about sensors, refer to the Board Management Controller User Guide: Intel® FPGA Programmable Acceleration Card N3000-N/2.
Note: Qualified OEM server systems should provide the required cooling for your workloads. Therefore, using fpgad may be optional.
When the opae-tools-extra-1.3.7-5.x86_64.rpm package is installed, fpgad is placed in the OPAE binaries directory (default: /usr/bin). The configuration file, fpgad.cfg, is located at /etc/opae. The log file, fpgad.log, which records fpgad actions, is located at /var/lib/opae/.

The fpgad periodically reads the sensor values. If a value passes the warning threshold stated in fpgad.cfg or the hardware-defined warning threshold, fpgad masks the PCIe Advanced Error Reporting (AER) registers for the Intel® FPGA PAC to avoid a system reset.
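The threshold check described above can be sketched in shell as follows. This is only an illustration, not the fpgad implementation: the function name classify_sensor is hypothetical, "X" stands for an undefined threshold as in Table 10, and treating a reading exactly equal to a threshold as tripped is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of OPAE.
# Usage: classify_sensor VALUE LOW_FATAL LOW_WARN HIGH_WARN HIGH_FATAL
# Pass "X" for any threshold that is not defined for the sensor.
# Prints one of: fatal, warning, normal.
classify_sensor() {
    awk -v v="$1" -v lf="$2" -v lw="$3" -v hw="$4" -v hf="$5" 'BEGIN {
        # Fatal limits are checked first because they take priority.
        if (lf != "X" && v + 0 <= lf + 0) { print "fatal"; exit }
        if (hf != "X" && v + 0 >= hf + 0) { print "fatal"; exit }
        # Assumption: a reading equal to a threshold counts as tripped.
        if (lw != "X" && v + 0 <= lw + 0) { print "warning"; exit }
        if (hw != "X" && v + 0 >= hw + 0) { print "warning"; exit }
        print "normal"
    }'
}
```

For example, with the 12 V Aux Voltage thresholds from Table 10, `classify_sensor 11.2 10.56 11.40 X X` reports a warning, which is the condition under which fpgad masks AER.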

Use the following command to start the fpgad service:
$ sudo systemctl start fpgad

The configuration file only includes threshold settings for the critical sensors 12 V Aux Voltage (sensor 25) and 12 V Backplane Voltage (sensor 3). These sensors do not have a hardware-defined warning threshold, so fpgad relies on the configuration file for them. The other two critical sensors, FPGA Core Temperature (sensor 12) and Board Temperature (sensor 13), have hardware-defined warning and fatal thresholds set to the values listed in the table above. The fpgad uses this information to mask the PCIe AER registers when a sensor reaches its warning threshold.

The following excerpt from the fpgad.cfg file located at /etc/opae/ configures the 12 V Aux Voltage sensor (sensor 25):
"fpgad-vc": {
                        "configuration": {
                                "cool-down": 30,
                                "config-sensors-enabled": true,
                                "sensors": [
                                        {
                                                "id": 25,
                                                "low-warn": 11.40,
                                                "low-fatal": 10.56
                                        }
                                ]
                        },
                        "enabled": true,
                        "plugin": "libfpgad-vc.so",
                        "devices": [
                                [ "0x8086", "0x0b30" ],
                                [ "0x8086", "0x0b31" ]
                        ]
                }
You must create another entry below the 12 V Aux Voltage entry for 12 V Backplane Voltage (sensor 3). The updated configuration file should have the following entry:
"fpgad-vc": {
                        "configuration": {
                                "cool-down": 30,
                                "config-sensors-enabled": true,
                                "sensors": [
                                        {
                                                "id": 25,
                                                "low-warn": 11.40,
                                                "low-fatal": 10.56
                                        }
                                ]
                        },
                        "enabled": true,
                        "plugin": "libfpgad-vc.so",
                        "devices": [
                                [ "0x8086", "0x0b30" ],
                                [ "0x8086", "0x0b31" ]
                        ]
                }, 

"fpgad-vc": {
                        "configuration": {
                                "cool-down": 30,
                                "config-sensors-enabled": true,
                                "sensors": [
                                        {
                                                "id": 3,
                                                "low-warn": 11.40,
                                                "low-fatal": 10.56
                                        }
                                ]
                        },
                        "enabled": true,
                        "plugin": "libfpgad-vc.so",
                        "devices": [
                                [ "0x8086", "0x0b30" ],
                                [ "0x8086", "0x0b31" ]
                        ]
                }
You can monitor the log file to see whether an upper or lower warning threshold has been reached. For example:
$ tail -f /var/lib/opae/fpgad.log | grep "sensor.*warning"
fpgad-vc: sensor 'FPGA Die Temperature' warning

You must take appropriate action to recover from this warning before the sensor value reaches the upper or lower fatal limit. When a sensor reaches its warning threshold, the daemon masks the AER registers and the log file indicates that the sensor is tripped.

Sample output: Warning message when the FPGA Core Temperature exceeds the upper warning threshold limit:

$ tail -f /var/lib/opae/fpgad.log
fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0
fpgad-vc: sensor 'FPGA Die Temperature' still tripped.
Sample output: Warning message when the voltage falls below the lower warning threshold limit:
fpgad-vc: sensor '12V AUX Voltage' warning.
fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0
fpgad-vc: sensor '12V AUX Voltage' still tripped.
fpgad-vc: sensor '12V AUX Voltage' still tripped.
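
To summarize which sensors are currently reported in the log, the messages shown above can be filtered with standard tools. The sketch below is a hypothetical helper (tripped_sensors is not an OPAE command) and assumes the log lines keep the "sensor '<name>' warning" and "sensor '<name>' still tripped" wording from the samples in this section.

```shell
#!/bin/sh
# Hypothetical helper, not part of OPAE.
# Reads fpgad log lines on stdin and prints each sensor name that appears
# in a "warning" or "still tripped" message, once per sensor.
tripped_sensors() {
    sed -n \
        -e "s/^.*sensor '\([^']*\)' warning.*$/\1/p" \
        -e "s/^.*sensor '\([^']*\)' still tripped.*$/\1/p" |
    sort -u
}
```

A possible invocation is `tripped_sensors < /var/lib/opae/fpgad.log`, which prints one sensor name per line.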

If the upper or lower fatal threshold limit is reached, a power cycle of the server is required to recover the Intel® FPGA PAC N3000-N/2.

The fpgad unmasks AER after the sensor values return to the normal range, that is, above the lower warning threshold and below the upper warning threshold.
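
The "saving previous ECAP_AER+..." log lines record the register values that fpgad saves before masking. In the PCIe AER extended capability, offset 0x08 is the Uncorrectable Error Mask register and offset 0x14 is the Correctable Error Mask register. The following sketch lists the saved values per device so they can be compared after recovery. The helper name saved_aer_masks and its output format are hypothetical, and the log wording is assumed to match the samples in this section.

```shell
#!/bin/sh
# Hypothetical helper, not part of OPAE.
# Reads fpgad log lines on stdin and prints
# "<PCIe address> <register name> <saved value>" for each saved AER register.
saved_aer_masks() {
    sed -n 's/^.*saving previous ECAP_AER+\(0x[0-9a-fA-F]*\) value \(0x[0-9a-fA-F]*\) for \([0-9a-fA-F:.]*\).*$/\3 \1 \2/p' |
    while read -r addr offset value; do
        # Map the AER capability offset to its register name.
        case "$offset" in
            0x08) name="uncorrectable-error-mask" ;;
            0x14) name="correctable-error-mask" ;;
            *)    name="offset-$offset" ;;
        esac
        printf '%s %s %s\n' "$addr" "$name" "$value"
    done
}
```

A possible invocation is `saved_aer_masks < /var/lib/opae/fpgad.log`.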

Sample output when upper or lower fatal threshold is reached:
fpgad-vc: failed to read sensor xx
To stop fpgad:
$ sudo systemctl stop fpgad.service
To check status of fpgad:
$ sudo systemctl status fpgad.service
Optional: To enable fpgad to start automatically on boot, run:
$ sudo systemctl enable fpgad.service
For a full list of systemctl commands, run the following command:
$ systemctl -h