Summary
Troubleshooting steps for errors seen in the logs related to power, power supplies, or fans
Description
Examples of error messages seen in the logs:
- PSU2, AC lost, AC removed.
- Non-Redundant, sufficient from insufficient. The system is not running in redundant power supply mode. This event is accompanied by the specific power supply error "Alternating Current (AC) lost".
- Non-Redundant, Insufficient. System is not running in redundant power supply mode.
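To locate messages like these in a large log, a small filter can help. The following is a minimal sketch, not a definitive tool: the function name filter_power_events is hypothetical, and the search patterns are assumptions drawn only from the example messages above, so they may need adjusting to match how your log viewer words these events.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that mention power supply
# or power redundancy events. Reads log text on stdin.
# The patterns are assumptions based on the example messages above.
filter_power_events() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'AC lost|Non-Redundant|Power Supply|PSU[0-9]' || true
}

# Typical use on a live system (requires ipmitool and BMC access):
#   ipmitool sel elist | filter_power_events
```

The function only reads from standard input, so it can also be run against a saved copy of the log.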
Resolution
Step 1:
- Update the BIOS firmware (FW) to the latest version available (Version 22010091 or newer). Fixes were added for the power supply unit (PSU) FW and Baseboard Management Controller (BMC) communication. For details, refer to the BMC and Field Replaceable Unit and Sensor Data Record (FRUSDR) release notes.
- If PSU issues persist after the BIOS FW has been updated, follow Step 2 below.
Step 2:
Workaround: Multiple PSU failures detected
- If you see errors in the logs related to power, power supplies, or fans, note the color of the status Light-emitting diodes (LEDs) and check the sensors to see if the readings are normal or abnormal.
- The power supplies (PS1, PS2, PS3) should be within normal ranges for Input Power, Curr Out %, Inlet Temp, Temperature, and redundancy (2+1).
- If the sensor readings look abnormal, swap the suspect PSUs between slots to determine which of them are actually bad.
- Does the problem follow the PSU swap? If it does, the PSU itself is the likely cause.
- If the sensor readings look normal, but there are power-related errors in the logs, check the Status LEDs.
- If the PSU LEDs stay amber all the time when running a heavy workload, there is a workaround. Running the command below, which disables Power Supply Cold Redundancy, should make the amber LED go away:
Command (Disable Power Supply Cold Redundancy): ipmitool raw 0x30 0x2d 0x01 0x00
- If running the command above does not solve the issues reported in the logs, and you have already cross-checked the PSUs (by swapping them around), but the LED is still amber, the suspect PSU will need to be replaced.
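The sensor check described in the steps above can be partly automated. The sketch below is illustrative only and rests on several assumptions: that the sensor table comes from `ipmitool sensor` in its usual pipe-separated layout (name, reading, unit, status, thresholds), that PSU sensor names begin with PS1, PS2, or PS3 as in this article, and that a status other than "ok" deserves a closer look. The function name abnormal_psu_sensors is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list PSU threshold sensors whose status is not "ok".
# Reads pipe-separated sensor rows on stdin, assumed to look like:
#   PS1 Input Power | 120.000 | Watts | ok | na | na | ...
abnormal_psu_sensors() {
    awk -F'|' '
        $1 ~ /^PS[0-9]/ {
            name = $1; unit = $3; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", unit)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            # Discrete sensors report a state mask, not ok/nc/cr; skip them.
            if (unit == "discrete") next
            if (status != "ok") print name ": " status
        }'
}

# Typical use on a live system (requires ipmitool and BMC access):
#   ipmitool sensor | abnormal_psu_sensors
```

An empty result suggests that the PSU threshold sensors all report "ok". It does not rule out a fault, so the status LEDs and the logs should still be checked as described above.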
False Alarm: Nodes report AC lost
- Check for false alarms.
- If the amber LED does go away, but you still see AC lost error messages in the logs, check whether those errors were logged by the slave node.