Troubleshooting
This section of the user guide helps users who run into issues or whose cluster configuration requires additional steps to enable.
Multi-Rail InfiniBand
Multi-rail InfiniBand, sometimes referred to as dual-rail, is a configuration in which a server has two InfiniBand connections on the same network plane (single subnet). In this configuration, Intel MPI Library needs a few additional environment variables in order to use both ports simultaneously. Without them, our testing showed ‘sockets’ being used for communication rather than InfiniBand.
export I_MPI_OFA_PORTS=1
export I_MPI_OFA_NUM_ADAPTERS=2
More details on these environment variables can be found in this article.
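One way to confirm that the settings take effect is to look at the Intel MPI startup messages. The sketch below is illustrative only: it assumes that I_MPI_DEBUG=2 makes Intel MPI Library report the selected fabric at startup, and MPI_APP is a hypothetical placeholder for any MPI program on your system (for example, an IMB binary).

```shell
#!/bin/sh
# Settings from this section: use both InfiniBand ports.
export I_MPI_OFA_PORTS=1
export I_MPI_OFA_NUM_ADAPTERS=2

# Assumption: debug level 2 prints the fabric chosen at MPI startup.
export I_MPI_DEBUG=2

# MPI_APP is a hypothetical placeholder; point it at any MPI program.
MPI_APP="${MPI_APP:-}"

if [ -n "$MPI_APP" ] && command -v mpirun >/dev/null 2>&1; then
    # Show only the startup lines, which should name the fabric in use.
    mpirun -n 2 "$MPI_APP" 2>&1 | grep -i "startup" \
        || echo "No startup lines found in the MPI output"
else
    echo "Set MPI_APP and load the Intel MPI environment to run the check"
fi
```

If the startup lines still mention sockets, the variables are probably not reaching the MPI processes.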
Custom Libfabric Provider
Systems with newer network fabrics may use a libfabric provider that Intel MPI Library supports but Cluster Checker does not yet support. To override Cluster Checker and use the supported libfabric provider, set the following environment variable. This allows optimal MPI data to be collected, for example by running IMB PingPong to check for deviations in network performance.
export I_MPI_OFI_PROVIDER=variable
Here ‘variable’ is replaced with an appropriate libfabric provider for the system. The fi_info command shows which providers are available on the server.
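The provider lookup and the override can be combined in a short script. This is a minimal sketch, not a definitive procedure: extract_providers is a hypothetical helper that assumes fi_info -l prints each provider name on an unindented line ending in a colon, and verbs is only an example value that must be replaced with a provider actually reported on your system.

```shell
#!/bin/sh
# Hypothetical helper: keep only the provider names from `fi_info -l`
# style output (assumed format: unindented "name:" lines followed by
# indented detail lines such as the version).
extract_providers() {
    sed -n 's/^\([^[:space:]][^:]*\):[[:space:]]*$/\1/p'
}

if command -v fi_info >/dev/null 2>&1; then
    echo "Providers reported by fi_info:"
    fi_info -l | extract_providers
else
    echo "fi_info not found; load the Intel MPI or libfabric environment first" >&2
fi

# Example value only: replace 'verbs' with a provider from the list above.
export I_MPI_OFI_PROVIDER=verbs
```

Run the script in the same shell session that launches Cluster Checker, so that the exported variable is inherited.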