
Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 3/22/2024
Public



Choosing the Best Configuration and Problem Sizes for CPUs

The performance of the Intel CPU Optimized HPCG depends on many system parameters, including (but not limited to) the hardware configuration of the host and the MPI implementation used. To get the best performance for a specific system configuration, choose a suitable combination of the following parameters (a sketch showing how to inspect the node topology before choosing them appears after this list):

  • The number of MPI processes per host node

  • The number of OpenMP* threads per MPI process

  • The local problem size
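
Before choosing these values, it can help to inspect the node topology. The sketch below assumes the standard Linux tools lscpu and numactl are installed; the exact output fields vary by system.

  # Report sockets, cores per socket, threads per core, and NUMA nodes.
  lscpu | grep -E 'Socket|Core|Thread|NUMA'

  # Show which cores and how much memory belong to each NUMA node.
  numactl --hardware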

On Intel® Xeon® processor-based clusters, use the Intel AVX2 or Intel AVX-512 optimized version of the benchmark, depending on the instruction set the processors support. For CPUs with one natural NUMA node per socket (up to 3rd Generation Intel® Xeon® Scalable processors), we recommend one MPI process per CPU socket and one OpenMP* thread per physical CPU core, skipping SMT threads.

Starting with 4th Generation Intel® Xeon® Scalable processors, each socket often has a natural NUMA-like subdivision, and it is usually best to match the number of MPI processes per socket to these subdivisions. For instance, each socket of the Intel® Xeon® Platinum 8480+ and 9480 processors contains four dies of 14 physical cores each (on the 9480 model, each die also has an HBM stack attached), so four MPI processes per socket can deliver the best performance. Other processors have no such natural subdivision; even so, as core counts per socket grow, increasing the number of MPI processes per socket to reduce the number of OpenMP threads per process can improve load balance and performance. To find the best configuration for a given system, try both a single MPI process per socket using all OpenMP threads and multiple ranks per socket, targeting roughly 10–36 OpenMP threads per MPI process.
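
As a concrete starting point, the following launch sketch assumes a two-socket node with one NUMA node per socket, 28 physical cores per socket, the Intel MPI Library mpirun launcher, and an AVX-512 benchmark binary named xhpcg_skx in the current directory; the binary name, core counts, and local problem size (taken from hpcg.dat) are assumptions to adjust for your installation.

  # One MPI rank per socket, one OpenMP thread per physical core (SMT skipped).
  export OMP_NUM_THREADS=28
  export KMP_AFFINITY=granularity=fine,compact,1,0
  mpirun -np 2 -ppn 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpcg_skx

  # On a CPU with four NUMA-like domains per socket (for example, 14 cores each),
  # one rank per domain is often better:
  # mpirun -np 8 -ppn 8 -genv I_MPI_PIN_DOMAIN=numa -genv OMP_NUM_THREADS=14 ./xhpcg_skx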

For best performance, use a local problem size that is large enough to utilize the available cores well, but small enough that all MPI processes fit in the available memory. Because last-level cache (LLC) sizes per socket have grown substantially on modern CPUs, current HPCG benchmark requirements also demand that the local problem size (nx x ny x nz) be chosen large enough that the combined size of one vector across all MPI processes on a socket (each vector occupies nx*ny*nz*sizeof(double) bytes) does not fit in the LLC.
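
This sizing rule can be checked with simple shell arithmetic. The values below (local problem size, ranks per socket, and a 105 MiB LLC per socket) are example assumptions only; replace them with your configuration and the L3 size reported by lscpu or the processor specification.

  # Example values; substitute your local problem size, ranks per socket,
  # and the LLC (L3) size per socket for your CPU.
  nx=192; ny=192; nz=192
  ranks_per_socket=4
  llc_mib=105                                    # L3 per socket, in MiB (example)
  vec_bytes=$(( nx * ny * nz * 8 ))              # one vector of doubles per rank
  total_bytes=$(( vec_bytes * ranks_per_socket ))
  llc_bytes=$(( llc_mib * 1024 * 1024 ))
  echo "vectors per socket: ${total_bytes} bytes, LLC per socket: ${llc_bytes} bytes"
  if [ "${total_bytes}" -le "${llc_bytes}" ]; then
    echo "WARNING: increase nx, ny, or nz so the vectors do not fit in the LLC"
  fi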

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201