Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 3/22/2024
Public

Obtaining Numerically Reproducible Results

Intel® oneAPI Math Kernel Library (oneMKL) offers functions and environment variables that help you obtain Conditional Numerical Reproducibility (CNR) of floating-point results when calling the library functions from your application. These controls enable Intel® oneAPI Math Kernel Library (oneMKL) to run in a special mode in which functions return bitwise reproducible floating-point results from run to run under the following conditions:

  • Calls to Intel® oneAPI Math Kernel Library (oneMKL) occur in a single executable
  • The number of computational threads used by the library does not change in the run

For a limited set of routines, you can eliminate the second condition by using Intel® oneAPI Math Kernel Library (oneMKL) in strict CNR mode.
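As a sketch of how these controls might be used from C (based on the documented oneMKL CNR function mkl_cbwr_set and the MKL_CBWR_* constants from mkl.h; verify the exact constant names against your oneMKL version's reference):

```c
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Pin oneMKL to a single code branch (here AVX2) so that
       floating-point results are bitwise reproducible from run to run,
       provided the thread count also stays fixed. */
    if (mkl_cbwr_set(MKL_CBWR_AVX2) != MKL_CBWR_SUCCESS) {
        fprintf(stderr, "requested CNR branch not supported on this CPU\n");
        return 1;
    }

    /* For the limited set of routines that support strict CNR mode,
       OR in the strict flag to also remove the dependence on the
       number of computational threads:
         mkl_cbwr_set(MKL_CBWR_AVX2 | MKL_CBWR_STRICT);                */

    /* The same effect is available without code changes through the
       MKL_CBWR environment variable, e.g.:
         export MKL_CBWR=AVX2
         export MKL_CBWR=AVX512,STRICT                                 */

    /* ... call oneMKL routines here ... */
    return 0;
}
```

Note that mkl_cbwr_set must be called before any oneMKL computational routine in order to take effect.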

It is well known that for general single- and double-precision IEEE floating-point numbers, the associative property does not always hold, meaning (a+b)+c may not equal a+(b+c). Let's consider a specific example. In infinite-precision arithmetic, 2^-63 + 1 + (-1) = 2^-63. If this same computation is done on a computer using double-precision floating-point numbers, a rounding error is introduced, and the order of operations becomes important:

(2^-63 + 1) + (-1) = 1 + (-1) = 0

versus

2^-63 + (1 + (-1)) = 2^-63 + 0 = 2^-63

This inconsistency in results due to order of operations is precisely what the new functionality addresses.

The application-related factors that affect the order of floating-point operations within a single executable program include selection of a code path based on run-time processor dispatching, alignment of data arrays, variation in the number of threads, threaded algorithms, and internal floating-point control settings. You can control most of these factors by controlling the number of threads and the floating-point settings and by taking steps to align memory when it is allocated. However, run-time dispatching and certain threaded algorithms do not allow users to make changes that can ensure the same order of operations from run to run.

Intel® oneAPI Math Kernel Library (oneMKL) does run-time processor dispatching in order to identify the appropriate internal code paths to traverse for the Intel® oneAPI Math Kernel Library (oneMKL) functions called by the application. The code paths chosen may differ across a wide range of Intel processors and Intel architecture compatible processors and may provide differing levels of performance. For example, an Intel® oneAPI Math Kernel Library (oneMKL) function running on an Intel® Pentium® 4 processor may run one code path, while on the latest Intel® Xeon® processor it will run another code path. This happens because each unique code path has been optimized to match the features available on the underlying processor. One key way that the new features of a processor are exposed to the programmer is through the instruction set architecture (ISA). Because of this, code branches in Intel® oneAPI Math Kernel Library (oneMKL) are designated by the latest ISA they use for optimizations: from the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) to the Intel® Advanced Vector Extensions 2 (Intel® AVX2). The feature-based approach introduces a challenge: if any of the internal floating-point operations are done in a different order or are re-associated, the computed results may differ.

Dispatching optimized code paths based on the capabilities of the processor on which the code is running is central to the optimization approach used by Intel® oneAPI Math Kernel Library (oneMKL). So it is natural that consistent results require some performance trade-offs. If limited to a particular code path, performance of Intel® oneAPI Math Kernel Library (oneMKL) can in some circumstances degrade by more than half. To understand this, note that matrix-multiply performance nearly doubled with the introduction of new processors supporting Intel AVX2 instructions. Even if the code branch is not restricted, performance can degrade by 10-20% because the new functionality restricts algorithms to maintain the order of operations.

Numerically Reproducible Results for GPU Computations

oneMKL 2024.1 introduces CNR support for certain GPU computations: level-3 BLAS routines (for example, gemm and trsm) and batched versions of these routines. When any CNR code branch is enabled, GPU CNR support is enabled automatically, ensuring run-to-run bitwise reproducible results for these routines on Intel® GPUs. The specific code branch chosen (AVX2, AVX-512, and so forth) is ignored for GPU execution.

When GPU CNR mode is enabled, oneMKL guarantees that running the same computation multiple times on the same GPU will result in deterministic, bitwise-identical results. Two GPUs are considered to be the same if they have the same product name (for example, Intel® Arc™ A770). Note that results may differ across devices (between CPU and GPU, or between different kinds of GPUs).

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201