Developer Guide and Reference

Programming Tradeoffs in Floating-Point Applications

In general, the programming objectives for floating-point applications fall into the following categories:

  • Accuracy: The application produces results that are close to the correct result.
  • Reproducibility and portability: The application produces consistent results across different runs, different sets of build options, different compilers, different platforms, and different architectures.
  • Performance: The application produces fast, efficient code.

Based on the goal of an application, you will need to make tradeoffs among these objectives. For example, if you are developing a 3D graphics engine, performance may be the most important factor to consider, with reproducibility and accuracy as secondary concerns.

The default behavior of the compiler is to compile for performance. Several options are available that allow you to tune your applications based on specific objectives. Broadly speaking, there are the floating-point specific options, such as the -fp-model (Linux*) or /fp (Windows*) option, and the fast-but-low-accuracy options, such as the [Q]imf-max-error option (host only). The compiler optimizes and generates code differently when you specify these different compiler options. Select appropriate compiler options by carefully balancing your programming objectives and making tradeoffs among these objectives. Some of these options may influence the choice of math routines that are invoked.
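
For example, on Linux you might combine the two families of options on one command line. A minimal sketch, assuming the icpx driver and a source file app.cpp (both names illustrative):

icpx -fp-model=precise app.cpp
icpx -fp-model=fast -fimf-max-error=2 app.cpp

The first command favors accuracy and reproducibility; the second keeps fast-math optimizations while bounding the relative error of math library calls.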

Use Floating-Point Options

The default behavior of the compiler is to use fp-model=fast. In this mode, lower-accuracy versions of the math library functions are chosen. For host code, this only affects calls that have been vectorized. For target code, the exact effect will vary depending on the target.

For GPU devices, using fp-model=fast enables lower-accuracy versions of the functions. These lower-accuracy implementations conform to the OpenCL 3.0 specification. SVML functions with up to four ULPs of error (equivalent to using -fimf-precision=medium) are used.

For FPGA devices, using fp-model=fast enables lower-accuracy versions of the functions, but there is no specific limit on the accuracy.

With fp-model=precise, the host code will use high accuracy implementations for both scalar and SVML calls. Target devices use implementations that conform to the SYCL specification; these are not as accurate as the host implementations, but are more accurate than those used with fp-model=fast.

Take the following code as an example:

float t0, t1, t2;
...
t0 = t1 + t2 + 4.0f + 0.1f;

If you specify the -fp-model precise (Linux) or /fp:precise (Windows) option in favor of accuracy, the compiler generates the following assembly code:

movss     xmm0, _t1 
addss     xmm0, _t2 
addss     xmm0, DWORD PTR _Cnst4.0 
addss     xmm0, DWORD PTR _Cnst0.1 
movss     DWORD PTR _t0, xmm0

The assembly code follows the same semantics as the original source code: the additions are performed strictly in the order written, with no reassociation.

If you specify the -fp-model fast (Linux) or /fp:fast (Windows) option in favor of performance, the compiler generates the following assembly code:

movss     xmm0, DWORD PTR _Cnst4.1 
addss     xmm0, DWORD PTR _t1 
addss     xmm0, DWORD PTR _t2 
movss     DWORD PTR _t0, xmm0

This code maximizes performance using Intel® Streaming SIMD Extensions (Intel® SSE) instructions and pre-computing 4.0f + 0.1f. It is not as accurate as the first implementation, due to the greater intermediate rounding error. It does not provide reproducible results because it must reorder the addition to pre-compute 4.0f + 0.1f. When fast-math is enabled, the ordering of operations is decided by the compiler. Different ordering may be used depending on the context, and not all compilers will choose the same ordering.
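
The effect of reassociation is visible without inspecting assembly. The following is a minimal C sketch (values chosen for illustration, not taken from this example) that emulates both orderings with explicit parentheses; compile it without fast-math so the parentheses are honored:

#include <stdio.h>

int main(void) {
    /* Illustrative values: t1 is large enough that adding 4.0f alone
     * is lost to rounding, while adding the pre-combined constant is not. */
    float t1 = 1.0e8f, t2 = 0.0f;

    /* Strict left-to-right evaluation, as with -fp-model precise. */
    float precise = t1 + t2 + 4.0f + 0.1f;

    /* Pre-combined constant, as with -fp-model fast. */
    float fast = t1 + t2 + (4.0f + 0.1f);

    /* At this magnitude the two results can differ by one ULP. */
    printf("precise: %.9g\n", precise);
    printf("fast:    %.9g\n", fast);
    return 0;
}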

For many other applications, the considerations may be more complicated.

Tune Compilation Accuracy

In general, the -fp-model option controls accuracy. In addition, the compiler provides command-line options that offer an easy way to control the accuracy of mathematical functions and to take advantage of the performance/accuracy tradeoffs offered by the Intel math libraries provided with the compiler. These options are helpful in the following scenarios:

  • Use high-accuracy implementations while otherwise allowing fast-math optimizations
  • Use faster-but-less-accurate implementations while otherwise disabling fast-math optimizations

You can specify accuracy, via a command line interface, for all math functions or a selected set of math functions at a level more precise than low, medium, or high.
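
For example, the two scenarios above can be expressed as option combinations such as the following (Linux spellings, shown as illustrative sketches):

-fp-model=fast -fimf-precision=high        (fast-math optimizations, high-accuracy math calls)
-fp-model=precise -fimf-max-error=4        (strict floating-point semantics, faster math calls)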

You specify the accuracy requirements as a set of function attributes that the compiler uses for selecting an appropriate function implementation in the math libraries. For example, use the following option to specify the relative error of two ULPs for all single, double, long double, and quad precision functions:

-fimf-max-error=2

To specify twelve bits of accuracy for a sin function, use:

-fimf-accuracy-bits=12:sin

To specify relative error of ten ULPs for a sin function, and four ULPs for other math functions called in the source file you are compiling, use:

-fimf-max-error=10:sin -fimf-max-error=4

On Windows systems, the compiler defines the default value of the max-error attribute according to the /fp option setting. In /fp:fast mode, the compiler sets max-error=4.0 for math calls; otherwise, it sets max-error=0.6.
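
For example, to keep /fp:fast while tightening the math-call error bound on Windows, you might use a command line like the following sketch (icx driver and option spelling assumed):

icx /fp:fast /Qimf-max-error:2 app.cpp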

For high-accuracy math functions in host code, use the -fimf-precision option. For high-accuracy math functions in device code, use the -f[no-]approx-func option.

On Windows, use the /fp:precise option for more accurate floating-point SYCL operations.
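
For instance, a device-code build that requests high-accuracy math functions might look like the following sketch (icpx and -fsycl assumed for a SYCL compilation; file name illustrative):

icpx -fsycl -fno-approx-func app.cpp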

Note that the OpenCL* standard provides guidelines for each floating-point operation, such as cos(), that specify the maximum ULP variance a conforming device must support. For example, for cosine, a conforming device cannot have a ULP variance higher than 4. However, with the default fast floating-point operations, the ULP variance will likely be higher than what the OpenCL standard requires.

Dispatching of Math Routines

The compiler optimizes calls to routines from the libm and svml libraries into direct CPU-specific calls when the compilation configuration specifies the CPU for which the code is tuned, and when the instruction set available at compile time is at least as wide as the instruction set of that tuning target.
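
For example, compiling with a tuning target lets the compiler bind these calls to CPU-specific implementations. A sketch (flag shown as one illustrative choice):

icpx -O2 -xHost app.c

With -xHost, the code is compiled and tuned for the instruction set of the build machine, so both conditions above are satisfied.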

Note that, except for functions that return correctly-rounded results (<= 0.5 ULP error), you cannot rely on obtaining bitwise-identical results from different device types. This is mainly due to differences in the implementations of math library functions, which are optimized for the instruction set available on each device.

The use of floating-point options to require high accuracy implementations of the math library routines will reduce the impact of this problem, but not eliminate it. Depending on the algorithm used by the program being compiled, small errors may be compounded.

The use of less accurate implementations may amplify the differences. For example, if the cos() function is called with a four ULP error implementation, all devices will return a result that is within four ULP of the theoretically accurate result, but there is no guarantee that two different devices will return the same result within that error range.
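
When you need to quantify such differences, counting ULPs is more informative than absolute error. The following is a minimal C sketch (helper names are hypothetical) that measures how many representable floats lie between two results, such as cos() values obtained from two devices:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Map a finite float onto an integer scale that is monotonic in the
 * floating-point ordering, so subtraction counts representable values. */
static int64_t ordered_value(float f) {
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits >= 0) ? (int64_t)bits
                       : (int64_t)INT32_MIN - (int64_t)bits;
}

/* ULP distance between two finite floats (hypothetical helper). */
static int64_t ulp_distance(float a, float b) {
    return llabs(ordered_value(a) - ordered_value(b));
}

int main(void) {
    /* Stand-in values for the same call evaluated on two devices. */
    float device_a = 0.70710678f;
    float device_b = 0.70710683f;
    printf("results differ by %lld ULPs\n",
           (long long)ulp_distance(device_a, device_b));
    return 0;
}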
