Accuracy Versus Performance Tradeoffs in Floating-Point Computations
Programmers of floating-point applications typically aim for the following two objectives:
Accuracy: Produce results that are “close” to the result of the exact calculation.
Performance: Produce an application that runs as fast as possible.
The two objectives usually conflict. However, good programming practices and judicious use of compiler options allow you to control the tradeoffs.
For more information, see the article Consistency of Floating-Point Results using the Intel Compiler, or Why doesn't my application always give the same answer? by Dr. Martyn J. Corden and David Kreitzer (2018).
In this section, we present some mechanisms (compiler options and source-level changes) that allow the programmer to control the semantics (and hence the accuracy and performance) of floating-point computations on the host as well as on the device. We describe compiler options for OpenMP and SYCL programs, and describe source-level changes in SYCL.
OpenMP
In OpenMP, the following -fp-model options may be used to control the semantics of floating-point computations on the host as well as on the device.
-fp-model=precise
This option tells the compiler to use only value-safe optimizations when implementing floating-point calculations. It disables optimizations (such as re-association, multiplication by reciprocal, and zero folding) that can change the result of floating-point calculations. The increased accuracy that comes with -fp-model=precise may result in lower performance.
-fp-model=fast
This option is the default for both host and device compilations at -O2 and above. The option tells the compiler to use more aggressive optimizations when implementing floating-point calculations. These optimizations increase speed but may affect the accuracy or reproducibility of floating-point computations.
In C/C++, the -fp-model=fast option is equivalent to the -ffast-math option. With this option (at -O2 and above), all 7 fast-math flags (nnan, ninf, nsz, arcp, contract, afn, reassoc) are set by the C/C++ front-end. (See https://llvm.org/docs/LangRef.html#fast-math-flags for a description of the fast-math flags in LLVM.)
In Fortran, on the other hand, the language rules dictate that we cannot set the nnan flag (No NaNs) by default. So the -fp-model=fast option (at -O2 and above) only sets the other 6 fast-math flags (ninf, nsz, arcp, contract, afn, reassoc). To set all 7 fast-math flags in Fortran, use the -ffast-math option.
-Xopenmp-target-backend "-options -cl-fast-relaxed-math"
The -fp-model=fast (or -ffast-math) option does not enable native math instructions on the Intel GPU (Intel® Data Center GPU Max Series). You need to compile with -Xopenmp-target-backend "-options -cl-fast-relaxed-math" to get native math instructions on the GPU. Native math instructions give even lower accuracy than what is allowed with -fp-model=fast.
-Xopenmp-target-backend "-options -cl-fast-relaxed-math" passes the -cl-fast-relaxed-math option to the backend in the compilation tool chain for the device. -cl-fast-relaxed-math relaxes the precision of commonly used math functions on the device. It offers a quick way to get performance gains for kernels with many math library function calls, as long as the accuracy provided by native math instructions meets the application's requirements.
The -cl-fast-relaxed-math option affects the compilation of the entire program and does not permit fine control of the resulting numeric accuracy.
Note that Intel GPUs support native math instructions for single precision (float, real) only.
Notes (OpenMP):
When -fp-model is specified on the compilation command line (outside of the -fopenmp-targets=spir64="…" set of options), the -fp-model option applies to both the host and the device compilations. For example, the following compilation command specifies -fp-model=precise for both the host and the device compilations:
icx/icpx/ifx -O2 -fiopenmp -fopenmp-targets=spir64 -fp-model=precise
You can specify different ``-fp-model`` keywords for the host and the device compilations as shown below.
To specify -fp-model=fast for the host compilation, and -fp-model=precise for the device compilation:
icx/icpx/ifx -O2 -fiopenmp -fopenmp-targets=spir64="-fp-model=precise" -fp-model=fast
Or:
icx/icpx/ifx -O2 -fiopenmp -fopenmp-targets=spir64="-fp-model=precise"
(No need to specify ``-fp-model=fast`` since it is the default at -O2 or higher.)
To specify -fp-model=precise for the host compilation, and -fp-model=fast for the device compilation:
icx/icpx/ifx -O3 -fiopenmp -fopenmp-targets=spir64="-fp-model=fast" -fp-model=precise
To specify -fp-model=fast for the host compilation, and relaxed-math for the device compilation:
icx/icpx/ifx -O2 -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-options -cl-fast-relaxed-math" -fp-model=fast
Or:
icx/ifx -O2 -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-options -cl-fast-relaxed-math"
(No need to explicitly specify ``-fp-model=fast`` since it is the default at -O2 or higher.)
You can combine the fast, precise, and relaxed-math options with the -fimf-precision=low (or medium, or high) option to fine-tune precision on the host side. The -fimf-precision option is currently not supported on the device side.
The following table summarizes how to apply the various floating-point semantics (precise, fast-math, relaxed-math) in OpenMP to (a) both the host and the device (second column); (b) the host only (third column); and (c) the device only (fourth column).
| Floating Point Semantics | Apply to Host and Device Compilations | Apply to Host Compilation Only | Apply to Device Compilation Only |
|---|---|---|---|
| Precise | -fp-model=precise | -fp-model=precise, and specify -fiopenmp -fopenmp-targets=spir64="-fp-model=fast" | -fiopenmp -fopenmp-targets=spir64="-fp-model=precise" |
| Fast-math | -fp-model=fast (default) | -fp-model=fast, and specify -fiopenmp -fopenmp-targets=spir64="-fp-model=precise" | -fp-model=precise, and specify -fiopenmp -fopenmp-targets=spir64="-fp-model=fast" |
| Relaxed-math (native instructions) | Applies to device only | Applies to device only | -Xopenmp-target-backend "-options -cl-fast-relaxed-math" |
SYCL
In SYCL, as in OpenMP, the -fp-model=fast and -fp-model=precise options may be used for both host and device compilations.
In SYCL, the -fp-model=fast option is equivalent to the -ffast-math option, and is the default for both host and device compilations at -O2 and above.
To specify relaxed-math for device compilation, use the compiler option -Xsycl-target-backend "-options -cl-fast-relaxed-math". You need to compile with this option to get native math instructions on the GPU.
Note that SYCL (and Intel GPUs) support native math functions for single precision (float) only.
Notes (SYCL):
When -fp-model is specified on the compilation command line (outside of any -Xsycl-target option), the -fp-model option applies to both the host and the device compilations. For example, the following compilation command specifies -fp-model=precise for both the host and the device compilations:
icx/icpx -fsycl -fp-model=precise
You can specify different ``-fp-model`` keywords for the host and the device compilations as shown below.
To specify -fp-model=fast for the host compilation, and -fp-model=precise for the device compilation:
icx/icpx -fsycl -Xsycl-target-frontend "-fp-model=precise" -fp-model=fast
Or:
icx/icpx -fsycl -Xsycl-target-frontend "-fp-model=precise"
(No need to specify ``-fp-model=fast`` since it is the default at -O2 or higher.)
To specify -fp-model=precise for the host compilation, and -fp-model=fast for the device compilation:
icx/icpx -fsycl -Xsycl-target-frontend "-fp-model=fast" -fp-model=precise
To specify -fp-model=fast for the host compilation, and relaxed-math for the device compilation:
icx/icpx -fsycl -Xsycl-target-backend "-options -cl-fast-relaxed-math" -fp-model=fast
Or:
icx/icpx -fsycl -Xsycl-target-backend "-options -cl-fast-relaxed-math"
(No need to specify ``-fp-model=fast`` since it is the default at -O2 or higher.)
The following table summarizes how to apply the various floating-point semantics (precise, fast-math, relaxed-math) in SYCL to (a) both the host and the device (second column); (b) the host only (third column); and (c) the device only (fourth column).
| Floating Point Semantics | Apply to Host and Device Compilations | Apply to Host Compilation Only | Apply to Device Compilation Only |
|---|---|---|---|
| Precise | -fp-model=precise | -fp-model=precise, and specify -Xsycl-target-frontend "-fp-model=fast" | -Xsycl-target-frontend "-fp-model=precise" |
| Fast-math | -fp-model=fast (default) | -fp-model=fast, and specify -Xsycl-target-frontend "-fp-model=precise" | -fp-model=precise, and specify -Xsycl-target-frontend "-fp-model=fast" |
| Relaxed-math (native instructions) | Applies to device only | Applies to device only | -Xsycl-target-backend "-options -cl-fast-relaxed-math" |
Guidelines
In general, here are some guidelines for which options to use:
Do not specify inconsistent options. The result will be unpredictable.
The most commonly used option is -fp-model=fast for both host and device.
Use relaxed-math for best performance on the device.
Use -fp-model=precise for highest precision.
Example: The log Function
The math library (a component of a programming language's standard library) provides routines for the most common mathematical functions, such as exponential, logarithmic, power, and trigonometric functions.
Different implementations of the math library functions may not have the same accuracy or round the same way. The value returned by a math library function may also vary from one compiler release to another, due to algorithmic and optimization changes.
The accuracy of a math library function can be controlled via compiler options or via the source code. We use the log (natural logarithm) math function as an example to illustrate this.
OpenMP / C++ (test_log_omp.cpp)
The following is an OpenMP C++ program that calls the std::log function on the device (from inside OpenMP target regions). The program includes the cmath header file which contains definitions for common math functions.
#include <iostream>
#include <assert.h>
#include <chrono>
#include <cmath>

#if FP_SIZE == 32
typedef float FP_TYPE;
static constexpr FP_TYPE VALIDATION_THRESHOLD = 1e-3;
#elif FP_SIZE == 64
typedef double FP_TYPE;
static constexpr FP_TYPE VALIDATION_THRESHOLD = 1e-6;
#endif

template<typename T>
void do_work (unsigned NELEMENTS, unsigned NREPETITIONS, T initial_value, T *res)
{
  #pragma omp target teams distribute parallel for map(present,alloc:res[0:NELEMENTS])
  for (unsigned j = 0; j < NELEMENTS; j++)
  {
    T tmp = initial_value;
    for (unsigned i = 0; i < NREPETITIONS; ++i)
      tmp += std::log(tmp);
    res[j] = tmp;
  }
}

int main (int argc, char *argv[])
{
  static constexpr unsigned NELEMENTS = 64*1024*1024;
  static constexpr unsigned NREPETITIONS = 1024;

  // Warm up the device and the OpenMP offload runtime
  #pragma omp target
  { }

  // Compute the reference result on the host
  FP_TYPE initial_value = 2;
  FP_TYPE ref_res = initial_value;
  for (unsigned i = 0; i < NREPETITIONS; ++i)
    ref_res += std::log(ref_res);
  std::cout << "reference result = " << ref_res << std::endl;

  {
    FP_TYPE * std_res = new FP_TYPE[NELEMENTS];
    assert (std_res != nullptr);
    std::chrono::duration<float, std::micro> elapsed;

    #pragma omp target data map(std_res[0:NELEMENTS])
    {
      auto tbegin = std::chrono::system_clock::now();
      do_work<FP_TYPE> (NELEMENTS, NREPETITIONS, initial_value, std_res);
      auto tend = std::chrono::system_clock::now();
      elapsed = tend - tbegin;
    }
    std::cout << "std::log result[0] = " << std_res[0] << std::endl;

    bool allequal = true;
    for (auto i = 1; i < NELEMENTS; ++i)
      allequal = allequal and std_res[0] == std_res[i];
    if (allequal)
    {
      if (std::abs(ref_res - std_res[0])/std::abs(ref_res) < std::abs(VALIDATION_THRESHOLD))
        std::cout << "std::log validates. Total execution time is " << elapsed.count() << " us." << std::endl;
      else
        std::cout << "std::log does not validate (ref=" << ref_res << " std_res=" << std_res[0] << " mix=" << std::abs(ref_res - std_res[0])/std::abs(ref_res) << ")" << std::endl;
    }
    else
      std::cout << "std::log does not validate, results are not equal." << std::endl;
    delete [] std_res;
  }
  return 0;
}
Sample compilation and run commands for test_log_omp.cpp:
icpx -O2 -fiopenmp -fopenmp-targets=spir64 test_log_omp.cpp \
-DREAL_ELEMENT -DFP_SIZE=64 -fp-model=fast -fopenmp-version=51
OMP_TARGET_OFFLOAD=MANDATORY ./a.out
OpenMP / Fortran (test_log_omp_f_mod.f90)
The following is an OpenMP Fortran module that calls the Fortran intrinsic math function, log, on the device (from inside OpenMP target regions):
MODULE test
  USE ISO_C_BINDING
CONTAINS

  SUBROUTINE log_real_sp (nelements, nrepetitions, initial_value, res) bind(C,NAME='log_real_sp')
    IMPLICIT NONE
    INTEGER(KIND=C_INT), VALUE :: nelements, nrepetitions
    REAL(C_FLOAT), VALUE :: initial_value
    REAL(C_FLOAT) :: res(0:nelements-1), tmp
    INTEGER :: i, j

    !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO PRIVATE(tmp)
    DO j = 0, nelements-1
      tmp = initial_value
      DO i = 0, nrepetitions-1
        tmp = tmp + log(tmp)
      END DO
      res(j) = tmp
    END DO
    RETURN
  END SUBROUTINE log_real_sp

  SUBROUTINE log_real_dp (nelements, nrepetitions, initial_value, res) bind(C,NAME='log_real_dp')
    IMPLICIT NONE
    INTEGER(KIND=C_INT), VALUE :: nelements, nrepetitions
    REAL(C_DOUBLE), VALUE :: initial_value
    REAL(C_DOUBLE) :: res(0:nelements-1), tmp
    INTEGER :: i, j

    !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO PRIVATE(tmp)
    DO j = 0, nelements-1
      tmp = initial_value
      DO i = 0, nrepetitions-1
        tmp = tmp + log(tmp)
      END DO
      res(j) = tmp
    END DO
    RETURN
  END SUBROUTINE log_real_dp

END MODULE test
Sample compilation command for test_log_omp_f_mod.f90:
ifx -c -O2 -fiopenmp -fopenmp-targets=spir64 test_log_omp_f_mod.f90
SYCL (test_log_sycl.cpp)
In SYCL, you can control floating-point semantics at the source level by choosing which math function to call. For example, the SYCL program below calls the following three versions of the log function:
std::log : The log function in the C++ standard library. The particular implementation chosen depends on the compiler options (-fp-model and -cl-fast-relaxed-math). For example, to get the implementation that uses native math instructions, compile with the -cl-fast-relaxed-math option.
sycl::log : The log function in the sycl namespace provided by SYCL. This function may use native instructions, even when the -cl-fast-relaxed-math option is not specified. Its precision is spelled out by the SYCL specification.
sycl::native::log : The native log function in the sycl namespace provided by SYCL. This function always uses native math instructions; the -cl-fast-relaxed-math option is not needed. Note that SYCL (and Intel GPUs) support native math functions for single precision (float) only. Its precision is spelled out by the SYCL specification.
#include <CL/sycl.hpp>
#include <iostream>
#include <assert.h>
#include <chrono>
#include <cmath>

#if FP_SIZE == 32
typedef float FP_TYPE;
static constexpr FP_TYPE VALIDATION_THRESHOLD = 1e-3;
#elif FP_SIZE == 64
typedef double FP_TYPE;
static constexpr FP_TYPE VALIDATION_THRESHOLD = 1e-6;
#endif

template<typename T>
void do_work_std (sycl::queue &q, unsigned NELEMENTS, unsigned NREPETITIONS, T initial_value, T *res)
{
  q.submit([&](sycl::handler& h) {
    h.parallel_for(NELEMENTS, [=] (auto j)
    {
      FP_TYPE tmp = initial_value;
      for (unsigned i = 0; i < NREPETITIONS; ++i)
        tmp += std::log(tmp);
      res[j] = tmp;
    });
  }).wait();
}

template<typename T>
void do_work_sycl (sycl::queue &q, unsigned NELEMENTS, unsigned NREPETITIONS, T initial_value, T *res)
{
  q.submit([&](sycl::handler& h) {
    h.parallel_for(NELEMENTS, [=] (auto j)
    {
      FP_TYPE tmp = initial_value;
      for (unsigned i = 0; i < NREPETITIONS; ++i)
        tmp += sycl::log(tmp);
      res[j] = tmp;
    });
  }).wait();
}

#if FP_SIZE == 32
// sycl::native::log is available for single precision only
template<typename T>
void do_work_sycl_native (sycl::queue &q, unsigned NELEMENTS, unsigned NREPETITIONS, T initial_value, T *res)
{
  q.submit([&](sycl::handler& h) {
    h.parallel_for(NELEMENTS, [=] (auto j)
    {
      FP_TYPE tmp = initial_value;
      for (unsigned i = 0; i < NREPETITIONS; ++i)
        tmp += sycl::native::log(tmp);
      res[j] = tmp;
    });
  }).wait();
}
#endif

int main (int argc, char *argv[])
{
  static constexpr unsigned NELEMENTS = 64*1024*1024;
  static constexpr unsigned NREPETITIONS = 1024;

  sycl::device d (sycl::gpu_selector_v);
  sycl::queue q (d);

  // Warm up the device
  q.submit([&](sycl::handler& h) {
    h.single_task ([=]() { });
  }).wait();

  // Compute the reference result on the host
  FP_TYPE initial_value = 2;
  FP_TYPE ref_res = initial_value;
  for (unsigned i = 0; i < NREPETITIONS; ++i)
    ref_res += std::log(ref_res);
  std::cout << "reference result = " << ref_res << std::endl;

  {
    FP_TYPE * std_res = new FP_TYPE[NELEMENTS];
    assert (std_res != nullptr);
    std::chrono::duration<float, std::micro> elapsed;
    {
      auto * res = sycl::malloc_device<FP_TYPE>(NELEMENTS, q);
      auto tbegin = std::chrono::system_clock::now();
      do_work_std<FP_TYPE>(q, NELEMENTS, NREPETITIONS, initial_value, res);
      auto tend = std::chrono::system_clock::now();
      elapsed = tend - tbegin;
      q.memcpy (std_res, res, NELEMENTS*sizeof(FP_TYPE)).wait();
      sycl::free (res, q);
    }
    std::cout << "std::log result[0] = " << std_res[0] << std::endl;

    bool allequal = true;
    for (auto i = 1; i < NELEMENTS; ++i)
      allequal = allequal and std_res[0] == std_res[i];
    if (allequal)
    {
      if (std::abs(ref_res - std_res[0])/std::abs(ref_res) < std::abs(VALIDATION_THRESHOLD))
        std::cout << "std::log validates. Total execution time is " << elapsed.count() << " us." << std::endl;
      else
        std::cout << "std::log does not validate (ref=" << ref_res << " std_res=" << std_res[0] << " mix=" << std::abs(ref_res - std_res[0])/std::abs(ref_res) << ")" << std::endl;
    }
    else
      std::cout << "std::log does not validate, results are not equal." << std::endl;
    delete [] std_res;
  }

  {
    FP_TYPE * sycl_res = new FP_TYPE[NELEMENTS];
    assert (sycl_res != nullptr);
    std::chrono::duration<float, std::micro> elapsed;
    {
      auto * res = sycl::malloc_device<FP_TYPE>(NELEMENTS, q);
      auto tbegin = std::chrono::system_clock::now();
      do_work_sycl<FP_TYPE>(q, NELEMENTS, NREPETITIONS, initial_value, res);
      auto tend = std::chrono::system_clock::now();
      elapsed = tend - tbegin;
      q.memcpy (sycl_res, res, NELEMENTS*sizeof(FP_TYPE)).wait();
      sycl::free (res, q);
    }
    std::cout << "sycl::log result[0] = " << sycl_res[0] << std::endl;

    bool allequal = true;
    for (auto i = 1; i < NELEMENTS; ++i)
      allequal = allequal and sycl_res[0] == sycl_res[i];
    if (allequal)
    {
      if (std::abs(ref_res - sycl_res[0])/std::abs(ref_res) < std::abs(VALIDATION_THRESHOLD))
        std::cout << "sycl::log validates. Total execution time is " << elapsed.count() << " us." << std::endl;
      else
        std::cout << "sycl::log does not validate (ref=" << ref_res << " sycl_res=" << sycl_res[0] << " mix=" << std::abs(ref_res - sycl_res[0])/std::abs(ref_res) << ")" << std::endl;
    }
    else
      std::cout << "sycl::log does not validate, results are not equal." << std::endl;
    delete [] sycl_res;
  }

#if FP_SIZE == 32
  {
    FP_TYPE * sycl_native_res = new FP_TYPE[NELEMENTS];
    assert (sycl_native_res != nullptr);
    std::chrono::duration<float, std::micro> elapsed;
    {
      auto * res = sycl::malloc_device<FP_TYPE>(NELEMENTS, q);
      auto tbegin = std::chrono::system_clock::now();
      do_work_sycl_native<FP_TYPE>(q, NELEMENTS, NREPETITIONS, initial_value, res);
      auto tend = std::chrono::system_clock::now();
      elapsed = tend - tbegin;
      q.memcpy (sycl_native_res, res, NELEMENTS*sizeof(FP_TYPE)).wait();
      sycl::free (res, q);
    }
    std::cout << "sycl::native::log result[0] = " << sycl_native_res[0] << std::endl;

    bool allequal = true;
    for (auto i = 1; i < NELEMENTS; ++i)
      allequal = allequal and sycl_native_res[0] == sycl_native_res[i];
    if (allequal)
    {
      if (std::abs(ref_res - sycl_native_res[0])/std::abs(ref_res) < std::abs(VALIDATION_THRESHOLD))
        std::cout << "sycl::native::log validates. Total execution time is " << elapsed.count() << " us." << std::endl;
      else
        std::cout << "sycl::native::log does not validate (ref=" << ref_res << " sycl_native_res=" << sycl_native_res[0] << " mix=" << std::abs(ref_res - sycl_native_res[0])/std::abs(ref_res) << ")" << std::endl;
    }
    else
      std::cout << "sycl::native::log does not validate, results are not equal." << std::endl;
    delete [] sycl_native_res;
  }
#endif // FP_SIZE == 32

  return 0;
}
Sample compilation and run commands for test_log_sycl.cpp:
icpx -fsycl -O2 test_log_sycl.cpp -DREAL_ELEMENT -DFP_SIZE=64 -fp-model=fast
OMP_TARGET_OFFLOAD=MANDATORY ./a.out
Performance Experiments
We present performance results from running the different programs (OpenMP C++, OpenMP Fortran, and SYCL) that call the log function. On the particular Intel® Data Center GPU Max Series used (1-stack only), the performance of the log function in single precision was as follows.
| Version | Time (sec) |
|---|---|
| OpenMP/C++ (std::log) | 93,118 |
| OpenMP/Fortran (log) | 94,342 |
| SYCL (std::log) | 31,835 |
| SYCL (sycl::log) | 31,644 |
| SYCL (sycl::native::log) | 31,684 |

| Version | Time (sec) |
|---|---|
| OpenMP/C++ (std::log) | 93,181 |
| OpenMP/Fortran (log) | 94,467 |
| SYCL (std::log) | 31,657 |
| SYCL (sycl::log) | 32,064 |
| SYCL (sycl::native::log) | 31,452 |

| Version | Time (sec) |
|---|---|
| OpenMP/C++ (std::log) | 92,971 |
| OpenMP/Fortran (log) | 94,444 |
| SYCL (std::log) | 94,592 |
| SYCL (sycl::log) | 94,852 |
| SYCL (sycl::native::log) | 40,778 |

| Version | Time (sec) |
|---|---|
| OpenMP/C++ (std::log) | 35,251 |
| OpenMP/Fortran (log) | 35,787 |
| SYCL (std::log) | 31,314 |
| SYCL (sycl::log) | 32,077 |
| SYCL (sycl::native::log) | 32,141 |
Observations:
In OpenMP (C++ and Fortran): std::log (and the Fortran log intrinsic) follows what the compiler options (-fp-model, -cl-fast-relaxed-math) prescribe. The -cl-fast-relaxed-math option is needed to use native instructions for std::log.
In SYCL: sycl::log may use native instructions, even with just -fp-model=fast.
In SYCL: sycl::native::log always uses native machine instructions. The -cl-fast-relaxed-math option is not needed.
In OpenMP and SYCL: when the -cl-fast-relaxed-math option is specified, native machine instructions are used for the log function on the device.
-fp-model=precise produces more accurate results, but performance will be lower.
References
Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference - Alphabetical Option List
Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference - Xopenmp-target
OpenCL Developer Guide for Intel® Core and Intel® Xeon Processors
SYCL 2020 Specification - SYCL built-in functions for SYCL host and device
The OpenCL Specification, Version 1.2, Khronos OpenCL Working Group