Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference

ID 767253
Date 3/22/2024
Public

Intel® IEEE 754-2008 Binary Floating-Point Conformance Library and Usage

The Intel® IEEE 754-2008 Binary Floating-Point Conformance Library provides all operations mandated by the IEEE 754-2008 standard for binary32 and binary64 binary floating-point interchange formats. The minimum requirements for correct operation of the library are an Intel® Pentium® 4 processor and an operating system supporting Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions.

The library supports all four rounding-direction attributes mandated by the IEEE 754-2008 standard for binary floating-point arithmetic: roundTiesToEven, roundTowardPositive, roundTowardNegative, and roundTowardZero. The additional rounding-direction attribute, roundTiesToAway, is not required by the standard for binary formats and is therefore not fully supported by this library. The default rounding-direction attribute is roundTiesToEven.

The library also supports all mandated exceptions (invalid operation, division by zero, overflow, underflow, and inexact) and sets flags accordingly under default exception handling. Alternate exception handling, which is optional in the standard, is not supported.

The bfp754.h header file includes prototypes for the library functions. For a complete list of the available functions, refer to the Function List. To use the library, you must also specify the linker option -lbfp754 and the floating-point semantics control option -fp-model strict.

Note: The libbfp754 library is not available for SYCL.

Many routines in the libbfp754 Library are more optimized for Intel® microprocessors than for non-Intel microprocessors.

Operations

The IEEE 754-2008 standard defines four types of operations:

  1. General-computational operations that produce correctly rounded floating-point or integer results. These operations might signal the floating-point exceptions.
  2. Quiet-computational operations that produce floating-point results. These operations do not signal any floating-point exceptions.
  3. Signaling-computational operations that produce no floating-point results. These operations might signal floating-point exceptions.
  4. Non-computational operations that produce no floating-point results. These operations do not signal floating-point exceptions.

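|                            | Produce result        | Produce no result       |
|----------------------------|-----------------------|-------------------------|
| Might signal FP exception  | General-computational | Signaling-computational |
| Do not signal FP exception | Quiet-computational   | Non-computational       |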

The standard also distinguishes among operations by their floating-point operand formats and result format for general-computational operations:

  1. Homogeneous general-computational operations whose floating-point operands and floating-point result are in the same format.

  2. formatOf general-computational operations whose floating-point operands and floating-point result have different formats.

    NOTE:

    The IEEE 754-2008 standard requires that all formatOf general-computational operations be computed without any loss of precision before converting to the destination format. This may differ from how these operations are implemented on most hardware and software.

    For example, when all operands are in binary64 format and the destination format is binary32, most hardware and software implementations first compute an intermediate result rounded to binary64 and then convert that intermediate result to binary32. Under certain rounding modes, this double rounding procedure may produce a result different from the one defined in the standard. For example:

    x = 0x3ff0000010000000 = 1.000000000000000000000001_2
    y = 0x3ca0000000000000 = 1.0_2*2^(-53)
    x+y = 1.00000000000000000000000100000000000000000000000000001_2

    When the rounding-direction attribute is roundTiesToEven, the double rounding procedure first rounds the sum to 1.000000000000000000000001_2 (0x3ff0000010000000) in binary64, because the exact sum lies halfway between two binary64 values and the tie goes to the even significand. That intermediate value is again a tie in binary32 and rounds to 1 (0x3f800000). According to the standard, however, the exact sum lies above the binary32 halfway point, so it should round to 1.00000000000000000000001_2 (0x3f800001) in binary32.
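The difference between the two procedures can be reproduced in plain C. The sketch below is not how the library is implemented. It swaps in a well-known technique, rounding the binary64 intermediate to odd before the final conversion, to obtain a single-rounding result for comparison. Both function names are hypothetical, and the code assumes roundTiesToEven, finite operands, no overflow, and strict binary64 evaluation (for example, -fp-model strict).

```c
#include <stdint.h>
#include <string.h>

/* Double rounding: round the sum to binary64 first, then to binary32. */
float add_double_rounding(double x, double y)
{
    volatile double sum64 = x + y;
    return (float)sum64;
}

/* Hypothetical single-rounding sketch: binary32 sum of two binary64
   operands, using round-to-odd on the binary64 intermediate. */
float add_single_rounding(double x, double y)
{
    volatile double s = x + y;       /* sum rounded to binary64 */
    volatile double yv = s - x;      /* TwoSum: recover the rounding error */
    volatile double xv = s - yv;
    double err = (x - xv) + (y - yv);  /* exactly (x + y) - s */
    double sum = s;
    uint64_t bits;

    memcpy(&bits, &sum, sizeof bits);
    if (err != 0.0 && (bits & 1u) == 0) {
        /* s is inexact and its significand is even: step one ulp toward
           the exact sum, which leaves the last bit odd (a sticky bit). */
        if ((err > 0.0) == (sum > 0.0))
            bits += 1;   /* exact sum is larger in magnitude */
        else
            bits -= 1;   /* exact sum is smaller in magnitude */
        memcpy(&sum, &bits, sizeof sum);
    }
    return (float)sum;   /* the only rounding that affects the result */
}
```

With x and y from the example above, add_double_rounding returns 1 (0x3f800000), while add_single_rounding returns the value with bit pattern 0x3f800001, matching the result the standard requires.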

Data Types

The following table correlates the names of the formats used in defining operations in the standard with their C99 types used in this library.

| Format Name              | Definition                                                                                            | C99 Type                                                 |
|--------------------------|-------------------------------------------------------------------------------------------------------|----------------------------------------------------------|
| binary32                 | IEEE 754-2008 binary32 interchange format                                                             | float                                                    |
| binary64                 | IEEE 754-2008 binary64 interchange format                                                             | double                                                   |
| int                      | Integer operand formats                                                                               | int, unsigned int, long long int, unsigned long long int |
| int32                    | Signed 32-bit integer                                                                                 | int                                                      |
| uint32                   | Unsigned 32-bit integer                                                                               | unsigned int                                             |
| int64                    | Signed 64-bit integer                                                                                 | long long int                                            |
| uint64                   | Unsigned 64-bit integer                                                                               | unsigned long long int                                   |
| boolean                  | Boolean value represented by generic integer type                                                     | int                                                      |
| enum                     | Enumerated values of floating-point class                                                             | int                                                      |
| enum                     | Enumerated values of floating-point radix                                                             | int                                                      |
| logBFormat               | Type for the destination of the logB operation and the scale exponent operand of the scaleB operation | int                                                      |
| decimalCharacterSequence | Decimal character sequence                                                                            | char*                                                    |
| hexCharacterSequence     | Hexadecimal-significand character sequence                                                            | char*                                                    |
| exceptionGroup           | Set of exceptions as a set of booleans                                                                | int                                                      |
| flags                    | Set of status flags                                                                                   | int                                                      |
| binaryRoundingDirection  | Rounding direction for binary                                                                         | int                                                      |
| modeGroup                | Dynamically-specifiable modes                                                                         | int                                                      |
| void                     | No explicit operand or result                                                                         | void                                                     |
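The mapping above relies on the C types having particular widths and precisions. The following sketch shows one way to check that on a given platform, using only <float.h> and <limits.h>. The function name c_types_match_table is hypothetical and is not part of the library.

```c
#include <float.h>
#include <limits.h>

/* Hypothetical helper: returns 1 if this platform's C types have the
   widths and precisions that the format table assumes, 0 otherwise. */
int c_types_match_table(void)
{
    return FLT_RADIX == 2
        /* binary32: 32 bits, 24-bit significand */
        && sizeof(float) * CHAR_BIT == 32
        && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128
        /* binary64: 64 bits, 53-bit significand */
        && sizeof(double) * CHAR_BIT == 64
        && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024
        /* int32 / uint32 map to int / unsigned int */
        && sizeof(int) * CHAR_BIT == 32
        /* int64 / uint64 map to long long int / unsigned long long int */
        && sizeof(long long) * CHAR_BIT == 64;
}
```

A return value of 1 means float and double have the binary32 and binary64 parameters and that int and long long int are 32 and 64 bits wide, as the table assumes.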

Use the Intel® IEEE 754-2008 Binary Floating-Point Conformance Library

To use the library, include the header file, bfp754.h, in your program.

Note: You cannot use this library in SYCL kernels.

The following example program illustrates the use of the library on Linux* OS.

//binary.c
#include <stdio.h>
#include <bfp754.h>

int main(void) {
  double a64, b64;
  float c32;

  a64 = 1.000000059604644775390625;                  // 1 + 2^-24 (0x3ff0000010000000)
  b64 = 1.1102230246251565404236316680908203125e-16; // 2^-53 (0x3ca0000000000000)

  // formatOf addition: binary64 operands, rounded once to binary32
  c32 = __binary32_add_binary64_binary64(a64, b64);
  printf("The addition result using the library: %8.8f\n", c32);

  // Ordinary C addition: rounded to binary64 first, then to binary32
  c32 = a64 + b64;
  printf("The addition result without the library: %8.8f\n", c32);
  return 0;
}

To compile binary.c, use the command:

icx -fp-model strict binary.c -lbfp754

The output of a.out will look similar to the following:

The addition result using the library: 1.00000012
The addition result without the library: 1.00000000
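The decimal constants in binary.c are the operands x and y from the double-rounding note in the Operations section, and the two printed values correspond to the binary32 encodings 0x3f800001 and 0x3f800000. A small sketch for confirming this by inspecting bit patterns follows. The helper names are hypothetical and use only standard C.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw binary64 encoding of a double. */
uint64_t binary64_bits(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);   /* well-defined type punning */
    return bits;
}

/* Hypothetical helper: return the raw binary32 encoding of a float. */
uint32_t binary32_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

For instance, binary64_bits(1.000000059604644775390625) yields 0x3ff0000010000000 and binary64_bits(1.1102230246251565404236316680908203125e-16) yields 0x3ca0000000000000. Passing the float results of the two additions to binary32_bits makes the one-bit difference between them visible even when a shorter printf format would hide it.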