Find Bugs Quickly Using Sanitizers with the Intel® oneAPI DPC++/C++...

Alina Shadrina, Software Technical Consulting Engineer

Igor Vorobtsov, Sr. Software Technical Consulting Engineer

Jan Fecht, Technical Student Intern

Robert Mueller-Albrecht, Product Marketing Manager

Intel Corporation

In this tutorial, you will learn how to use sanitizers with the Intel® oneAPI DPC++/C++ Compiler to efficiently and accurately detect common issues in your C/C++ code. Some of the features discussed here are also applicable to the Intel® Fortran Compiler, but the discussion will focus primarily on the usage of sanitizers with C/C++.

Sanitizers help identify and pinpoint undesirable or undefined behavior in your code. They are enabled with a compiler option switch that instruments your program, adding additional safety checks to the binary.

As a side effect, there will be performance and file size overhead. The impact on performance and executable size depends on the specific sanitizer used and on the characteristics of the program being analyzed, such as its size, the amount of memory it allocates, and the number of threads it uses. Thus, sanitizers should only be used for debugging and code verification, not in production code.

However, the benefits of using these separate sanitizer-enabled build steps are tremendous. They help to detect and prevent bugs and security vulnerabilities. Their use can be vital for regular software testing in a CI/CD DevOps environment.

They also provide a convenient way for software developers to verify code changes before submitting them to a repository branch.

In fact, sanitizers, as used with LLVM-based compilers like Clang* or the Intel® oneAPI DPC++/C++ Compiler, are fairly lightweight. This is especially true if you compare them with other open-source software testing solutions like Valgrind* or commercial code analyzer solutions for functional testing and coding standards compliance like Parasoft’s Insure++*, PVS Studio*, AbsInt Astrée*, or QA Systems Cantata*. Usually, sanitizers increase execution time by a factor of 2-3, while Valgrind can introduce overheads of up to 100x.

This makes sanitizers quite useful for testing or debugging a program as part of your regular software development flow or for identifying runtime issues that occur late in the execution of a larger application.

If, instead, you compare with more traditional interactive debug approaches like the use of GDB*, there does, of course, remain one drawback. The use of sanitizers requires recompilation of the program. Ideally, if your program depends on other shared libraries, these, too, should be recompiled with sanitizers enabled (except for the standard libc/libc++, of course). The benefit is that the code instrumentation will do the bug-hunting for you.

New with the Intel® oneAPI DPC++/C++ Compiler 2025.0 and Intel® Fortran Compiler 2025.0:
Find and resolve coding issues while you are writing your application. Now even more comprehensive feedback for
• Data Race Detection with ThreadSanitizer
• Memory Access Safety Issue Detection on SYCL accelerator offload code using DeviceSanitizer

New with the Intel® oneAPI DPC++/C++ Compiler 2025.1:
Find and resolve coding issues while you are writing your application. Now even more comprehensive feedback for
• Numerical Stability Sanitizer for C++ Code
• Memory and Address Sanitizers extended to support SYCL and OpenMP Offload Device Code
• Major improvements to Address Sanitizer for Device Code – invalid kernel argument detection, null pointer detection, private memory support for OpenMP Offload

New with the Intel® Fortran Compiler 2025.1:
• Address Sanitizer extended to support OpenMP offload code

In this tutorial, we will take a closer look at the following sanitizers:

AddressSanitizer – detect memory safety bugs
UndefinedBehaviorSanitizer – detect undefined behavior bugs
MemorySanitizer – detect use of uninitialized memory bugs
ThreadSanitizer – detect data races
Numerical Stability Sanitizer – detect numerical (floating point) bugs
Device-Side AddressSanitizer – detect memory safety bugs in SYCL device code
Device-Side MemorySanitizer – detect use of uninitialized memory bugs and memory leaks in the device code

The example source code used throughout this tutorial can be found in the archive file sanitizers-tutorial_2025.1.tar.gz

1. Detecting Memory Safety Bugs With the AddressSanitizer

To demonstrate the different capabilities of sanitizers, we will use a small program that prints the Fibonacci sequence, a sequence in which each number is the sum of the two preceding ones starting with 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …

Code Sample

Let us start with the following code example, which can be found as fibonacci_v1.c inside the tutorial source archive.


#include <stdlib.h>
#include <stdio.h>

/**
 * Fill array arr of length n
 * with the first n fibonacci numbers
 */
void set_fibonacci_list(int *arr, int n) {
  arr[0] = 0;
  arr[1] = 1;

  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}

/**
 * Print the first n fibonacci numbers
 * */
void print_fibonacci(int n) {
  int fibos[n];

  set_fibonacci_list(fibos, n);

  printf("Fibonacci Sequence\n");
  printf("==================\n");
  for (int i = 0; i < n; i++) {
    printf("%d\n", fibos[i]);
  }
  printf("==================\n");
  if (n > 1 && fibos[n-2] != 0) {
    printf("Golden ratio approximation: %g\n", ((double)fibos[n-1])/fibos[n-2]);
  }
}

int main(int argc, char *argv[]) {
  if (argc != 2) {
    printf("Usage: %s NUM\n", argv[0]);
    return 1;
  }

  print_fibonacci(atoi(argv[1]));

  return 0;
}

Figure 1. Initial Fibonacci sequence example source code

This program takes the number of Fibonacci numbers to print as a command-line parameter. It then computes the Fibonacci sequence in the function set_fibonacci_list and prints it to the screen in the function print_fibonacci.

Running the Sanitizer

We will now use the AddressSanitizer to detect potential memory-related bugs in this program. The AddressSanitizer can detect multiple kinds of memory safety bugs, including out-of-bounds accesses on the stack and heap and use-after-free bugs.

To compile the program using the AddressSanitizer, use the following command:


$ icx src/fibonacci_v1.c -O0 -g -fsanitize=address -fno-omit-frame-pointer -o fibonacci_v1_with_asan

The compiler option -fsanitize=address activates the sanitizer.

Flags -O0 -g -fno-omit-frame-pointer are added to get the best diagnostic output in case we indeed find a coding issue, but these options are not mandatory.

Note that -g implicitly sets -O0 and -fno-omit-frame-pointer, so these options are only listed to provide you with the complete set of parameters.

There are additional sanitizer-related flags that you can pass to the command line. Please refer to the Clang Compiler User’s Manual for a complete list.

For comparison, we can also compile a version without the sanitizer:


$ icx src/fibonacci_v1.c -O0 -g -o fibonacci_v1

Now, you can run both executables with some value for N. They both should print the same output:

$ ./fibonacci_v1 10
Fibonacci Sequence
==================
0
1
1
2
3
5
8
13
21
34
==================
Golden ratio approximation: 1.61905
$ ./fibonacci_v1_with_asan 10
Fibonacci Sequence
==================
0
1
1
2
3
5
8
13
21
34
==================
Golden ratio approximation: 1.61905

However, the program contains a bug:

When n < 2, in the set_fibonacci_list function, we assign the initial Fibonacci values to indices that are out of bounds!

Let us try to run the programs with 0 as the argument and see what happens:

$ ./fibonacci_v1 0
Fibonacci Sequence
==================
==================
$ ./fibonacci_v1_with_asan 0
====================================================
==9006==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffd21da5a20 at pc 0x000000506601 bp 0x7ffd21da5990 sp 0x7ffd21da5988                                 
...

This illustrates the power of the AddressSanitizer. Normal program execution did not fail in this example. So, we might have easily missed the bug. In other configurations, the program might have crashed. But the crash could have also happened at a later point. In the worst case, a program does not crash but produces wrong results!

AddressSanitizer, on the other hand, immediately detects the error and aborts the execution, showing a verbose diagnostic report. This report includes:

The type, location and register values of the bug:


dynamic-stack-buffer-overflow on address 0x7ffd21da5a20 at pc 0x000000506601 bp 0x7ffd21da5990 sp 0x7ffd21da5988

A traceback of where the bug occurred (here the debug compiler flags help):

WRITE of size 8 at 0x7ffd21da5a20 thread T0
    #0 0x506600 in set_fibonacci_list /home/user/sanitizer_tutorial/src/fibonacci_v1.c:10:10
    #1 0x5067c1 in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v1.c:24:3
    #2 0x50691e in main /home/user/sanitizer_tutorial/src/fibonacci_v1.c:42:3
    #3 0x7feff05077b2 in __libc_start_main (/lib64/libc.so.6+0x237b2) (BuildId: ade58d86662aceee2210a9ef12018705e978965d)
    #4 0x41eb2d in _start (/home/user/sanitizer_tutorial/fibonacci_v1_with_asan+0x41eb2d)

Valgrind, on the other hand, did not detect the bug in this case. This is because Valgrind's Memcheck tool does not detect out-of-bounds accesses to arrays allocated on the stack:

$ valgrind ./fibonacci_v1 0
==166581== Memcheck, a memory error detector
==166581== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==166581== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==166581== Command: ./fibonacci_v1 0
==166581==
Fibonacci Sequence
==================
==================
==166581==
==166581== HEAP SUMMARY:
==166581==     in use at exit: 0 bytes in 0 blocks
==166581==   total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated
==166581==
==166581== All heap blocks were freed -- no leaks are possible
==166581==
==166581== For lists of detected and suppressed errors, rerun with: -s
==166581== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Fixing the Issue

Let us fix the memory safety bug we detected by adjusting the set_fibonacci_list function:


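void set_fibonacci_list(int *arr, int n) {
  /* arr[0] exists only if the array has at least one element */
  if (n > 0) {
    arr[0] = 0;
  }
  /* arr[1] exists only if the array has at least two elements */
  if (n > 1) {
    arr[1] = 1;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}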

Figure 2. Fix in set_fibonacci_list function for n<2

You can find the new program in src/fibonacci_v2.c inside the tutorial source archive.

After recompilation, we can now check that the bug is gone by rerunning the sanitized version:

$ ./fibonacci_v2_with_asan 0
Fibonacci Sequence
==================
==================

Perfect! We fixed the program.

2. Detecting Undefined Behavior With the UndefinedBehaviorSanitizer (UBSan)

After fixing memory-related bugs in the Fibonacci program, we can now do some more basic manual functional testing with the program.

Observing an Issue

For example, we can try to use larger values for N:

$ ./fibonacci_v2_with_asan 100
Fibonacci Sequence
==================
0
1
...
-889489150
==================
Golden ratio approximation: 9.81579

We can see that the output is wrong: The golden ratio seems very off, and Fibonacci numbers should never be negative!

Running the Sanitizer

To find out what is going wrong, let us now use the UndefinedBehaviorSanitizer (UBSan), a sanitizer that can detect many types of undefined behavior in your program:


$ icx src/fibonacci_v2.c -O0 -g -fsanitize=undefined -fno-omit-frame-pointer -o fibonacci_v2_with_ubsan

Use the -fsanitize=undefined option to enable UBSan. UBSan will catch a set of common undefined behavior types. Please refer to the UndefinedBehaviorSanitizer documentation to learn how to enable checks on different or additional undefined behavior types.

Let us run our sanitizer-enabled binary:

$ ./fibonacci_v2_with_ubsan 100
src/fibonacci_v2.c:17:23: runtime error: signed integer overflow: 1836311903 + 1134903170 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/fibonacci_v2.c:17:23 in
Fibonacci Sequence
==================
0
...

Identifying the Cause

UBSan has successfully identified the problem: The Fibonacci sequence grows quickly and leads to a signed integer overflow, which is undefined behavior according to the C standard.

Similar to the AddressSanitizer, we are getting verbose diagnostic output:

The type of undefined behavior (signed integer overflow)
Additional information about the problem (1836311903 + 1134903170 cannot be represented in type 'int')
The location of the bug (undefined-behavior src/fibonacci_v2.c:17:23)

Note that in contrast to the AddressSanitizer, the program is not aborted on detecting undefined behavior.

3. Detecting Uninitialized Memory Usage with the MemorySanitizer

The MemorySanitizer allows you to catch bugs caused by uninitialized memory usage. You can enable the sanitizer via the -fsanitize=memory flag.

Important Notes on MemorySanitizer Usage:
• The sanitizer does not fail immediately on uninitialized memory reads. It only fails once a branch, syscall, or dynamic call depends directly or indirectly on uninitialized memory.
• All project dependencies should be recompiled with MemorySanitizer. Otherwise, there might be a high number of false positives.

Let us look closer at how the MemorySanitizer can detect coding issues in our program.

One alternative way to fix the integer overflow bug from the previous section is to limit the number of Fibonacci numbers computed (maximum of 47).

Let's say that naively, we add the limitation to our set_fibonacci_list function:


void set_fibonacci_list(int *arr, int n) {
  if (n > 0) {
    arr[0] = 0;
  }
  if (n > 1) {
    arr[1] = 1;
  }
  /* Naive limit: only the first 47 entries are ever written */
  if (n > 47) {
    n = 47;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}

Figure 3. Limit the number of items in set_fibonacci_list function to n=47

You can find the newly updated program in src/fibonacci_v3.c inside the tutorial source archive.

Observing an Issue

Now we can re-compile and re-run the program:

$ icx src/fibonacci_v3.c -O0 -g -fsanitize=undefined -fno-omit-frame-pointer -o fibonacci_v3_with_ubsan

$ ./fibonacci_v3_with_ubsan 100
Fibonacci Sequence
==================
0
1
...
==================
Golden ratio approximation: -0.000100335
...

The good news is that UBSan does not complain anymore, meaning the program no longer contains a signed integer overflow. The bad news is that we still have negative Fibonacci numbers in our list, and the golden ratio approximation is still off.

Another observation is that the output of the program changes from run to run.

This is a hint that there might be some uninitialized memory usage.

Running the Sanitizer

We can use the MemorySanitizer to double-check that. Use the following command to compile the program with the MemorySanitizer:


$ icx src/fibonacci_v3.c -O0 -g -fsanitize=memory -fsanitize-memory-track-origins=2 -fno-omit-frame-pointer -o fibonacci_v3_with_msan

The -fsanitize=memory flag enables the MemorySanitizer. To additionally track from which variable the uninitialized memory was derived, you can optionally pass the -fsanitize-memory-track-origins=2 flag.

Running the memory-sanitizer enabled program yields:

$ ./fibonacci_v3_with_msan 100
Fibonacci Sequence
==================
0
1
1
...
==177412==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4afd9e in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v3.c:36:5
    #1 0x4b03ec in main /home/user/sanitizer_tutorial/src/fibonacci_v3.c:53:3
    #2 0x7f1eac55a7b2 in __libc_start_main (/lib64/libc.so.6+0x237b2) (BuildId: ade58d86662aceee2210a9ef12018705e978965d)
    #3 0x41f2dd in _start (/home/user/sanitizer_tutorial/fibonacci_v3_with_msan+0x41f2dd)

  An uninitialized value was created by an allocation of 'vla' in the stack frame of function 'print_fibonacci'
    #0 0x4af6c0 in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v3.c:29:3

SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/user/sanitizer_tutorial/src/fibonacci_v3.c:36:5 in print_fibonacci
Exiting

Identifying the Cause

We can see that MemorySanitizer reports a use-of-uninitialized-value bug. The reason is that while we only fill the first 47 entries of the Fibonacci array, we still print and use the unassigned (and, thus, uninitialized) array values.

To fix the problem, we should add an extra check to the main function:


  if (atoi(argv[1]) > 47) {
    printf("Please provide a number of 47 or less.\n");
    return 1;
  }

Figure 4. Add check for input parameter for Fibonacci sequence exceeding n=47

You can find the fixed resulting program in src/fibonacci_v4.c inside the tutorial source archive.

Congratulations, you have fixed all bugs in this program!

4. Detecting Data Races with the ThreadSanitizer

The ThreadSanitizer allows you to catch data races in OpenMP and threaded applications. You can enable the sanitizer via the -fsanitize=thread flag.

Important Notes on ThreadSanitizer Usage:
• ThreadSanitizer can detect data races only at runtime. It is therefore recommended to increase code coverage with a wide variety of tests that check different inputs and exercise each code path.
• Just like for MemorySanitizer, all project dependencies should be recompiled with ThreadSanitizer. Otherwise, there might be a high number of false positives.
• Since ThreadSanitizer maps a significant amount of virtual memory (without reserving it), tools like ulimit may not work as usual.
• Static linking with the standard C or C++ libraries (libc/libstdc++) is not supported.
• Non-position-independent executables are not supported.

Code Sample

Let us start with the following code example, which can be found as MergesortOMP inside the tutorial source archive. The MergesortOMP sample is a top-down implementation of a comparison-based sorting algorithm known as merge sort.


void MergeSortOpenMP(int a[], int tmp_a[], int first, int last) {
  if (first < last) {
    int middle = (first + last + 1) / 2;  // = first + (last - first + 1) / 2;
    if (last - first < task_threshold) {
      MergeSort(a, tmp_a, first, middle - 1);
      MergeSort(a, tmp_a, middle, last);
    } else {
#pragma omp task
      MergeSortOpenMP(a, tmp_a, first, middle - 1);
#pragma omp task
      MergeSortOpenMP(a, tmp_a, middle, last);
#pragma omp taskwait
    }
    Merge(a, tmp_a, first, middle, last);
    my_counter++;
  }
}

Figure 5. Initial MergesortOMP example source code

This program has three test modes: it can run the serial implementation, the OpenMP task-based implementation, or both. It prompts the user for the test number. The serial implementation can be used as a reference, while the OpenMP implementation contains a race condition.

Observing an Issue

Let us first compile the code without the sanitizer:


$ icpx -std=c++11 -g -qopenmp -o merge_sort_v1 src/merge_sort_v1.cpp

Now, we can run it in both serial and OpenMP modes. Both runs should report a successful sort:

$ ./merge_sort_v1
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
1

Serial version:
Shuffling the array
Sorting
Sort succeeded in 0.294302 seconds. 

$ ./merge_sort_v1
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
Sort succeeded in 0.0802705 seconds.

Running the Sanitizer

We will now use the ThreadSanitizer to detect potential data race conditions in this program. To compile the program using the ThreadSanitizer, use the following command:


$ icpx -std=c++11 -g -qopenmp -fsanitize=thread -fPIE -pie -o merge_sort_v1_tsan src/merge_sort_v1.cpp

The compiler option -fsanitize=thread activates the sanitizer, and -fPIE -pie flags generate a position-independent executable. Non-position-independent executables have fixed addresses for code and data. ThreadSanitizer’s instrumentation may need to insert extra checks and maintain metadata about memory accesses. This is more easily done when the code is position-independent, allowing it to be loaded at any address.

When the -fsanitize=thread flag is used, and the code is not already position-independent, the compiler will implicitly add the -fPIE flag for compilation and the -pie flag for linking. So, these options are only listed to provide the complete set of parameters.

Although both runs above succeeded, the code contains a data race in the OpenMP implementation, which can be detected using ThreadSanitizer. Let us run the version compiled with sanitizer support in OpenMP mode:

$ TSAN_OPTIONS='ignore_noninstrumented_modules=1:halt_on_error=1:abort_on_error=1'  ./merge_sort_v1_tsan

N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
==================
WARNING: ThreadSanitizer: data race (pid=2193569)
  Write of size 4 at 0x555556a6dd28 by thread T27:
    #0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
    #1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)
    #2 __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1658:9 (libiomp5.so+0xf9623)
    #3 main.extracted /home/MergesortOMP/src/merge_sort_v1.cpp (merge_sort_v1_tsan+0xea4f6)
    #4 __kmp_invoke_microtask <null> (libiomp5.so+0x145602)
    #5 main /home/MergesortOMP/src/merge_sort_v1.cpp:231:1 (merge_sort_v1_tsan+0xea09e)

  Previous write of size 4 at 0x555556a6dd28 by thread T19:
    #0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
    #1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)
    #2 __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1658:9 (libiomp5.so+0xf9623)
    #3 main.extracted /home/MergesortOMP/src/merge_sort_v1.cpp (merge_sort_v1_tsan+0xea4f6)
    #4 __kmp_invoke_microtask <null> (libiomp5.so+0x145602)
    #5 main /home/MergesortOMP/src/merge_sort_v1.cpp:231:1 (merge_sort_v1_tsan+0xea09e)

  Location is global 'my_counter' of size 4 at 0x555556a6dd28 (merge_sort_v1_tsan+0x1519d28)

  Thread T27 (tid=2193609, running) created by main thread at:
    #0 pthread_create <null> (merge_sort_v1_tsan+0x38585)
    #1 __kmp_create_worker /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/z_Linux_util.cpp:974:7 (libiomp5.so+0x146126)

  Thread T19 (tid=2193601, running) created by main thread at:
    #0 pthread_create <null> (merge_sort_v1_tsan+0x38585)
    #1 __kmp_create_worker /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/z_Linux_util.cpp:974:7 (libiomp5.so+0x146126)

SUMMARY: ThreadSanitizer: data race /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 in MergeSortOpenMP(int*, int*, int, int)
==================

Aborted

First, let us take a closer look at the TSAN_OPTIONS environment variable. It consists of three settings:

ignore_noninstrumented_modules=1 helps ThreadSanitizer work with non-instrumented libraries. It will, for example, suppress reports from OpenMP runtime code.
halt_on_error=1 forces the sanitizer to exit after the first reported error. It is particularly useful for debugging multithreaded applications, such as OpenMP programs, because otherwise the sanitizer reports the error for each thread, making the output very hard to read.
abort_on_error=1 makes the sanitizer call abort() instead of _exit() after printing the error report. Only the first process encountering the error condition will be stopped, not the entire application.

Identifying the Cause

Now let us take a closer look at the ThreadSanitizer report:

The sanitizer was able to catch the data race:

WARNING: ThreadSanitizer: data race (pid=2193569)

And the location of the potentially incorrect code is:

#0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
#1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)

So, we should look at lines 112 and 106 of the source code:

…
(105) #pragma omp task
(106) MergeSortOpenMP(a, tmp_a, first, middle - 1);
… 
(112) my_counter++;

The sanitizer is pointing to line 112 at the top of the stack. So, we may need to learn more about my_counter:


(10) int my_counter = 0;

It is a global variable. Since the increment is located in the MergeSortOpenMP function, the issue happens inside a recursive call under the OpenMP task region. This means that multiple threads increment the same global variable without synchronization. If you run the program with only TSAN_OPTIONS='ignore_noninstrumented_modules=1', you will see that the same issue is reported for many threads.

The #pragma omp atomic directive prevents simultaneous updates of my_counter. You can find the fixed program in src/merge_sort_v2.cpp inside the tutorial source archive.

Let us look at the fixed MergeSortOpenMP function:


void MergeSortOpenMP(int a[], int tmp_a[], int first, int last) {
  if (first < last) {
    int middle = (first + last + 1) / 2;  // = first + (last - first + 1) / 2;
    if (last - first < task_threshold) {
      MergeSort(a, tmp_a, first, middle - 1);
      MergeSort(a, tmp_a, middle, last);
    } else {
#pragma omp task
      MergeSortOpenMP(a, tmp_a, first, middle - 1);
#pragma omp task
      MergeSortOpenMP(a, tmp_a, middle, last);
#pragma omp taskwait
    }
    Merge(a, tmp_a, first, middle, last);
#pragma omp atomic
    my_counter++;
  }
}

Figure 6. Fixed MergesortOMP example source code (comments removed)

Now we are ready to compile the fixed version and run it once again under the sanitizer:

$ icpx -std=c++11 -g -qopenmp -fsanitize=thread -fPIE -pie -o merge_sort_v2_tsan src/merge_sort_v2.cpp
$ TSAN_OPTIONS='ignore_noninstrumented_modules=1:suppress_equal_pcs=1' ./merge_sort_v2_tsan
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
Sort succeeded in 0.419602 seconds.

The initial version of this code sample can be found at the oneAPI-samples GitHub repository: oneAPI-samples/DirectProgramming/C++/GraphTraversal/MergesortOMP.

5. Detecting Floating Point Issues with the NumericalStabilitySanitizer

In high-performance computing, numerical stability is crucial for ensuring accurate results, especially in complex calculations. The Numerical Stability Sanitizer (NSAN) helps detect precision issues in C/C++ code. It was introduced with the Intel® oneAPI DPC++/C++ Compiler 2025.1 and was integrated from the open-source community.

Code Sample

Instead of using simple examples like the Fibonacci sequence, we turn to an N-body simulation code, which better represents the challenges of scientific applications. This example simulates a distribution of point masses located at positions ( r_1, …, r_N ) with masses ( m_1, …, m_N ). The simulation computes the positions of these particles after a specified time using finite difference methods for ordinary differential equations. This time, we do not change the code but first run it as is. To activate NSAN for C/C++ code, use the flag -fsanitize=numerical.

Running the Sanitizer

We will now use NSAN to detect potential floating-point issues in this program that are not visible in a regular run. First, build and run the program without the sanitizer:

$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -c GSimulation.cpp -o GSimulation.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -c main.cpp -o main.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -o nbody.x GSimulation.o main.o
$ ./nbody.x 
===============================
 Initialize Gravity Simulation
 nPart = 16000; nSteps = 10; dt = 0.1
------------------------------------------------
 s       dt      kenergy     time (s)    GFlops
------------------------------------------------
 1       0.1     26.405      0.077493    95.806
 2       0.2     313.77      0.079525    93.358
 3       0.3     926.56      0.079509    93.377
 4       0.4     1866.4      0.079574    93.3
 5       0.5     3135.6      0.079415    93.488
 6       0.6     4737.6      0.079538    93.343
 7       0.7     6676.6      0.079612    93.256
 8       0.8     8957.7      0.079643    93.22
 9       0.9     11587       0.079645    93.218
 10      1       14572       0.079611    93.257

# Number Threads     : 1
# Total Time (s)     : 0.7936
# Average Perfomance : 93.307 +- 0.086238
===============================

To compile and link the program using NSAN, use the following commands:

$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -fsanitize=numerical -c GSimulation.cpp -o GSimulation.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -fsanitize=numerical -c main.cpp -o main.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -fsanitize=numerical -o nsan_nbody.x GSimulation.o main.o

It is important to note that sanitizer options must be included in both the compiling and linking steps to include the necessary instrumentation and runtime support into the executable.

The sanitizer immediately detects issues:

$ ./nsan_nbody.x
===============================
 Initialize Gravity Simulation
WARNING: NumericalStabilitySanitizer: inconsistent shadow results while checking return value
float        precision  (native): dec: 0.01714134216308593750  hex: 0x1.18d80000000000000000p-6
double       precision  (shadow): dec: 0.01714138640090823174  hex: 0x1.18d82f80000000000000p-6
shadow truncated to float       : dec: 0.01714138640090823174  hex: 0x1.18d82f80000000000000p-6
Relative error: 0.00025807610457863139% (2^4 epsilons)
Absolute error: 0x1.7c000000000000000000p-25
(24 ULPs == 1.4 digits == 4.6 bits)
    #0 0x00000044c014 in float std::uniform_real_distribution<float>::operator()<std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>>(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) /usr/lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/random.h:2002:11
    #1 0x00000044292c in GSimulation::init_vel() /home/nbody-demo/ver6/GSimulation.cpp:68:27
    #2 0x00000044357b in GSimulation::start() /home/nbody-demo/ver6/GSimulation.cpp:117:3
    #3 0x00000044c194 in main /home/nbody-demo/ver6/main.cpp:43:7
    #4 0x0d688802a1c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #5 0x0d688802a28a in __libc_start_main csu/../csu/libc-start.c:360:3
    #6 0x0000004089a4 in _start (/home/nbody-demo/ver6/nsan_nbody.x+0x4089a4)

Exiting

The NSAN compares results from operations performed with different levels of precision. In this case, it's comparing the float precision used in the code with double precision (shadow). Float precision uses 32 bits, while double uses 64 bits, providing more precision.

Identifying the Best Choice for Floating-Point Precision

The sanitizer has detected a discrepancy between the results of operations performed with different levels of precision:

The value calculated using float precision is 0.01714134216308593750
The value calculated using double precision is 0.01714138640090823174
The double precision value is also shown truncated to float, which remains 0.01714138640090823174

The output also provides the hexadecimal representation of the numbers involved in the comparison. This representation further illustrates how the precision affects the values being compared.

The report provides a call stack trace showing where the issue was detected. The discrepancy originates from a call to std::uniform_real_distribution<float>::operator() in the C++ standard library's random number generation code. The problematic line of source code is within the GSimulation::init_vel() function, located at line 68 in GSimulation.cpp:

(68) particles->vel_z[i] = unif_d(gen) * 1.0e-3f;

The NSAN report suggests that developers should be aware of potential precision issues when using float in calculations that might benefit from the increased accuracy of double. In the file types.cpp, the developer controls the precision by defining real_type as an alias for the desired data type:

(21) typedef float real_type;

Once we change it to double, we can see another NSAN report:

$ ./nsan_nbody.x
===============================
 Initialize Gravity Simulation
 nPart = 16000; nSteps = 10; dt = 0.1
------------------------------------------------
 s       dt      kenergy     time (s)    GFlops
------------------------------------------------
WARNING: NumericalStabilitySanitizer: inconsistent shadow results while checking store to address 0x7ffc57133e00
double       precision  (native): dec: -0.00004225954062234405  hex: -0x1.627fac263d1890000000p-15
__float128   precision  (shadow): dec: -0.00004225977345737984  hex: -0xb.14016136dab71b100000p-18
shadow truncated to double      : dec: -0.00004225977345737984  hex: -0x1.62802c26db56e0000000p-15
Relative error: 0.00055096139127912014% (2^35 epsilons)
Absolute error: 0x1.00013c7ca6c3f0000000p-32
(34360386533 ULPs == 10.5 digits == 35.0 bits)
    #0 0x00000044c6ab in GSimulation::start() /home/nbody-demo/ver6/GSimulation.cpp:184:18
    #1 0x00000045c7dc in main /home/nbody-demo/ver6/main.cpp:43:7
    #2 0x07ea5942a1c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #3 0x07ea5942a28a in __libc_start_main csu/../csu/libc-start.c:360:3
    #4 0x0000004089a4 in _start (/home/nbody-demo/ver6/nsan_nbody.x+0x4089a4)

Exiting

The issue is now reported against line 184 in GSimulation.cpp:

(184) acc_ztile[i-ii] += dz * G * particles->mass[j] * distanceInv * distanceInv * distanceInv;
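The error figures in this report can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and assumes that the "digits" and "bits" figures are the base-10 and base-2 logarithms of the ULP count, which is consistent with the numbers shown in the report.

```cpp
#include <cassert>
#include <cmath>

// Relative error, in percent, between a native result and its shadow value.
double relative_error_percent(double native, double shadow) {
  assert(shadow != 0.0);
  return std::fabs(native - shadow) / std::fabs(shadow) * 100.0;
}

// Number of binary digits covered by an error expressed in ULPs
// (assumption: bits = log2(ULPs)).
double ulps_to_bits(double ulps) {
  assert(ulps > 0.0);
  return std::log2(ulps);
}

// Number of decimal digits covered by an error expressed in ULPs
// (assumption: digits = log10(ULPs)).
double ulps_to_decimal_digits(double ulps) {
  assert(ulps > 0.0);
  return std::log10(ulps);
}
```

Passing the native and shadow values from the report to relative_error_percent gives roughly 0.00055%, and 34360386533 ULPs corresponds to about 35 bits, or about 10.5 decimal digits, in line with the sanitizer output.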

Additionally, we observe that switching from float to double resulted in a 2x performance drop:

$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -c GSimulation.cpp -o GSimulation.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -c main.cpp -o main.o
$ icpx -g -std=c++11 -O2 -xCORE-AVX512 -o nbody.x GSimulation.o main.o
$ ./nbody.x 
===============================
 Initialize Gravity Simulation
 nPart = 16000; nSteps = 10; dt = 0.1
------------------------------------------------
 s       dt      kenergy     time (s)    GFlops
------------------------------------------------
 1       0.1     26.663      0.1794      41.385
 2       0.2     316.72      0.17833     41.632
 3       0.3     935.16      0.17837     41.623
 4       0.4     1883.6      0.17854     41.583
 5       0.5     3164.4      0.17841     41.614
 6       0.6     4781.1      0.17955     41.35
 7       0.7     6737.8      0.17956     41.346
 8       0.8     9039.7      0.17835     41.627
 9       0.9     11693       0.17834     41.63
 10      1       14705       0.17841     41.614

# Number Threads     : 1
# Total Time (s)     : 1.7873
# Average Perfomance : 41.548 +- 0.11643
===============================

Without losing performance, we cannot increase data type precision because floating-point computations naturally involve a tradeoff between precision and performance. Higher precision, such as using double or __float128, can provide more accurate results but often at the cost of increased computational resources and reduced performance. Conversely, using lower precision, like float, can enhance performance but may introduce larger numerical errors.

The discrepancy reported by the sanitizer is very small in relative terms and is typical when comparing double to __float128. We can continue using double since we do not observe specific issues in the application's results. In fact, we might even consider reverting to float if performance is a higher priority and precision requirements are not critical. Alternatively, we can explore ways to reduce accumulated rounding error without widening every data type, such as using Kahan (compensated) summation when accumulating the gravitational forces between particles and updating their accelerations.
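For reference, Kahan summation can be sketched as follows. This is a generic illustration on a plain float array, not code from the nbody sample, and the function names are hypothetical. Note that value-unsafe floating-point optimizations (for example, a fast floating-point model) may simplify the compensation term away, so code like this typically needs a precise floating-point model to be effective.

```cpp
#include <cassert>
#include <cstddef>

// Plain left-to-right summation in float.
float naive_sum(const float* values, std::size_t count) {
  assert(values != nullptr || count == 0);
  float sum = 0.0f;
  for (std::size_t i = 0; i < count; ++i) sum += values[i];
  return sum;
}

// Kahan (compensated) summation in float.
float kahan_sum(const float* values, std::size_t count) {
  assert(values != nullptr || count == 0);
  float sum = 0.0f;
  float compensation = 0.0f;  // running estimate of the low-order bits lost so far
  for (std::size_t i = 0; i < count; ++i) {
    const float y = values[i] - compensation;  // apply the previous correction
    const float t = sum + y;                   // low-order bits of y may be lost here
    compensation = (t - sum) - y;              // recover what was lost in this step
    sum = t;
  }
  return sum;
}
```

The compensated version needs several extra floating-point operations per element, so it trades some throughput for a smaller accumulated rounding error while keeping the storage type unchanged.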

The Numerical Stability Sanitizer helps navigate the tradeoff between precision and performance by providing the relative error between different precision levels. Developers can make informed decisions based on their applications' requirements by analyzing these errors. The sanitizer's feedback might suggest using higher precision types for applications where precision is critical. In contrast, maintaining lower precision might be more appropriate for performance-sensitive applications where slight inaccuracies are acceptable. Ultimately, the choice depends on the application's nature and tolerance for numerical discrepancies.

6. Detecting Memory Safety Bugs in SYCL* Code with the Device-Side AddressSanitizer

The AddressSanitizer, a tool for detecting memory errors in C/C++ code, now includes support for SYCL device code. To activate this feature for the device code, use the flag -Xarch_device -fsanitize=address. The flag -Xarch_host -fsanitize=address should be used to identify memory access problems in the host code. This new SYCL accelerator extension thus provides a Device-Side AddressSanitizer. It is introduced with Intel® oneAPI DPC++/C++ Compiler in version 2025.0.

Note on Device-Side Address Sanitizer Usage:
• Device-side AddressSanitizer can increase private memory usage and trigger a UR_RESULT_ERROR_INVALID_WORK_GROUP_SIZE error. To resolve this, reduce the SYCL local workgroup size.
• Device code AddressSanitizer forces in-order execution of kernels.

Code Sample

Within the tutorial source archive, in the vector-add directory, there is a simple SYCL program designed to perform vector addition on a selected accelerator.

This example is presented in two distinct versions, each illustrating one of the two memory models available in SYCL. The first version uses buffers, container-like memory abstractions whose management is handled entirely by the SYCL runtime.


void VectorAdd(queue &q, const IntVector &a_vector, const IntVector &b_vector,
               IntVector &sum_parallel) {
  range<1> num_items{a_vector.size()};
  buffer a_buf(a_vector);
  buffer b_buf(b_vector);
  buffer sum_buf(sum_parallel.data(), num_items);

  for (size_t i = 0; i < num_repetitions; i++ ) {
    q.submit([&](handler &h) {
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor sum(sum_buf, h, write_only, no_init);
      h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
    });
  }
  // Wait until compute tasks on GPU are done
  q.wait();
}

Figure 7. vector-add-buffers example source code

The second model employs Unified Shared Memory (USM), which relies on malloc-style allocation functions such as malloc_shared and offers developers greater control.


void VectorAdd(queue &q, const int *a, const int *b, int *sum, size_t size) {
  range<1> num_items{size+12};
  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  e.wait();
}

Figure 8. vector-add-usm example source code

Exploring the USM-Based Vector Addition Example

Let's begin by examining the USM code. As usual, we will first compile the code without the sanitizer to observe any potential issues up front and get an idea of what we may be looking for.


$ icpx -fsycl -g -O0 vector-add-usm_v1.cpp -o vector-add-usm_v1

Now we can run it on a GPU using the Intel® oneAPI Level Zero backend:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
An exception is caught while adding two vectors.
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 11 (UR_RESULT_ERROR_INVALID_MEM_OBJECT)

Although the printed results appear correct, the program terminates with a runtime exception. To uncover and diagnose this issue, we will proceed to the next step.

Running the Sanitizer

We will now use the DeviceSanitizer to detect potential memory issues in this program. To compile the program with the DeviceSanitizer support, use the following command:


$ icpx -fsycl -g -O0 -Xarch_device -fsanitize=address vector-add-usm_v1.cpp -o vector-add-usm_v1_asan

The compiler option -Xarch_device -fsanitize=address activates the sanitizer for device code analysis. With the compilation complete, we can execute the binary to check for any memory issues:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000

====ERROR: DeviceSanitizer: out-of-bounds-access on Shared USM
READ of size 4 at kernel <typeinfo name for VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::'lambda'(auto)> LID(0, 0, 0) GID(10000, 0, 0)
  #0 auto VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::'lambda'(auto)::operator()<sycl::_V1::item<1, true>>(auto) const  /home/vector-add-usm_v1.cpp:62

The output confirms that the program is running on an Intel(R) Data Center GPU Max 1550 with a vector size of 10,000. However, the DeviceSanitizer has detected an out-of-bounds access error on Shared USM at vector-add-usm_v1.cpp:62.

With this information, we are now ready to investigate further and diagnose the root cause of the memory error reported by the DeviceSanitizer.

Identifying the Cause

To identify the cause of the memory error reported by the DeviceSanitizer, we must closely examine the error message's details and review the code's relevant sections. The error message provides us with specific clues: it indicates an out-of-bounds access during a read operation, and it points to the exact location in the code where the issue occurred:


(62)  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });

The code includes both kernel and host segments. An out-of-bounds error in this scenario might arise due to various factors, but two primary reasons are most common:

1. Incorrect Memory Allocation

If the arrays a, b, or sum were not allocated with the correct size before calling this kernel, any access beyond the allocated size would result in out-of-bounds access. Let us check this hypothesis and look at the allocations in the main function:


    int *a = malloc_shared<int>(array_size, q);
    int *b = malloc_shared<int>(array_size, q);
    int *sum_sequential = malloc_shared<int>(array_size, q);
    int *sum_parallel = malloc_shared<int>(array_size, q);

However, all four arrays are allocated with the same size, array_size, and the same value is passed to the kernel wrapper:


VectorAdd(q, a, b, sum_parallel, array_size);

2. Incorrect Range Specification

If num_items is larger than the number of elements allocated for a, b, or sum, the kernel will attempt to access elements beyond the allocated memory, leading to an out-of-bounds error. In this code, the range is specified by num_items:


range<1> num_items{size+12};
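The effect of such a range can be worked out on the host without any SYCL code. The sketch below uses hypothetical names (RangeCheck, check_launch_range, require_launch_range_fits) that are not part of the tutorial sources; it compares the number of launched work-items with the number of allocated elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: outcome of comparing a launch range with an allocation.
struct RangeCheck {
  bool in_bounds;                   // true if every work-item index is valid
  std::size_t first_invalid_index;  // meaningful only when in_bounds is false
  std::size_t invalid_count;        // number of work-items outside the allocation
};

// Compares the number of launched work-items with the number of allocated elements.
RangeCheck check_launch_range(std::size_t range_items, std::size_t allocated_items) {
  if (range_items <= allocated_items) return {true, 0, 0};
  // Valid indices are 0 .. allocated_items - 1, so the first invalid one is allocated_items.
  return {false, allocated_items, range_items - allocated_items};
}

// Debug-build guard that could be placed in host code before a kernel launch.
inline void require_launch_range_fits(std::size_t range_items, std::size_t allocated_items) {
  assert(check_launch_range(range_items, allocated_items).in_bounds);
}
```

With size = 10000 and a range of size+12, the first invalid index is 10000, which matches GID(10000, 0, 0) in the sanitizer report, and 12 work-items in total access memory outside the allocations.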

And this is exactly the bug. Let’s fix it and run with the sanitizer one more time:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan 
====ERROR: DeviceSanitizer: bad-free on address 0x387ae30
 #0 in main /home/vector-add-usm_v1.cpp:153:5
 #1 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xe5f3622a1ca)
 #2 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xe5f3622a28b)
 #3 in _start (./vector-add-usm_v1_asan+0x4034e5)

0x387ae30 may be allocated on Host Memory

A new bug is detected in vector-add-usm_v1.cpp:153:


(153) free(c, q);

To address this issue, we must examine how the array c was allocated:


(110) int *c = new int[array_size];

Since c is allocated using the C++ new[] operator rather than a SYCL memory allocation function, it should be deallocated using the corresponding C++ delete[] operator, not the SYCL free function.
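For host-only arrays, one way to rule out this kind of mismatch is to let a standard smart pointer own the memory. The following sketch uses a hypothetical helper, make_host_sequence, which is not part of the tutorial sources and only illustrates the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocates a host-only array and fills it with 0, 1, 2, ...
// std::unique_ptr<int[]> releases the memory with delete[] automatically,
// so no manual (and possibly mismatched) deallocation call is needed.
std::unique_ptr<int[]> make_host_sequence(std::size_t count) {
  assert(count > 0);
  std::unique_ptr<int[]> data(new int[count]);
  for (std::size_t i = 0; i < count; ++i) {
    data[i] = static_cast<int>(i);
  }
  return data;
}
```

Because the array is released when the owning pointer goes out of scope, there is no explicit deallocation call left in which the wrong function could be used.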

After fixing the issue, we recompile and see another bad-free:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998

====ERROR: DeviceSanitizer: bad-free on address 0xc433b840400
 #0 in main /home/vector-add-usm_v1.cpp:155:5
 #1 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a1ca)
 #2 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a28b)
 #3 in _start (./vector-add-usm_v1_asan+0x4034e5)

0xc433b840400 is located inside of Shared USM region [0xc433b820400, 0xc433b82a040)
allocated here:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:112:25
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)

An exception is caught while adding two vectors.
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 45 (UR_RESULT_ERROR_INVALID_ARGUMENT)

Upon reviewing the code in vector-add-usm_v1.cpp:112, we observe the following lines:

(111)     int *sum_sequential = malloc_shared<int>(array_size, q);
(112)     int *sum_parallel = malloc_shared<int>(array_size, q);
...
(155)     free(sum_sequential, q2);

The sum_sequential array is allocated using the malloc_shared function with the queue q, but the code attempts to deallocate it using a different queue, q2. This matters because USM allocations are tied to a SYCL context: memory allocated in one context should be freed in the same context to ensure proper resource management and avoid errors.

Fixing the deallocation of sum_sequential is important, but it is not the last issue that needs to be resolved. After rebuilding, we run the program again:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan 
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

====ERROR: DeviceSanitizer: detected memory leaks of Shared USM
Direct leak of 40000 byte(s) at 0x12f842280400 allocated from:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:108:14
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)


====ERROR: DeviceSanitizer: detected memory leaks of Shared USM
Direct leak of 40000 byte(s) at 0x12f8422a0400 allocated from:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:107:14
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)

The output now indicates ‘Vector add successfully completed on device’, so the kernel completed successfully, but there seem to be two more bugs in the code. The sanitizer detected two memory leaks related to vector-add-usm_v1.cpp:107 and vector-add-usm_v1.cpp:108.

The corresponding lines are:

int *a = malloc_shared<int>(array_size, q);
int *b = malloc_shared<int>(array_size, q);

These allocations are the source of the leaks. The most common cause of memory leaks is simply forgetting to free the memory allocated with malloc_shared, malloc_device, or malloc_host. In SYCL, memory allocated with these functions must be explicitly deallocated using the corresponding sycl::free function. The end of the program currently reads:

free(sum_sequential, q);
free(sum_parallel, q);

The code currently lacks the deallocation for a and b. By including the appropriate free calls, the code executes as intended. The corrected code version can be found in the source file vector-add-usm_v2.cpp.
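Both the missing free calls and the earlier deallocation through the wrong queue can be avoided structurally by tying each allocation to an owner that remembers where the memory came from. The sketch below illustrates the idea in plain C++ with a hypothetical MockContext class standing in for a SYCL queue or context. In real SYCL code, allocate would correspond to sycl::malloc_shared<int>(count, q) and release to sycl::free(p, q); everything else here is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <set>

// Hypothetical stand-in for a SYCL queue/context: it hands out memory and
// remembers which pointers it owns, so leaks and foreign frees become visible.
class MockContext {
 public:
  int* allocate(std::size_t count) {
    int* p = static_cast<int*>(std::malloc(count * sizeof(int)));
    assert(p != nullptr);
    live_.insert(p);
    return p;
  }
  void release(int* p) {
    assert(live_.count(p) == 1 && "pointer was not allocated in this context");
    live_.erase(p);
    std::free(p);
  }
  std::size_t live_allocations() const { return live_.size(); }

 private:
  std::set<int*> live_;
};

// Deleter that remembers the context used for the allocation, so the memory
// is always returned to the context it came from.
struct ContextDeleter {
  MockContext* context;
  void operator()(int* p) const { context->release(p); }
};

using ContextArray = std::unique_ptr<int, ContextDeleter>;

// Allocates `count` ints in `context`; the memory is released automatically
// when the returned owner goes out of scope.
ContextArray make_context_array(MockContext& context, std::size_t count) {
  return ContextArray(context.allocate(count), ContextDeleter{&context});
}
```

With this pattern, arrays such as a and b would be released when their owners go out of scope, and the release always goes through the same context object that performed the allocation.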

Congratulations, all the issues in vector-add based on USM are fixed.

Exploring the Buffer-Based Vector Addition Example

In addition, we can examine the version of this example that utilizes buffers. The kernel invocation from vector-add-buffers_v1.cpp was shown earlier in Figure 7. We will now compile and execute the code on an Intel GPU, using the Level Zero backend.

$ icpx -fsycl -g -O0 vector-add-buffers_v1.cpp -o vector-add-buffers_v1
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-buffers_v1
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

While the code ran without any runtime failures and provided correct results, hidden memory issues might have gone undetected with the present compiler and driver version, and these could cause crashes with future versions of the compiler or runtime environment.

Rebuilding and running with the device-side AddressSanitizer uncovers an out-of-bounds memory access in the buffer:

$ icpx -fsycl -g -O0 -Xarch_device -fsanitize=address vector-add-buffers_v1.cpp -o vector-add-buffers_v1_asan
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-buffers_v1_asan 

Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10030

====ERROR: DeviceSanitizer: out-of-bounds-access on Memory Buffer
READ of size 4 at kernel <typeinfo name for VectorAdd(sycl::_V1::queue&, std::vector<int, std::allocator<int>> const&, std::vector<int,er&)::operator()(sycl::_V1::handler&) const::'lambda'(auto)> LID(4, 0, 0) GID(10000, 0, 0)
#0 auto VectorAdd(sycl::_V1::queue&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std:r&) const::'lambda'(auto)::ope /home/vector-add-buffers_v1.cpp:87

The line in question is 87:


(87) h.parallel_for(task_size, [=](auto i) { sum[i] = a[i] + b[i]; });

The task size is defined in line 72 as:


(72) auto task_size = num_items + 2*(i + 1);

The root cause of the issue is that the task size is incorrectly calculated; it should be exactly num_items, which is determined by the size of the array:


(59) range<1> num_items{a_vector.size()};

After correcting the task size to match num_items, the vector addition code based on buffers will function correctly. The correct code version can be found in vector-add-buffers_v1.cpp in the source code archive.

We have explored the role sanitizers play in identifying and resolving memory issues within both USM and buffer-based SYCL code, ensuring error-free parallel computing applications.

7. Detecting Uninitialized Memory and Memory Leaks in SYCL* and OpenMP* Code with the Device-Side MemorySanitizer

The MemorySanitizer, a tool for detecting the use of uninitialized memory in C/C++ code, now supports SYCL device code. To activate this feature for device code, compile with the flag -Xarch_device -fsanitize=memory and set the environment variable UR_ENABLE_LAYERS=UR_LAYER_MSAN at run time. The flag -Xarch_host -fsanitize=address should be used to identify memory access problems in the host code. This new SYCL accelerator extension thus provides a Device-Side MemorySanitizer. It is introduced with Intel® oneAPI DPC++/C++ Compiler in version 2025.1.

Note on Device-Side Memory Sanitizer Usage:
• Device-side MemorySanitizer is limited to USM device allocations

Code Sample

In the final step of "Detecting Memory Safety Bugs in SYCL Code" in chapter 6, we obtained a revised version of the file, vector-add-usm_v2.cpp, in which the issues found by the Device-Side AddressSanitizer have been fixed. We will continue to use this updated file for further investigation, as it still contains suspicious memory issues that require closer examination.

$ icpx -fsycl -g -O0 -Xarch_device -fsanitize=memory -fno-omit-frame-pointer vector-add-usm_v2.cpp -o vector-add-usm_v2_msan.exe
$ UR_ENABLE_LAYERS=UR_LAYER_MSAN ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v2_msan.exe

====WARNING: DeviceSanitizer: use-of-uninitialized-value
use of size 4 at kernel <typeinfo name for VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::VectorAddCoeff> LID(0, 0, 0) GID(672, 0, 0)
#0 auto VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::'lambda0'(auto)::operator()<sycl::_V1::item<1, true>>(auto) const /home/vector-add-usm_v2.cpp:67

We have discovered the source of the problem: the array c was not initialized. Initializing c before its use is crucial to prevent undefined behavior and ensure the program functions correctly. Note that c is a device USM allocation and should therefore be initialized differently than a and b. To achieve this, we can copy values from b to c using the queue's memcpy member function:

q.memcpy(c, b, array_size * sizeof(int)).wait();

The finalized version of the USM code is available in the file vector-add-usm_v3.cpp.

Congratulations, all the issues in vector-add based on USM are fixed.

Summary and Next Steps

In this tutorial, we introduced you to the fundamentals of using sanitizers with the Intel oneAPI DPC++/C++ Compiler and Intel Fortran Compiler. Sanitizers help you catch multiple bugs in a simple program.

Using sanitizers can effectively catch issues early in the development process, saving time and reducing the likelihood of costly errors in production code.

Accelerate your software development and speed up your application's code correctness and functional safety testing by catching issues early. The Intel oneAPI DPC++/C++ Compiler and Intel Fortran Compiler enable you to do this for CPU and GPU accelerator offload code, giving you the confidence you desire to go into validation and production.

Useful Resources

Here are some detailed resources for you to explore the oneAPI DPC++/C++ Compiler:

Get The Software

You can install the Intel® oneAPI DPC++/C++ Compiler as a part of the Intel® oneAPI Base Toolkit or the Intel® oneAPI HPC Toolkit. You can also download a standalone version of the compiler or test it across Intel® CPUs and GPUs on the Intel® Developer Cloud platform.
