Developer Guide and Reference

ID 767253
Date 10/31/2024
Public

fsycl-pstl-offload

Enables the automatic offloading of C++ standard parallel algorithms to a SYCL device.

Syntax

Linux:

-fsycl-pstl-offload[=arg]

-fno-sycl-pstl-offload

Windows:

/fsycl-pstl-offload[:arg]

/fno-sycl-pstl-offload

Arguments

arg

Is one of the following:

cpu

Tells the compiler to perform offloading to a SYCL CPU device.

gpu

Tells the compiler to perform offloading to a SYCL GPU device.

Default

-fno-sycl-pstl-offload

C++ standard parallel algorithms are not offloaded automatically.

Description

This option enables the automatic offloading of C++ standard parallel algorithms called with the std::execution::par_unseq policy to a SYCL device. The offloaded algorithms are implemented via the oneAPI Data Parallel C++ Library (oneDPL).

If you do not specify arg, the compiler performs offloading to the default SYCL device.

oneDPL is required for offloading support. See the oneDPL documentation for information about how to make it available in the environment.

NOTE:

When using this option, you must also specify option -fsycl.

The following are restrictions, requirements, and limitations when using option fsycl-pstl-offload:

  • Restrictions on callable objects passed to parallel algorithms

    Callable objects passed to parallel algorithms have the same limitations as SYCL kernels:

    • Exceptions are not allowed.

    • Dynamic memory allocation is not allowed.

    • Some APIs from namespace std may be unsupported.

    For the complete list of kernel limitations, see the SYCL 2020 specification.

  • Data placement requirements

    • Only heap memory allocated with the standard C++ allocation facilities can be passed to the standard algorithms for offloading.

    • std::vector can also be used with parallel algorithms for offloading because it allocates its memory dynamically.

    • Memory allocated on the host stack, including std::array and C-style arrays placed on the stack, cannot be used in offloaded parallel algorithms. To work around this, make a "deep copy" of the data by capturing it by value in the algorithm's callable object, or allocate the std::array or C-style array on the heap.

  • Other limitations:

    • Only a subset of standard C++ APIs can be used in callable objects passed to parallel algorithms. For the complete list, see the oneDPL documentation on Tested Standard C++ APIs.

    • Option -fsycl-pstl-offload with the same argument must be applied to all translation units (TUs) in an executable or a dynamic library.
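The data placement requirements above can be illustrated with a short sketch. It is not taken from the oneDPL documentation; the helper names scale_by_table and sum_heap_array are hypothetical, and the build line in the comment assumes the icpx compiler driver. The first helper passes heap-backed std::vector storage to the algorithm and captures a stack-allocated std::array by value, and the second places a std::array on the heap before handing it to the algorithms.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <execution>
#include <memory>
#include <numeric>
#include <vector>

// Assumed build line on Linux (icpx driver is an assumption):
//   icpx -fsycl -fsycl-pstl-offload=gpu file.cpp

// Hypothetical helper: multiplies each element by a coefficient taken
// from a small lookup table that lives on the stack.
std::vector<int> scale_by_table(std::vector<int> data)
{
    // Stack-allocated table: it must not be reached through a pointer
    // or reference from an offloaded algorithm.
    const std::array<int, 4> table{1, 2, 3, 4};

    // The iterators refer to the vector's heap storage, and `table` is
    // captured by value, so the callable carries its own copy.
    std::transform(std::execution::par_unseq,
                   data.begin(), data.end(), data.begin(),
                   [table](int x) {
                       return x * table[static_cast<std::size_t>(x) % table.size()];
                   });
    return data;
}

// Hypothetical helper: places a fixed-size std::array on the heap so
// that its elements can be passed to the algorithms directly.
int sum_heap_array()
{
    auto buf = std::make_unique<std::array<int, 8>>();
    std::fill(std::execution::par_unseq, buf->begin(), buf->end(), 5);
    return std::reduce(std::execution::par_unseq, buf->begin(), buf->end(), 0);
}
```

Built without the offload option, the same code runs on the host, which can be a convenient way to check the results before enabling offloading.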

Performance

If the performance is not satisfactory, the following environment variables may help:

  • Performance of memory allocations may be improved by using the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable.

  • Launch time performance of the algorithms may be improved by using the SYCL_CACHE_PERSISTENT environment variable.

For more information about these environment variables, see Environment Variables on GitHub.

IDE Equivalent

None

Alternate Options

None

Example

The following shows a way to use this option:

#include <algorithm>
#include <vector>
#include <execution>

int main()
{
    std::vector<int> v(1000000);

    // If this code is compiled with -fsycl -fsycl-pstl-offload=gpu, the
    // for_each algorithm is offloaded to the default SYCL GPU device
    // automatically.
    std::for_each(std::execution::par_unseq, v.begin(), v.end(), [](auto& elem)
    {
        // do some computation
    });
}
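As a complement to the example above, the following sketch shows one possible way to write a callable that respects the restrictions listed earlier (no exceptions, no dynamic memory allocation). The names safe_reciprocals, has_invalid, and kInvalid are hypothetical and chosen only for illustration. Instead of throwing on a zero input, the callable writes a sentinel value, and the host checks for it after the algorithm returns.

```cpp
#include <algorithm>
#include <execution>
#include <limits>
#include <vector>

// Hypothetical sentinel marking an element that could not be computed.
// 1000 / x can never equal the minimum int, so it cannot collide with
// a valid result.
constexpr int kInvalid = std::numeric_limits<int>::min();

// Hypothetical helper: computes 1000 / x for every element in place.
std::vector<int> safe_reciprocals(std::vector<int> values)
{
    std::transform(std::execution::par_unseq,
                   values.begin(), values.end(), values.begin(),
                   [](int x) {
                       // No throw and no allocation inside the callable:
                       // an invalid input is reported through the sentinel.
                       return x == 0 ? kInvalid : 1000 / x;
                   });
    return values;
}

// Host-side check that runs after the algorithm has finished. This code
// is not part of the callable, so it may report errors in any way.
bool has_invalid(const std::vector<int>& values)
{
    return std::find(values.begin(), values.end(), kInvalid) != values.end();
}
```

For example, safe_reciprocals({1, 0, 4}) yields 1000, the sentinel, and 250, and has_invalid returns true for that result.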