This document summarizes new and changed product features and includes notes about features and problems not described in the product documentation.
Where to Find the Release
Download the toolkit from the Base Toolkit Download page, then follow the installation instructions to install it.
oneAPI 2022.3.1, Compiler Release 2022.2.1
Intel® oneAPI DPC++/C++ Compiler 2022.2.1 has been updated to include functional and security updates. Users should update to the latest version as it becomes available.
oneAPI 2022.3, Compiler Release 2022.2.0
New Features and Improvements
- The OpenCL CPU runtime (OCL RT) loads the TBB library from the full path stored in the Windows registry as the preferred option, or reads the TBB path from the configuration file as a secondary option. Users of the oneAPI installer are not affected, because the oneTBB installer package in oneAPI installs the TBB registry key. Users of the Compiler CONDA package are not affected either, because the TBB path is configured automatically in the configuration file. In other scenarios, however, the OpenCL CPU runtime fails to load TBB, because no TBB registry key is installed and the TBB path in the configuration file is not set by default. The solution is:
  - Configuration files are located in the same folder as the OCL RT file (intelocl64.dll). Select and edit the configuration file for the corresponding device, and make sure you have already configured the targeted device; otherwise, OCL RT loads two devices but uses the first one by default.
    - cl.cfg for the CPU device.
    - cl.fpga_emu.cfg for the FPGA emulator.
  - At the bottom of the file, there is a field named "CL_CONFIG_TBB_DLL_PATH" to which you can add your TBB DLL path. For example: CL_CONFIG_TBB_DLL_PATH = c:\your_TBB_install_path
  - Be aware of the following when configuring cl.cfg and cl.fpga_emu.cfg:
    - If you configure both cl.cfg and cl.fpga_emu.cfg, the TBB location must be the same in both files.
    - Do not put quotation marks around the value of CL_CONFIG_TBB_DLL_PATH.
- When using 'dpcpp' as the compiler driver on Windows, the expectation is for Linux command line compatibility and behaviors. An update has been performed for 'dpcpp' which improves the Linux compatibility and may impact expected behaviors if using 'dpcpp' for MSVC compatible command lines. When MSVC-compatible command line behaviors are desired, please use 'dpcpp-cl' as the compiler driver.
- Added support for the OpenMP SIMD IF clause.
- Added initial support of -lname processing when searching for fat static libraries.
- Added -fsycl-fp32-prec-sqrt flag which enables correctly rounded sycl::sqrt.
- Added support for [[intel::loop_count()]] attribute.
- Added support for passing driver options to JIT compiler and linker.
- Added default argument support for work_group_size_hint attribute.
- Added -f[no-]sycl-device-lib-jit-link option to control JIT linking of SYCL device libraries.
- Added support for the new FPGA attribute [[intel::fpga_pipeline(N)]] for loop pipelining.
- Added support for sycl_ext_oneapi_properties extension.
- Added a mode for the Level Zero plugin where only the last command in each batch yields a host-visible event. Enabled this mode by default.
- Added support for an experimental Level Zero API for host pointer import into USM. The feature can be enabled using the SYCL_USM_HOSTPTR_IMPORT environment variable.
- Added support for the wi_element for bf16 type.
- Added complex support for the reduce and scan group algorithms.
- Added the SYCL_RT_WARNING_LEVEL environment variable, which controls the amount of warnings and performance hints the runtime library may print.
- Added support for USM buffer location properties, which allow specifying the memory location in which a device USM allocation should be placed.
- Added support for buffer_location property to the sycl::buffer.
- Added single_task support for ESIMD_EMULATOR backend.
- Added support for SVM 1,2,4-elements gather/scatter for ESIMD.
- Added support for round-robin submissions to multiple compute CCS for the Level Zero backend. Disabled by default, can be controlled using SYCL_PI_LEVEL_ZERO_USE_COMPUTE_ENGINE.
- Added support for buffer migration for contexts with multiple devices in the Level Zero plugin.
- Added mode where the Level Zero plugin uses immediate command lists instead of standard command lists. This mode is disabled by default and can be enabled using the SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS environment variable.
- Added reduction overloads accepting span.
- Added LSC support for ESIMD_EMULATOR backend.
- Added half type support for __esimd_convertvector_to/from.
- Added support for the USM buffer location property in malloc_shared.
- Added support for the USM buffer location property in malloc_host.
- Added support for memory intrinsics for the ESIMD_EMULATOR plugin.
- Added support for named barrier APIs for ESIMD.
- Added support for DPAS API for ESIMD.
- Added support for LSC memory access APIs for ESIMD.
- Added support for the invoke_simd feature.
- Added support for info::device::atomic64 for OpenCL and Level Zero backends.
- Added support for sycl_ext_oneapi_usm_device_read_only extension.
- Added support for mapping/unmapping operations for the ESIMD_EMULATOR plugin.
- Added support for make_buffer API for the Level Zero backend.
- Added missing +-*/ operations for half.
- Added ext_intel_global_host_space in accordance with sycl_ext_intel_usm_address_spaces extension.
- Added aspect for bfloat16.
- Introduced "Intel math functions" device library with support of type cast util functions for float, double, and integer types.
- Added bfloat16 support for joint_matrix.
- Added sycl_ext_oneapi_complex_algorithms extension.
- Added sycl_ext_oneapi_group_load_store extension.
- Added validation rules to the SPIR-V extension SPV_INTEL_global_variable_decorations.
- Added SYCL_INTEL_buffer_location extension to support buffer_location property for USM allocations.
- Added experimental latency control API into SYCL_INTEL_data_flow_pipes.
- Added bfloat16 support for the fma, fmin, fmax and fabs SYCL floating point math functions in the sycl_ext_oneapi_bfloat16 extension.
- Implemented property set generation for device globals in the sycl-post-link tool. Added the --device-globals command-line argument for lowering and generating information about device global variables.
- Introduced XPTI-based tools for SYCL applications: sycl-trace, sycl-prof, sycl-sanitize.
- Added support for tracing Level Zero API calls using XPTI and updated the sycl-trace tool to display both PI and Level Zero calls simultaneously.
- Added a diagnostic on an attempt to use zero-length arrays in the device code.
- Added support for consuming fat objects containing SPIR-V.
- Added support for generating SPIR-V-based fat objects.
- Added a diagnostic on an attempt to use -fsycl and -static-libstdc++ together. This combination is not supported due to the runtime dependence on libsycl.so.
- Added support for atomic loads and stores with various memory orders and scopes.
- Improved performance of accessing memory pointed to by sycl::accessor for FPGA device.
- Improved deferred diagnostics for usages within function templates in device code.
- Added support for sycl_special_class attribute to mark SYCL classes/structs that need additional compiler handling.
- Improved driver to do device section checking only when offloading is enabled.
- Allowed calls to constant expression function pointers in device code.
- Disabled passing of code coverage and profiling options to device compilation.
- Added clang support of code location information for kernels.
- Disallowed explicit casts between mismatching address spaces.
- Added support of [[sycl::device_has]] attribute on the kernel.
- Added a warning on the explicit cast from default address space to a named address space.
- Added a warning for converting 'c' input to 'c++' in SYCL mode.
- Silenced unknown attribute warnings on host compilation.
- Added a diagnostic on an attempt to use accessor::operator[] in ESIMD code.
- Expanded driver's ability to discover fat static archives after /link option on Windows.
- Added support for saving user-specified names for lambda captures in kernel lambda object for FPGA target.
- Adjusted the compilation when preprocessing is requested to allow for the device compilation to fail and continue to perform the preprocessing steps.
- Added the ability to detect a kernel size mismatch in the case when the host and device compilers are different.
- Improved handling of specialization constants by backends.
- Improved support of -mlong-double options.
- Improved -save-temps to allow optimization when performing a SYCL device compilation.
- Removed warning diagnostic on host compilation when using __attribute__((sycl_device)).
- Improved compiler to collect information for optimization record only if optimization record is saved by the user (i.e. -fsave-optimization-record or -opt-record-file is passed).
- Improved [[intel::max_concurrency()]] attribute support.
- Added the new kernel_arg_exclusive_ptr metadata which guarantees that the kernel pointer argument, or pointers that derive from it, will not be dereferenced outside the current invocation of the kernel.
- Improved deprecation messaging for options.
- Improved diagnostic behavior for -fsanitize with -fsycl.
- Added support for sycl::ctz API. [d5eb769]
- Improved the diagnostic for unresolved symbols in the device code for Level Zero backend.
- Added several arithmetic operations for sycl::ext::oneapi::experimental::wi_element.
- Added sycl::property_list APIs to sycl::stream.
- Defined sycl::access::decorated in the SYCL headers.
- Improved performance by allowing batching for the wait with barrier commands for Level Zero backend.
- Avoided JITing unnecessary device code when using sycl::handler::set_specialization_constant.
- Updated image accessor constructor to make it possible to use const references in parallel_for.
- Relaxed the mutex lock duration in queue finish for the Level Zero backend to allow working with the queue from other threads.
- Added XPTI instrumentation for USM allocations.
- Extended XPTI information with buffer constructor data.
- Added error handling for sycl::event::get_profiling_info().
- Eliminated recursion and duplicated dependencies in leaf buffers handling in the scheduler.
- Improved runtime to emit program build logs when SYCL_RT_WARNING_LEVEL is set to 2 or higher.
- Improved the error message at command execution failure.
- Improved runtime to build a program for root device only and re-use the binary for sub-devices to avoid "duplicate" builds.
- Improved sycl::kernel::get_kernel_bundle performance.
- Changed USM pooling parameters for the Level Zero backend to boost performance.
- Exposed value_type and min_capacity from SYCL pipes extension class.
- Improved thread safety of the Level Zero plugin by guarding access to the PI objects.
- Improved runtime to redirect warnings from using SYCL_DEVICE_FILTER with sycl-ls to std::cerr.
- Used new SPIR-V group operations within uniform control flow instead of non-uniform operations in SYCL headers.
- Enabled online linking of the device libraries.
- Improved esimd-verifier logic for detecting valid SYCL calls.
- Extended XPTI information with the kernel info.
- Added overload for sycl::select(a, b, c) where c is a bool.
- Fixed batching-related thresholds to improve performance.
- Added always_inline for libdevice functions, which allows an underlying runtime to do inlining.
- Improved performance by caching the result of zeKernelGetName in the Level Zero plugin.
- Updated the experimental latency control API to use property list and made the template argument approach deprecated.
- Enabled pooling of small USM allocations for the Level Zero backend to improve performance.
- Added managed memory check to enqueue prefetch, made it ignore the prefetch hint, and emit a warning if the memory provided is not managed.
- Enabled device code instrumentation by default.
- Optimized host event wait.
- Improved default selector to filter devices based on available device images.
- Enabled caching of native OpenCL and Level Zero executable binaries.
- Deprecated the sycl::ext::intel::ctz extension function; sycl::ctz from the core specification must be used instead.
- Deprecated ext_intel_host_device_space, which is replaced by ext_intel_global_host_space.
- Added an option --enable-esimd-emulator to enable esimd emulator build using configure.py.
- Added an ability to build plugins separately.
- Added --enable-all-llvm-targets switch to configure.py.
- Added PI tracing support for loadOsLibrary.
- Clarified the interaction between the sycl_ext_oneapi_invoke_simd extension and SYCL_EXTERNAL functions.
- Removed specifications of extensions that were adopted into SYCL 2020. Please refer to extensions/removed/README for the list of removed extensions.
- Clarified which SPIR-V decorations the sycl-post-link tool generates for each device global variable.
- Updated the design for device global variables for variables that are "shadowed" in an unnamed namespace.
- Clarified the specification that device global with SYCL_EXTERNAL is allowed.
- Added an overview README for the extensions directory.
- Added a new rule for naming C++ identifiers in the SYCL project.
- Added ESIMD_EMULATOR to SYCL_DEVICE_FILTER description.
- Clarified availability of get_property().
- Deprecated extended atomics extension.
- Added description of ESIMD_EMULATOR backend to sycl_ext_intel_esimd/README.
- Added support for the task_sequence extension in FPGA.
- Modified the latency control APIs that apply to pipe read/write and LSU load/store in FPGA.
- Enhanced the FPGA optimization report GUI.
- For the support, addition, renewal, removal, fix, or deprecation of any SYCL 2020 conformant feature, please check SYCL 2020 Specification Features.
Bug Fixes
- The DPC++/C++ compilers previously did not support linking library archives with the -l option when the libraries contained target offload code (i.e., offload code for GPU or FPGA). Instead of using the -l option, the link command line had to specify the full path to the archive (a file ending with .a). Furthermore, the developer had to guarantee that all required objects would be linked even if the libraries specified with the -l option were omitted from the command line. The simplest way to do this was to force linking of the whole archive. For example, if libbar.a in icx -o prog obj1.o obj2.o -L /path/to/libs -lfoo -lbar had target offload code, the link command became icx -o prog obj1.o obj2.o -L /path/to/libs -lfoo -Wl,--whole-archive /path/to/libs/libbar.a -Wl,--no-whole-archive, where the -Wl,--no-whole-archive option was required after the user library even though no additional library was listed. This has been fixed in this release.
- Fixed a crash that occurred if an overloaded new operator was used in a recursive function in the device code.
- Fixed macros being unavailable when using a custom host compiler.
- Fixed device code linking when one of the targets is not spir64 based. [1f8874f]
- Disabled a part of the SimplifyCFG optimizations in SYCL mode that resulted in invalid optimizations in some cases.
- Silenced "unknown attribute" warning emitted during host part of full -fsycl compilation when it saw [[intel::device_indirectly_callable]] attribute.
- Removed incorrect assertion for use of -fopenmp-new-driver for multiple inputs.
- Fixed problems where function pointers were captured as kernel arguments.
- Fixed the error "Explicit load/store type does not match pointee type of pointer operand" caused by incorrect address space.
- Fixed incorrect diagnostic for __spirv calls when the reqd_sub_group_size attribute is applied on a SYCL kernel.
- Fixed alignment of emulated specialization constants.
- Fixed a crash that could happen when building a program for multiple devices.
- Fixed ambiguity error with sycl::oneapi::experimental::this_nd_item.
- Fixed a performance issue caused by unnecessary command batching in the Level Zero plugin.
- Fixed an issue that might result in JITing for only one device while context is associated with multiple devices for Level Zero backend.
- Fixed namespace ambiguity in this_id, this_item, and this_group.
- Fixed two bugs in the Level Zero driver related to the static linking extension.
- Fixed return type of get_native<sycl::backend::opencl>(event) from cl_event to vector<cl_event>.
- Modified Level Zero plugin support for copy engines to address scenarios when the main copy engine is unavailable.
- Fixed support for the query of USM capabilities.
- Fixed memory leak in the USM prefetch functionality.
- Fixed host device local accessor alignment.
- Fixed sycl::errc values for exceptions per SYCL 2020.
- Fixed bug with constexpr_recurse usage.
- Fixed max_work_group_size and reqd_work_group_size attribute arguments check.
- Fixed iterator debug level mismatch error on Windows when building programs with /MDd when libsycl-fallback-cassert.obj is involved.
- Fixed get_native() for sycl::event per requirements of the specification.
- Fixed device enumeration for the next platforms when the current platform doesn't have devices.
- Fixed thread-safety issue in the scheduler which can appear if a command gets cleaned up by another thread while adding a host accessor.
- Fixed SYCL_PROGRAM_COMPILE_OPTIONS and SYCL_PROGRAM_LINK_OPTIONS to override compile and link options respectively.
- Fixed incorrect handling of queue indexing for Level Zero backend.
- Fixed memory leaks in the reductions that require additional resources (such as buffers).
- Defined get_property/has_property in the queue for property::queue::in_order.
- Fixed memory leak in the scheduler for run_on_host_intel commands.
- Fixed thread-safety issue caused by parallel access to the command list cache in the Level Zero plugin.
- Fixed device code outlining for static local variables to avoid invalid device code generation.
- Fixed dynamic batching in the Level Zero plugin.
- Fixed unsigned long warning in fallback cstring on Windows.
- Fixed sync of host task vs. kernel for the in-order queue.
- Fixed include dependency in fpga_lsu.hpp and pipes.hpp headers.
- Fixed kernel timestamp calculation in the Level Zero plugin.
- Fixed usage of copy-engines in the Level Zero interoperability queue.
- Fixed kernel execution hangs under large memory consumption by working around a bug in the Level Zero runtime.
- Fixed the Level Zero plugin to honor property::queue::enable_profiling.
- Fixed memory leak which existed when program build failed for the Level Zero backend.
- Fixed buffer creation from rvalue iterator.
- Changed queue::device_has() to private.
- Fixed crash for the case when a device image has no kernels.
- Fixed dependency between host/device actions for unbundled FPGA-specific archives.
- Fixed interop_handle::get_native_mem so that it can work with accessors that use non-empty accessor_property_list.
- Fixed sub-device count calculation for NUMA partitioning.
- Fixed SYCL_ENABLE_PLUGINS to enable both the OpenCL and the Level Zero PI plugins if it is unset.
- Fixed BDF format on PCI query for the Level Zero backend.
- Fixed sycl::queue XPTI instrumentation.
- Fixed interoperability return type for sycl::buffer to std::vector<cl_mem> per SYCL 2020.
- Fixed SYCL_DUMP_IMAGES handling to also dump when spec constants are on.
- Fixed failure in case of using zero-size local accessor on some backends.
- Fixed flaky bug which might appear in multi-threaded applications with simultaneous access to the cache of device lib programs.
- Fixed make_queue interoperability API for Level Zero to accept device argument to properly associate queue with the right device.
- Fixed invalid handler issue by updating OpenCL ICD loader from the community.
- Fixed "undefined symbol" error for ldexpf, hypotf, frexpf on SYCL GPU device using 3rd-party math headers instead of MSVC math headers on Windows.
- Fixed memory leak for interop events created from the native handle.
- Fixed alignment of the memory returned from USM allocation functions.
- Fixed sporadic failure of the in-order queue due to non-closed batch on the Level Zero backend.
- Fixed possible deadlock in case of having dependent events from different queues in a multi-threaded application.
- Fixed issue with the delivery of assert message before aborting.
- Fixed default value for the Alignment template parameter of the usm_allocator.
- Fixed API to get maximum width/height/depth of an image for the Level Zero backend.
- Fixed sycl-post-link tool to properly handle the offset in specialization constant descriptors.
- Fixed sycl-post-link tool to properly handle the padding at the end of composite types.
- Fixed translation of Vector[Extract/Insert]Dynamic instructions in llvm-spirv.
- Fixed unconditional debug info generation for libsycl_profiler_collector.so.
- Fixed sycl-post-link failure caused by incorrect removal of llvm.used in the case when specialization constant has 2+ users.
- Removed extension to set kernel cache configuration.
- Disallowed [[sycl_detail::uses_aspects()]] attribute on type aliases in OptionalDeviceFeatures.
- Moved properties and property-related APIs into sycl::ext::oneapi::experimental. The sycl_ext_oneapi_properties specification was updated to revision 2.
- Updated sycl_ext_oneapi_kernel_properties extension.
- Aligned sycl_ext_intel_kernel_args_restrict extension with SYCL 2020.
- Removed deprecated API from ESIMD headers.
- Renamed wi_slice to wi_data.
- Renamed nbarrier_* API to named_barrier_* for ESIMD.
- Moved a part of ESIMD APIs outside of the experimental namespace.
- Moved bfloat16 from intel namespace to oneapi namespace.
- Fixed an issue of FPGA compilation command option -Xsdsp-mode=<>, which used to fail when passed after some other -Xs command option.
- When compiling for FPGA, the integer modulo operation on ac_int and ac_fixed types of four bits or fewer now uses 8 bits.
- Fixed an issue with FPGA optimization reports where the compiler did not render certain text characters included in the source file.
- Fixed an issue with FPGA compilation on a Linux system where the compiler could not detect the zlib library.
Known Issues and Limitations
- SYCL built-in group algorithms may produce wrong results on CPU or FPGA emulator devices if all of the following conditions are met:
- The work-group size on the highest dimension is larger than the sub-group size
- The group algorithm is applied to the work-group
- The group algorithm produces the same result for all work-items in the work-group (e.g. all_of_group, any_of_group, group_broadcast, reduce_over_group)
- The group algorithm is used in a loop and the result may change as the inputs change
For example, the following kernel code would produce wrong results (the while loop may not exit or acc[gid] may not be set for all work-items due to the known issue):
- SYCL 2020 barriers show worse performance than SYCL 1.2.1 barriers.
- Explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv is required when using fallback assert in separate compilation flow.
- Alignment of allocation requests is limited to 64KB, which is the only alignment supported by Level Zero.
- In the following scenario on the Level Zero backend:
  1. Kernel A, which uses buffer A, is submitted to queue A.
  2. Kernel B, which uses buffer B, is submitted to queue B.
  3. queueA.wait().
  4. queueB.wait().
  The DPC++ runtime used to treat unmap/write commands for buffer A/B as host dependencies (i.e., they were waited for before enqueueing any command that's dependent on them). This allowed the Level Zero plugin to detect that each queue is idle on steps 1/2 and submit the command list immediately. This is no longer the case since we started passing these dependencies in an event waitlist and the Level Zero plugin attempts to batch these commands, so the execution of kernel B starts only on step 4. The workaround restores the old behavior in this case until this is resolved.
- User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e., an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
- A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with the context and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open up the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
- The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
- Using sycl::program/sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
- Linkage errors with the following message:
  error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined
  can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and the user specifies -std=c++14 or /std:c++14.
- Printing internal defines is not supported on Windows.
- The new -ax (auto CPU dispatch) option is not currently supported when building libraries with the -fpic option.
- /Fo<file or dir/> flag no longer accepts directory arguments. Using this flag will result in an error message: clang-offload-bundler command failed with exit code 1. Fix is not available in this release.
- The compile times can be significant when compiling for FPGA and using a read-only accessor for a very wide struct. Using a read-write accessor instead is a workaround to address long compile times.
- When compiling for FPGA, you cannot use a system installed with Intel® FPGA PAC D5005 to compile a SYCL application that targets Intel® PAC with Intel® Arria® 10 FX FPGA. Compilation may succeed, but the compiled binary might fail at runtime. There is no workaround available for this issue currently.
- When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
  dpcpp -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
  dpcpp -fintelfpga <other arguments> -Xshardware kernel.o
- When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- In the FPGA optimization report, the Loop Viewer (Alpha) can only handle loops with 100 iterations or less currently. For designs with loops greater than 100 iterations, the optimization reports hang. There is no known workaround for this issue.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
  dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
  NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
  As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- On Windows systems, when compiling for the FPGA emulator flow, using -c to create object files, linking through to an archive file, and generating an executable from that archive might result in an executable that fails to launch device kernels. As a workaround for this issue, add the -fsycl-device-code-split=none flag to the archive step as shown in the following:
  // generate .obj files
  dpcpp /EHsc -fintelfpga -c host.cpp device.cpp device_adder.cpp -DFPGA_EMULATOR
  // generate host.a
  dpcpp -fintelfpga -fsycl-link=image -fsycl-device-code-split=none host.obj device.obj device_adder.obj
  // generate .exe
  dpcpp -fintelfpga host.a /link /wholearchive
  // emulator executable
  host.exe
- When compiling for FPGA, the compiler might produce a different intermediate representation (IR) on Windows than Linux. Misaligned structs cause this issue. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows. As a workaround, force an alignment on the misaligned structs, as shown in the following example:
  // Code with misaligned struct
  struct Item {
    bool valid;
    int value1;
    unsigned char value2;
  };
  // Forced alignment of the struct
  struct Item {
    bool valid;
    bool __empty__[3];
    int value1;
    unsigned char value2;
    unsigned char __empty2__[3];
  };
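The explicit-padding workaround for misaligned structs can be checked at compile time. The following is a minimal sketch in standard C++ (no SYCL headers needed), assuming a platform with a 1-byte bool and a 4-byte int. The names ItemPadded, pad0, and pad1 are hypothetical and stand in for the padded struct and its filler members from the note above.

```cpp
#include <cstddef>  // offsetof

// Hypothetical padded variant of the struct from the release note.
// pad0 and pad1 are illustrative filler members, not part of any API.
struct ItemPadded {
  bool valid;
  bool pad0[3];           // explicit padding so value1 starts at offset 4
  int value1;
  unsigned char value2;
  unsigned char pad1[3];  // explicit tail padding up to a multiple of 4
};

// Assumptions of this sketch; compilation fails loudly if they do not hold.
static_assert(sizeof(bool) == 1, "sketch assumes a 1-byte bool");
static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

// With explicit padding, every member offset is spelled out in the source,
// so the layout does not depend on padding the compiler inserts implicitly.
static_assert(offsetof(ItemPadded, value1) == 4, "value1 must start at byte 4");
static_assert(offsetof(ItemPadded, value2) == 8, "value2 must start at byte 8");
static_assert(sizeof(ItemPadded) == 12, "struct must occupy exactly 12 bytes");
```

Placing such checks next to the struct definition makes a layout difference between host compilers show up as a build error rather than as an unexpected initiation interval.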
oneAPI 2022.2, Compiler Release 2022.1.0
New Features and Improvements
- Added support for the default context extension on Linux*, which returns the current default context for this platform.
- Added support for SYCL* sub-group mask extension that can be used to efficiently represent subsets of work items in a sub-group.
- Implemented discard_events extension, which adds ext::oneapi::property::queue::discard_events property for sycl::queue. By using this property, the application informs a SYCL implementation that it will not use the event returned by any queue member functions. When the application creates a queue with this property, the implementation may be able to optimize some operations on the queue.
- Implemented SYCL 2020 property traits.
- Implemented MAX_WORK_GROUP_QUERY extension, which adds two new device information descriptors that provide the ability to query a device for the maximum numbers of work-groups submitted in each dimension and globally (across all dimensions).
- Added experimental support for SYCL group sorting algorithm:
- joint_sort uses the work items in a group to execute the corresponding algorithm in parallel.
- sort_over_group performs a sort over values held directly by the work-items in a group, and results returned to work-item i represent values that are in position i in the ordered range.
- Added support for debugging on the Intel® FPGA Emulation Platform for OpenCL™ software.
- Added support for FPGA simulation flow on Windows* system.
- Added support for aocl binedit utility to extract useful information about the compiled FPGA binary.
- Added support for latency controls to specify an exact, minimum or maximum latency between read and write accesses on memories and pipes in FPGA.
- Added support for -Xsdsp-mode=<option> to control FPGA hardware implementation of the supported data types and math functions.
- Added support for sycl::ext::intel::fpga_loop_fuse<v>(f) and sycl::ext::intel::fpga_loop_fuse_independent<v>(f) functions, which allow fusing adjacent loops in the FPGA code block overriding the compiler profitability analysis of the fusion.
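The result placement described for sort_over_group in the list above (work-item i receives the value at position i of the ordered range) can be illustrated with a host-side emulation. This is a sketch in standard C++ only, not the SYCL API: emulate_sort_over_group and sort_over_group_example are hypothetical helper names, and ascending order is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side emulation only (hypothetical helper, not a SYCL function).
// values_per_item[i] is the value held by work-item i on entry; element i of
// the returned vector is the value work-item i would hold after the sort.
std::vector<int> emulate_sort_over_group(std::vector<int> values_per_item) {
  // Assumption: ascending order, i.e. a "less than" comparison.
  std::sort(values_per_item.begin(), values_per_item.end());
  return values_per_item;
}

// Small self-check: four work-items holding 7, 2, 9, 4 end up holding
// 2, 4, 7, 9, so work-item 0 gets the smallest value and work-item 3 the largest.
inline void sort_over_group_example() {
  const std::vector<int> result = emulate_sort_over_group({7, 2, 9, 4});
  assert((result == std::vector<int>{2, 4, 7, 9}));
}
```

In a real kernel each work-item passes its own value and receives its own result; the vector here only models the group as a whole.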
Bug Fixes
- Fixed a SYCL driver issue concerning the device binary image, which gets corrupted when you use a two-step AOT build, and there is at least a single call to the devicelib function from within the kernel.
Known Issues and Limitations
- Having a MESA OpenCL implementation that provides no devices on a system may cause incorrect device discovery. As a workaround, you can disable that OpenCL implementation by removing /etc/OpenCL/vendors/mesa.icd.
- Compilation of a SYCL* program may fail on Windows in debug mode if a kernel uses std::array. This is a limitation that we are not planning to resolve. A workaround is to use sycl::buffer, which captures data() of std::array and accesses the data in SYCL kernel through sycl::accessor. You can perform the following as an alternate workaround:
- Dump compiler pipeline execution strings by passing the -### option to the compiler. The compiler prints the internal execution strings of compilation tools. The actual compilation does not happen.
- Modify the (usually) first execution string (it should have -fsycl-is-device option) by adding -D_CONTAINER_DEBUG_LEVEL=0 -D_ITERATOR_DEBUG_LEVEL=0 options to the end of the string. Execute all the strings one by one.
- -fsycl-dead-args-optimization cannot help eliminate the offset of an accessor even though it is created with no offset specified.
- Using sycl::queue::prefetch API on Windows might lead to failure due to issues with cuMemPrefetchAsync.
- Default context does not bind to sub-devices created from the root device the context is bound to. You can create a context explicitly using all required sub-devices as a workaround.
- Using forward references within a class member of an array type in SYCL device code may result in the segmentation fault. Add static in front of the array to prevent this crash as a workaround.
- DPC++ Compiler does not work together with Windows SDK for Windows 11. The latest supported Windows SDK is Windows 10 SDK version 2104 (10.0.20348.0).
- To run sys_check by the Diagnostics Utility for Intel® oneAPI Toolkits, one has to do one of the following:
- Add the path to the sys_check file to the DIAGUTIL_PATH environment variable manually:
export DIAGUTIL_PATH=/opt/intel/oneapi/compiler/latest/sys_check/sys_check.sh:$DIAGUTIL_PATH; diagnostics.py --filter sys_check
- Use the -p option of the Diagnostics Utility for Intel® oneAPI Toolkits to add the path to the sys_check file to the DIAGUTIL_PATH environment variable:
diagnostics.py -p /opt/intel/oneapi/compiler/latest/sys_check/sys_check.sh --filter sys_check
- Developers who use Microsoft Visual Studio* 2019 should install CMake 3.21.5 or CMake 3.22.2 to use icx correctly with CMake.
- Since the vectorization is now enabled/disabled by using #pragma vector [no]vecremainder and internal switches, you must place additional #pragma omp simd in the code to achieve performance.
- The compile times can be significant when compiling for FPGA and using a read-only accessor for a very wide struct. Using a read-write accessor instead is a workaround to address long compile times.
- When compiling for FPGA, you cannot use a system installed with Intel® FPGA PAC D5005 to compile a SYCL application that targets Intel® PAC with Intel® Arria® 10 FX FPGA. Compilation may succeed, but the compiled binary might fail at runtime. There is no workaround available for this issue currently.
- When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
dpcpp -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
dpcpp -fintelfpga <other arguments> -Xshardware kernel.o
- When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- In the FPGA optimization report, the Loop Viewer (Alpha) can only handle loops with 100 iterations or less currently. For designs with loops greater than 100 iterations, the optimization reports hang. There is no known workaround for this issue.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- FPGA optimization reports are not displayed correctly within Microsoft Visual Studio on Windows. Open the report.html file generated in the project directory to view the reports.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- When compiling for FPGA, the Windows emulator flow using -c to create object files, linking through to an archive file, and then generating an executable from that archive might result in an executable that fails to launch device kernels. As a workaround for this issue, add the -fsycl-device-code-split=none flag to the archive step as shown in the following:
# generate .obj files
dpcpp /EHsc -fintelfpga -c host.cpp device.cpp device_adder.cpp -DFPGA_EMULATOR
# generate host.a
dpcpp -fintelfpga -fsycl-link=image -fsycl-device-code-split=none host.obj device.obj device_adder.obj
# generate .exe
dpcpp -fintelfpga host.a /link /wholearchive
# emulator executable
host.exe
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- When compiling for FPGA on a Linux system, you might see Unable to open zlib library! error message when the compiler cannot detect the zlib library, which comes standard on most Linux OSes. As a workaround for the compiler to detect this library, install a development version of the library by executing one of the following OS-specific commands:
- Ubuntu 18: sudo apt install zlib1g-dev
- RHEL 7/CentOS 7: sudo yum install zlib-devel
- When launching FPGA optimization reports, the compiler might fail to render certain text characters included in the source file. If the reports are crashing, verify whether the source file has any string literals that end in an escaped backslash in the fileJSON object’s content section within the report_data.js file under the reports/lib/ directory. As a workaround for this issue, modify the report_data.js file to escape the unescaped character. For example, change "hello\\" to "hello\\\".
- When compiling for FPGA, the integer modulo operation is unsupported for ac_int and ac_fixed types that are 4 bits wide or narrower.
- When compiling for FPGA, passing the global scope DSP control option -Xsdsp-mode=<> after some other -Xs command option can result in a compiler failure. As a workaround, always pass the -Xsdsp-mode=<> option as the first -Xs command option.
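For the vectorization item in the list above, which asks for an additional #pragma omp simd in the code, a minimal sketch of such a placement might look like the following. The function saxpy is a hypothetical example and is not taken from the product documentation. The pragma only takes effect when OpenMP SIMD support is enabled in the compiler; otherwise it is ignored and the loop runs as ordinary C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: computes y[i] += a * x[i] for every element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    // Explicitly request SIMD vectorization of this loop. Each iteration is
    // independent of the others, so vectorizing it is safe.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```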
oneAPI 2022.1, Compiler Release 2022.0.0
New Features and Improvements
- Vectorization for OpenMP SIMD was previously supported at O2 or above when OpenMP language features are enabled. It is now supported at O0 and above if OpenMP language features are enabled (e.g., -qopenmp, -qopenmp-simd).
- Added -fopenmp-target-simd to enable OpenMP SIMD support for GPU.
- Added -fopenmp-target-simdlen=n to specify the GPU vector length for OpenMP SIMD loops.
- Added support for Target in_reduction clause from OpenMP 5.0 standard.
- Support for masked construct and tile construct from OpenMP 5.1 standard.
- nowait for asynchronous offloading.
- Added support for new SYCL 2020 features sycl::logical_and and sycl::logical_or and completed support for Host Task. A complete list of SYCL 2020 features supported can be found here.
- Added the following DPC++ Extensions:
- Support for sRGBA which provides linearization of RGB color values that adjusts the color to be better matched with a particular hardware medium.
- Preview implementation for Matrix Programming Extension in SYCL.
- Support for SYCL_EXT_INTEL_BF16_CONVERSION.
- Removed support for deprecated SYCL 1.2.1 APIs as listed here.
- Support for the SYCL half type in the global namespace has been removed to avoid potential conflicts with user-defined types. This was previously an alias to the sycl::half type. To resolve compilation failures caused by the missing ::half type, use the sycl::half type directly.
- Added an experimental feature to speed up incremental build time of DPC++ applications which can be enabled using the compiler option -fsycl-max-parallel-link-jobs=<N>. This option tells the compiler that it can simultaneously spawn up to the specified number of processes to perform actions required to link DPC++ applications.
- Previous compiler releases included all LLVM tools in their bin directory. When added to PATH, some of these binaries were found to unexpectedly conflict with other LLVM installations on the system, so they are moved to a sibling bin-llvm directory. Compiler drivers (dpcpp/icx/icpx/ifx) are adjusted to find these internal tools as necessary, typically transparently to users. However, we recognize that there may be cases where the tools which are no longer in PATH were being invoked directly in some application Makefiles (or CMake configuration), and this may require adjustment. Refer to <…/bin/>../bin-llvm/README for more details.
- The compiler now uses the Windows registry as the default mechanism to discover the backend OpenCL ICDs on Windows. The OCL_ICD_FILENAMES environment variable is for debugging only and does not work with administrative privileges on Windows.
- Added support for Microsoft Visual Studio* 2022.
- The Intel-specific header aligned_new is no longer included, as the functionality has been superseded by the C++17 aligned operator new feature. The functionality previously provided by aligned_new is now present in new and should be usable without any other changes besides altering the preprocessor include.
- Added support for the -Xssfcexit-fifo-type=<value> flag to globally control exit FIFO latency of stall-free clusters in FPGA.
- Added support for the nofusion loop attribute to prevent a loop from fusing with an adjacent loop in FPGA.
- Added support for the -Xsread-only-cache-size=<N> flag to enable the read-only cache for read-only accessors in FPGA.
- Deprecated the support for the hls_float data type and replaced it with ap_float data type for FPGA.
- Added support for open source runtime environment for FPGA.
- Added support for fast BSP customization flow for FPGA.
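For the aligned_new item in the list above, the following is a minimal sketch of the C++17 aligned operator new behavior that supersedes it, assuming the code is compiled as C++17 or later. The type CacheLineBlock and the function make_block are hypothetical names used only for illustration. The only include change the sketch needs is the standard <new> header.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical over-aligned type: 64 bytes exceeds the default new
// alignment on typical platforms, so C++17 routes the allocation through
// the std::align_val_t overloads of operator new declared in <new>.
struct alignas(64) CacheLineBlock {
    float data[16];
};

// Allocates a block on the heap and checks that the requested alignment
// was honored by the C++17 aligned operator new.
std::unique_ptr<CacheLineBlock> make_block() {
    auto block = std::make_unique<CacheLineBlock>();
    auto address = reinterpret_cast<std::uintptr_t>(block.get());
    assert(address % alignof(CacheLineBlock) == 0);
    return block;
}
```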
Bug Fixes
- Fixed an issue where the dpcpp compiler generated a temporary source file used during host compilation. The file appeared as a source dependency, potentially breaking build environments that closely keep track of files generated during a compilation.
- Fixed an issue where sycl::link API could fail to JIT-compile user code if input kernel bundle/s contain more than one device image within them and specialization constants are used.
- When compiling for FPGA, if you declare kernel names locally, the kernel name is correctly demangled in FPGA optimization reports.
- Fixed an FPGA emulator issue where the compiler would fail if you had also installed a oneAPI-specific GPU platform.
Known Issues and Limitations
- The latest GPU driver available at https://dgpu-docs.intel.com/ introduces an Ahead-Of-Time (AOT) build issue for OpenMP offload applications running on Gen9 iGPU when using oneAPI compilers. A fix for this issue will be available in the upcoming driver release.
For assistance with downgrading to a version of the driver which does not have this issue, contact us via Graphics - Intel Communities.
- GPU offload applications using extensive multi-threading (>2 threads) may experience hangs or timeouts which can be recovered only through a hard reset or power cycling of the system for the following Linux distributions. The issue occurs when reading/writing data to the Intel GPU while making extensive use of multi-threading due to a defect in older Linux kernels.
| Kernel/distribution | Problem occurs | Problem does not occur |
| --- | --- | --- |
| RedHat Enterprise Linux | RHEL 8.4 (kernel 4.18.0-305) and older | RHEL 8.5 (kernel 4.18.0-348) |
| SUSE Linux | SLES15 SP3 and older | SLES15 SP4 beta |
| Ubuntu Linux | Ubuntu releases older than 20.04.03 | Ubuntu 20.04.03 (kernel 5.11.0-40-generic #44~20.04.2-ubuntu)* |
Preferred Workaround: Upgrade to a Linux distribution where the defect has been fixed. Note that the software will run, but a warning message will appear in kernel logs.
GPU software for Ubuntu 20.04.03 is available now via https://dgpu-docs.intel.com. Note that the software will run, but a warning message will appear in kernel logs.
GPU software for RHEL 8.5 will be available in Q1 2022 at the same location.
GPU software for SLES15 SP4 will be available shortly after the general availability of SLES15 SP4.
Alternative Workaround: Do not use extensive multi-threading in GPU-enabled applications, i.e., keep the number of threads to no more than 2. For example, for applications using the oneAPI MPI library, use the single-threaded version of the MPI run-time library, rather than the multi-threaded version. Set the environment variable I_MPI_THREAD_SPLIT=0 to use the single-threaded version of MPI.
- The OpenMP default loop schedule modifier for work-sharing loop constructs was changed to nonmonotonic when the schedule kind is dynamic or guided to conform to the OpenMP 5.0 standard. User code that assumes monotonic behavior may not work correctly with this change. Users can add the monotonic schedule modifier in the schedule clause to keep the previous code behavior.
- Performance degradation is expected with SYCL 2020 barriers compared to barriers in SYCL 1.2.1. The issue is currently under investigation and is expected to be fixed in a future release.
- When using a two-step Ahead of Time (AOT) compilation with at least a single call to devicelib function from within the kernel, the device binary image may get corrupted.
- The alignment of allocation requests is limited to 64 KB due to limited support by Level Zero Runtime.
- SYCL 2020 Specialization constants feature has the following limitations:
- Building a program, which uses specialization constants for both JIT and AOT targets at the same time could result in an exception thrown with the following message: Native API failed. Native API returns: -49 (CL_INVALID_ARG_INDEX) -49 (CL_INVALID_ARG_INDEX).
- Setting specialization constant value to zero is ignored by DPC++ runtime in the non-AOT scenario, i.e. when -fsycl-targets command-line option is not passed or when spir64 is the target. Following is an example code demonstrating the issue. There is currently no workaround.
constexpr specialization_id<int> spec_id(42);
// ...
queue q;
q.submit([&](handler &cgh) {
  cgh.set_specialization_constant<spec_id>(0);  // spec_id will still have value 42
  cgh.set_specialization_constant<spec_id>(41); // spec_id value will be changed to 41
  cgh.set_specialization_constant<spec_id>(0);  // spec_id will still have value 41
});
- In AOT mode, setting default values on padded objects can cause misalignment in other default values. This may cause specialization constants to have the wrong default values. For example:
struct PaddedStruct {
  uint32_t a;
  char b;
  constexpr PaddedStruct() : a(0), b('a') {}
  constexpr PaddedStruct(uint32_t _a, char _b) : a(_a), b(_b) {}
};
constexpr specialization_id<PaddedStruct> padded_struct_spec_id{20, 'c'};
constexpr specialization_id<bool> bool_spec_id{true};
In this example, PaddedStruct has a size of 8 bytes, 3 of which are padding. This can cause the specialization constant identified by bool_spec_id not to have its default value of true. A known workaround for this issue is to remove the padding from a padded object by adding __attribute__((packed)) to the class or struct, i.e., PaddedStruct becomes:
struct __attribute__((packed)) PaddedStruct {
  uint32_t a;
  char b;
  constexpr PaddedStruct() : a(0), b('a') {}
  constexpr PaddedStruct(uint32_t _a, char _b) : a(_a), b(_b) {}
};
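The padding arithmetic in the example above can be checked on the host with a small standard C++ sketch. It assumes a GCC- or Clang-compatible compiler (for __attribute__((packed))) and a typical platform where uint32_t has 4-byte alignment. PackedStruct and padding_bytes are hypothetical names introduced here so that both layouts can be compared side by side, and the SYCL-specific parts are left out.

```cpp
#include <cstddef>
#include <cstdint>

// Same members as in the example above, without the SYCL parts.
struct PaddedStruct {
    std::uint32_t a;
    char b;
};

// Hypothetical packed counterpart, used only for the size comparison.
struct __attribute__((packed)) PackedStruct {
    std::uint32_t a;
    char b;
};

// 4 bytes (a) + 1 byte (b) + 3 bytes of tail padding on typical platforms.
static_assert(sizeof(PaddedStruct) == 8, "expected 3 bytes of tail padding");
// The packed layout keeps only the 5 bytes of actual data.
static_assert(sizeof(PackedStruct) == 5, "expected no padding when packed");

// Number of padding bytes removed by packing.
constexpr std::size_t padding_bytes =
    sizeof(PaddedStruct) - sizeof(PackedStruct);
```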
- The compiler option -Qlong-double on Windows* has limitations when used with the latest Microsoft Visual Studio* releases. Detailed information is available here.
- An undefined reference error for the sinpif and cospif functions (for example, Compilation from IR - skipping loading of FCL error: undefined reference to `sinpif') may appear even when application code does not use these functions. The error is caused by a compiler optimization phase. As a workaround, use the compiler flags -mllvm -enable-transform-sin-cos=0, which disable the faulty optimization.
- Using #pragma omp declare simd on a member template is currently not supported and can lead to the error "error: function declaration is expected after 'declare simd' directive". Non-template member functions and template functions that are not a member of a class are not affected.
- Using Microsoft Visual Studio* as a host compiler for DPC++ with C++17 enabled causes the error C:\Program Files (x86)\Intel\oneAPI\compiler\latest\windows\include\sycl\CL/sycl/ONEAPI/accessor_property_list.hpp(199): error C2686: cannot overload static and non-static member functions with the same parameter types. Refer to the article here on how to work around this issue.
- USM support for implicit migrations of shared allocations between device and host is currently implemented in software using access violation mechanisms (e.g., SIGSEGV) to identify access from the host. Undefined behavior may occur if applications rely on similar access-violation mechanisms, or if they use system calls to access shared-memory allocations before the GPU driver migrates them to the host.
- icx compiler does not support linking library archives using the -l option for libraries that contain target offload code. More details and workaround for this issue can be found at Known Issue: Static Libraries and Target Offload.
- Attempting to use Link Time Optimization (LTO) can cause a linker failure. To link successfully, make sure you have the recommended versions of binutils for your OS listed at Intel® oneAPI DPC++/C++ Compiler and Intel® oneAPI DPC++ Library System Requirements.
- User-defined functions with the same name and signature (exact match of arguments, return type does not matter) as an OpenCL C built-in function, can lead to Undefined Behavior. More details about this issue can be found at Known Issue: User-defined Functions with the Same Signature as OpenCL C built-in functions.
- A #pragma float_control that occurs at file scope does not take effect correctly for statement blocks that are nested within class definitions. The same issue exists for #pragma clang fp.
- When debugging FPGA emulator code in Microsoft Visual Studio* on a Windows* system, the debugger does not stop at breakpoints set in kernel code. There is no workaround available for this issue currently.
- The compile times can be significant when compiling for FPGA and using a read-only accessor for a very wide struct. Using a read-write accessor instead is a workaround to address long compile times.
- When compiling for FPGA, you cannot use a system installed with Intel® FPGA PAC D5005 to compile a SYCL application that targets Intel® PAC with Intel® Arria® 10 FX FPGA. Compilation may succeed, but the compiled binary might fail at runtime. There is no workaround available for this issue currently.
- When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
dpcpp -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
dpcpp -fintelfpga <other arguments> -Xshardware kernel.o
- When compiling for FPGA, the debug support on Windows is not available when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- In the FPGA optimization report, the Loop Viewer (Alpha) can only handle loops with 100 iterations or less currently. For designs with loops greater than 100 iterations, the optimization reports hang. There is no known workaround for this issue.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- FPGA optimization reports are not displayed correctly within Microsoft Visual Studio on Windows. Open the report.html file generated in the project directory to view the reports.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- On Windows system, when compiling for FPGA emulator flow, using -c to create object files, linking through to an archive file, and generating an executable from that archive might result in an executable that fails to launch device kernels. As a workaround for this issue, add the -fsycl-device-code-split=none flag to the archive step as shown in the following:
# generate .obj files
dpcpp /EHsc -fintelfpga -c host.cpp device.cpp device_adder.cpp -DFPGA_EMULATOR
# generate host.a
dpcpp -fintelfpga -fsycl-link=image -fsycl-device-code-split=none host.obj device.obj device_adder.obj
# generate .exe
dpcpp -fintelfpga host.a /link /wholearchive
# emulator executable
host.exe
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- When compiling for FPGA on a Linux system, you might see Unable to open zlib library! error message when the compiler cannot detect the zlib library, which comes standard on most Linux OSes. As a workaround for the compiler to detect this library, install a development version of the library by executing one of the following OS-specific commands:
- Ubuntu 18: sudo apt install zlib1g-dev
- RHEL 7/CentOS 7: sudo yum install zlib-devel
- When launching FPGA optimization reports, the compiler might fail to render certain text characters included in the source file. If the reports are crashing, verify whether the source file has any string literals that end in an escaped backslash in the fileJSON object’s content section within the report_data.js file under the reports/lib/ directory. As a workaround for this issue, modify the report_data.js file to escape the unescaped character. For example, change "hello\\" to "hello\\\".
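For the OpenMP schedule change noted earlier in this list (the default modifier for dynamic and guided schedules is now nonmonotonic), the following minimal sketch shows where the monotonic modifier goes in a schedule clause. The function sum_of_squares and the chunk size of 4 are hypothetical choices for illustration. This particular loop does not depend on chunk ordering, so it only demonstrates the syntax; without OpenMP enabled the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>

// Hypothetical example: sums i * i for i in [0, n).
long long sum_of_squares(int n) {
    assert(n >= 0);
    long long total = 0;
    // The explicit monotonic modifier requests that each thread receives its
    // chunks in increasing logical iteration order, which was the default
    // for dynamic schedules before the change to nonmonotonic.
    #pragma omp parallel for schedule(monotonic : dynamic, 4) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<long long>(i) * i;
    }
    return total;
}
```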
System Requirements
Additional Documentation
- Get Started with the Intel® oneAPI Toolkits for Linux*
- Get Started with the Intel® oneAPI Toolkits for Windows*
- OneAPI Versioning Schema based on Semantic Versioning
- Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- SYCL* 2020 Specification Features and DPC++ Language Extensions Supported
- OpenMP* Features and Extensions Supported in Intel® oneAPI DPC++/C++ Compiler
Previous oneAPI Releases
Notices and Disclaimers
Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.
Intel technologies may require enabled hardware, software, or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.