This document summarizes new and changed product features and includes notes about features and problems not described in the product documentation.
Where to Find the Release
Download the toolkit from the Base Toolkit Download page, then follow the installation instructions to install it.
Compiler Release 2023.2.4
NOTE: If you install or installed Intel compilers as part of the oneAPI 2023.2 release of the Intel® oneAPI Base Toolkit, the Intel® oneAPI HPC Toolkit, the Intel® oneAPI IoT Toolkit, or from the oneAPI Standalone Component page, please install the appropriate patch for your environment.
- Minor bug fixes and security updates
Compiler Release 2023.2.3
NOTE: If you install or installed Intel compilers as part of the oneAPI 2023.2 release of the Intel® oneAPI Base Toolkit, the Intel® oneAPI HPC Toolkit, the Intel® oneAPI IoT Toolkit, or from the oneAPI Standalone Component page, please install the appropriate patch for your environment.
- Bug fixes
Compiler Release 2023.2.2
Windows* only
NOTE: If you install or installed Intel compilers as part of the oneAPI 2023.2 release of the Intel® oneAPI Base Toolkit, the Intel® oneAPI HPC Toolkit, the Intel® oneAPI IoT Toolkit, or from the oneAPI Standalone Component page, please install the appropriate patch for your environment.
- Fixes a DPC++-specific compilation problem that might be exposed after migration from Microsoft Visual Studio* 2022 version 17.7.0 or newer.
Compiler Release 2023.2.1
If you install or installed Intel compilers as part of the oneAPI 2023.2 release of the Intel® oneAPI Base Toolkit, the Intel® oneAPI HPC Toolkit, the Intel® oneAPI IoT Toolkit, or from the oneAPI Standalone Component page, please install the appropriate patch for your environment.
Two patches are now available, one for each of the Intel C++ and Fortran compilers, that were published as part of oneAPI 2023.2:
* Intel® oneAPI DPC++/C++ Compiler and Intel® C++ Compiler Classic
* Intel® Fortran Compiler Classic and Intel® Fortran Compiler
The patch version is 2023.2.1. This patch fixes the issue with Linux modulefiles and an issue with using FPGA.
See "Known Issues and Limitations" for additional information.
oneAPI 2023.2, Compiler Release 2023.2
New Features and Improvements
- Added semi-dynamic SLM allocation - sycl::ext::intel::experimental::esimd::slm_allocator.
- Implemented sycl_ext_oneapi_matrix extension using a new unified interface.
- Partially implemented sycl_ext_codeplay_kernel_fusion extension for Linux: Intel® CPU and GPU devices are fully supported, except specialization constants, streams, images, and reductions.
- Added default constructor to sycl::local_accessor.
- Aligned sycl::vec aliases with SYCL 2020.
- Partially implemented sycl_ext_intel_fpga_kernel_interface_properties extension.
- Added support for sycl::ext::oneapi::experimental::if_architecture_is to ESIMD kernels.
- Implemented sycl::opencl::cl_* aliases.
- Implemented annotated_arg headers from sycl_ext_oneapi_annotated_arg extension.
- Added bfloat16 support to imf libdevice functions.
- Added sycl::device::get_info<info::device::aspects> specialization.
- Added rbegin(), rend(), crbegin() and crend() methods to sycl::accessor, sycl::host_accessor and sycl::local_accessor.
- Added accessor_ptr alias to sycl::accessor and sycl::local_accessor.
- Implemented get_multi_ptr method for sycl::accessor and sycl::local_accessor.
- Added size_type to sycl::accessor and sycl::local_accessor classes.
- Implemented accessor's implicit conversions.
- Implemented templated sycl::kernel_bundle::has_kernel functions.
- Implemented sycl::kernel_bundle::get_kernel function.
- Added the ability to set specialization constants in sycl::compile.
- Added the value_type and binary_operation member aliases and the dimensions value to the sycl::reducer class.
- Implemented sycl::ext::intel::math::inv free function.
- Added overloads on all binary operators with scalars as the left operand to sycl::marray, and allowed half, float, double in && and || operators.
- Added support of Intel® Xe Matrix Extensions with SIMD8 capability.
- Implemented device headers part of sycl::any_device_has and sycl::all_devices_have.
- Added host_task in sycl::target.
- Added support of sycl::marray to relational, common, and some math functions.
- Implemented sycl_ext_oneapi_annotated_ptr extension.
- Introduced an API to get thread ID and sub-device ID for ESIMD extension.
- Implemented kernel_bundle::native_specialization_constant().
- Implemented SYCL 2020 reductions without a known or specified identity.
- Implemented basic functionality for the following group types: ballot_group, cluster_group, tangle_group, and opportunistic_group as part of sycl_ext_oneapi_non_uniform_groups extension.
- Implemented is_group trait and changed the definition of is_group_v to align with SYCL 2020.
- Implemented atomic_memory_scope_capabilities and atomic_memory_order_capabilities device queries for OpenCL and Level Zero backends.
- Implemented atomic_fence_scope_capabilities and atomic_fence_order_capabilities device queries for OpenCL and Level Zero backends.
- Implemented sycl::ext::intel::experimental::esimd::bfn function from ESIMD extension.
- Made sycl::ext::oneapi::experimental::joint_reduce work with sycl::sub_group.
- Improved sycl::reqd_work_group_size for optional dimensions.
- Enabled DPC++ build with llvm-mingw toolchain on Windows.
- Deprecated usage of -sycl-std=1.2.1.
- Disabled force inlining of kernel call operator for FPGA.
- Deprecated sycl::ext::oneapi::leader(group); use the SYCL 2020 member function group::leader instead.
- Deprecated get_access for sycl::host_accessor in accordance with SYCL 2020.
- Deprecated get_size() and get_max_statement_size() member functions in the sycl::stream class, added replacements: size() and get_work_item_buffer_size().
- Deprecated legacy scalar and vector type aliases.
- Empty platforms are no longer returned.
- Started to throw an exception when a selected device is not in context.
- Changed signature of sycl::host_accessor::get_pointer in accordance with SYCL 2020.
- Started to ignore placeholder template parameter for accessors in accordance with SYCL 2020.
- Started to throw the correct exception when passing unbound accessor to command.
- Hid the members of sycl::reducer that are not mentioned in the SYCL 2020 specification and introduced the identity member function.
- Improved group_local_memory_for_overwrite (part of sycl_ext_oneapi_local_memory) by default initializing the returned memory, if needed.
- Augmented sycl::has_known_identity to return true for std::complex and std::plus operators.
- Improved diagnostic when sycl::accessor isn't bound to sycl::handler.
- Added identities for arithmetic and integral operations on bool in accordance with the SYCL 2020 specification, and added a check for the specification-defined identities.
- Renamed raw_send{s}_{load/store} APIs in ESIMD extension.
- Made sycl::reducer uncopyable and immovable in accordance with SYCL 2020.
- Updated aspects list to be SYCL 2020 compliant.
- Updated sycl::kernel_bundle::has_kernel(kernel_id, device) to return "true" if the kernel is valid for a given device in accordance with SYCL 2020.
- The implementation of host pipes has changed in the Intel FPGA IP Authoring flow.
- When compiling for FPGAs, using sycl::(u)intxx conflicts with ac_intN::(u)intxx (same name). In oneAPI releases prior to the 2023.2 release, the ac_int.hpp header file used to automatically define the using namespace ac_intN, so you had to be cautious of potential type conflicts when including both sycl.hpp and ac_int.hpp header files. Starting from the 2023.2 release, the ac_int.hpp header file no longer includes using namespace ac_intN, so if your source code includes ac_intN::(u)intxx without specifying the namespace, you must rectify it by adding the correct namespace.
- Added support for the FPGA optimization target -Xsoptimize=throughput to allow compiling FPGA designs with the maximum throughput without area optimization heuristics flow.
- Added support for the FPGA optimization flag -Xsuse-2xclock to explicitly create a 2xclock interface for a given design.
- Added support for the FPGA optimization flag -Xsregister-map-wrapper-type=<default|high-fmax|low-latency> to generate register map wrapper.
- ESIMD emulator support is deprecated and will be removed in oneAPI 2024.0.
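The ac_int.hpp namespace change in the list above can be illustrated with a minimal sketch in plain C++. The namespaces mock_sycl and mock_ac_intN and the aliases VectorOfFour and FourBitInt are hypothetical stand-ins, not the real SYCL or ac_int types; they only reproduce the situation where two headers declare a type with the same name, and show that qualifying the name at each use keeps the two apart.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the two headers. These are NOT the real SYCL or
// ac_int types; they only mimic two namespaces declaring the same type name.
namespace mock_sycl {
struct uint4 {
  unsigned lanes[4];  // stands in for a 4-element vector type
};
}  // namespace mock_sycl

namespace mock_ac_intN {
struct uint4 {
  static constexpr int width = 4;  // stands in for a 4-bit integer type
  unsigned value;
};
}  // namespace mock_ac_intN

// With both using-directives active, an unqualified `uint4` is ambiguous:
//   using namespace mock_sycl;
//   using namespace mock_ac_intN;
//   uint4 x;  // error: reference to 'uint4' is ambiguous

// Spelling out the namespace at each use resolves the conflict.
using VectorOfFour = mock_sycl::uint4;
using FourBitInt = mock_ac_intN::uint4;

// True when the two qualified names refer to different types.
constexpr bool types_are_distinct() {
  return !std::is_same<VectorOfFour, FourBitInt>::value;
}

// Bit width carried by the mock arbitrary-precision type.
constexpr int four_bit_width() { return FourBitInt::width; }
```

In real code the same idea would apply to whichever (u)intxx names your design uses: write the namespace explicitly rather than relying on a using-directive.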
Bug Fixes
- Fixed -f[no-]sycl-rdc option in "clang-cl."
- Fixed emission of FPGA annotations.
- Cleaned up Windows defaultlib linking behaviors.
- Fixed an issue where the dimensions of the work-group hint were not correctly reversed.
- Fixed a crash during the generation of debugging information with the option -ffile-prefix-map.
- Fixed operator- for constant accessor iterator.
- Fixed the case when sub-device free memory may exceed root-device free memory.
- Fixed sycl::handler::require() to accept a non-placeholder argument.
- Fixed sycl::ext::oneapi::experimental::joint_reduce; it missed the case when the work-group size is bigger than the size of input data.
- Added missing include of sycl/builtins.hpp to sycl/ext/intel/math.hpp.
- Fixed host compilation issue for lsc_atomic_update for store operation in ESIMD extension.
- Fixed event profiling for sycl::info::event_profiling::command_submit in Level Zero and other backends.
- Fixed sycl::group::get_local_linear_id(). Now, it uses the correct range to linearize the local ID.
- Fixed compilation issue when sycl::stream::size was called from device code.
- Fixed sycl::host_accessor zero dimension constructors.
- Fixed atomic memory orders in reduction implementation.
- Fixed accessor subscript ambiguity and reference type.
- Fixed an issue with missing explicit instantiations in sycl/properties/queue_properties.hpp.
- Fixed return type of sycl::reducer::combine.
- Fixed an error when a scalar offset is provided as a parameter to the API.
- Fixed return type of identity-less reduction as it used an unexpected value for the deprecated placeholder template parameter in the accessor.
- Fixed variadic sycl::marray constructor by accepting values that are indirectly convertible to the element type of the marray.
- Fixed sycl::atomic_ref constructor to fix operator ambiguity.
- Fixed handling of host-side memory in 2D memory operations.
- Fixed constexpr initialization of sycl::vec for sycl::half.
- Fixed default sycl::host_accessor iteration methods.
- Fixed lsc_prefetch_2d in ESIMD extension.
- Fixed over-allocation in specialization constants.
- Fixed sycl::handler::get_specialization_constant segmentation fault that happened in certain cases, likely due to strict aliasing violations.
- Fixed a regression in ESIMD's atomic_update.
- Fixed return type of the sycl::accessor::get_pointer and sycl::local_accessor::get_pointer.
- Starting with the 2023.2 compiler release, using immediate command lists is the default submission mode on Intel® Data Center GPU Max Series running on Linux. For further details, please refer to the Level Zero Immediate Command Lists document.
- Fixed an FPGA compilation issue where linking stages with a single icpx command resulted in missing source code browser in the generated FPGA optimization reports if the source code was not located in the current directory.
- Fixed an FPGA kernel compilation issue where the compiler returned a warning message when calling the sycl::ext::oneapi::experimental::printf() function.
- Fixed an FPGA issue where all loop attributes’ metadata was not generated if applied to a do-while(1) loop, and the kernel was being submitted to sycl::queue directly in a lambda expression.
- Fixed two issues with the FPGA optimization reports where user-defined loop labels were not working as expected, and the Details section of the report was not reporting the critical path when II = 2 or the fMAX degraded.
- Fixed an issue with the FPGA code sample Shannonization and added Windows support back to this sample.
- Fixed an FPGA compilation issue on Windows where the compiler crashed sporadically with the Running pass 'ConvertKernelArgAnnToMetadata' error message.
- Fixed an FPGA compilation issue where the output lost the sign after conversion when calling to_double() on an ac_int variable of size (8*N + 1) inside a kernel.
- Fixed two issues in the FPGA IP authoring flow where the compiler would issue an error message when you apply LSU controls or split the IP implementation between header files and source files.
- Fixed multiple issues in the FPGA emulation flow relating to the task_sequence functions.
- Fixed an FPGA IP authoring flow issue where host-only designs were not supported, and the compiler would crash when compiled with the -Xstarget command option without a kernel in the program.
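The sycl::group::get_local_linear_id() fix in the list above is about which range is used to linearize the local ID. As a minimal sketch in plain C++ (not the runtime's implementation), the row-major linearization rule that SYCL 2020 defines for three dimensions can be written as follows. Index3, linearize, and total_size are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper type: a 3-D index or extent, slowest-varying dimension first.
struct Index3 {
  std::size_t d0, d1, d2;
};

// Row-major linearization for three dimensions:
//   linear = id0 * r1 * r2 + id1 * r2 + id2
// `range` must be the extent the id was taken from; for a local ID that is
// the work-group's local range.
constexpr std::size_t linearize(Index3 id, Index3 range) {
  return id.d0 * range.d1 * range.d2 + id.d1 * range.d2 + id.d2;
}

// Number of elements covered by a 3-D range.
constexpr std::size_t total_size(Index3 range) {
  return range.d0 * range.d1 * range.d2;
}
```

For a local range of {2, 4, 8}, the local ID {1, 2, 3} maps to 1*4*8 + 2*8 + 3 = 51. Linearizing the same ID against a different range, for example {4, 4, 4}, gives 27 instead, which is why the choice of range matters.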
Known Issues and Limitations
- If you installed Intel compilers as part of the oneAPI 2023.2 release of the Intel® oneAPI Base Toolkit, the Intel® oneAPI HPC Toolkit, the Intel® oneAPI IoT Toolkit, or from the oneAPI Standalone Component page, please install the appropriate patch for your environment.
Two patches are now available, one for each of the Intel C++ and Fortran compilers, that were published as part of oneAPI 2023.2:
* Intel® oneAPI DPC++/C++ Compiler and Intel® C++ Compiler Classic
* Intel® Fortran Compiler Classic and Intel® Fortran Compiler
The patch version is 2023.2.1.
These patches apply only to Linux* and Windows*.
These patches resolve the issue of missing Environment Modules utility modulefiles and other issues.
The patches are available from the Intel® Registration Center and from other distribution channels such as APT, YUM, and the standalone component page.
- The yum/rpm/apt packages containing the Intel oneAPI DPC++/C++ Compiler do not have the correct GNU g++ dependency information. To avoid an unusable compiler, install the g++ version that matches the gcc version currently installed.
- When the Windows* path contains non-ANSI characters, the SYCL runtime reports that no devices are currently available. As a workaround, remove the non-ANSI characters from the path. A fix is in progress, and this note will be updated when a timeline for the fix is available.
- The following OpenMP offloading features are not currently supported:
  - Non-placement new and delete
  - Register and thread-local storage qualifiers
  - Virtual functions (supported in C++ but not yet in Fortran)
  - Exception handling
  - C++ standard library (only printf() is supported for GPU)
  - Variadic functions
- Variable Length Array (VLA) is not supported for the tasking model and offloading.
- Having MESA OpenCL implementation that provides no devices on a system may cause incorrect device discovery. As a workaround, such an OpenCL implementation can be disabled by removing /etc/OpenCL/vendor/mesa.icd.
- -fsycl-dead-args-optimization can't help eliminate the offset of the accessor even though it is created with no offset specified.
- SYCL 2020 barriers show worse performance than SYCL 1.2.1 barriers.
- When using fallback assert in a separate compilation flow, it requires explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.
- The alignment of allocation requests is limited to 64 KB, the only alignment that Level Zero supports.
- User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e., an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
- A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with it and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
- The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
- Using sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
- Linkage errors with the following message can occur when a SYCL application is built with Microsoft Visual Studio 2019 earlier than version 16.3.0 and the user specifies -std=c++14 or /std:c++14: error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@_NB) already defined
- Printing internal defines is not supported on Windows.
- With the 2023.2 release, SYCL bindless textures are supported only on non-Intel® architecture hardware. Support for Intel® hardware is planned for the 2024.1 release.
- When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- When compiling for FPGA and trying to reduce the II of the II-critical path, the scheduler may return an incorrect II-critical path. This means the compiler reduces the II of the wrong path, and the II goal is not achieved. You might observe this issue only when multiple negative cycles are in the LSU's critical path. There is no known workaround for this issue. However, your design’s functionality stays unaffected. Performance (QoR) might get degraded slightly.
- When compiling for FPGA, the compiler might produce a different intermediate representation (IR) on Windows than on Linux. Misaligned structs cause this issue. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows. As a workaround, force an alignment on the misaligned structs, as shown in the following example:
// Code with misaligned struct
struct Item {
  bool valid;
  int value1;
  unsigned char value2;
};

// Forced alignment of struct
struct Item {
  bool valid;
  bool __empty__[3];
  int value1;
  unsigned char value2;
  unsigned char __empty2__[3];
};
- In the FPGA optimization report, the Loop Viewer (Alpha) can only handle loops with 100 iterations or less currently. For designs with loops greater than 100 iterations, the optimization reports hang. There is no known workaround for this issue.
- The FPGA optimization report reports incorrect area utilization data from Quartus compiles for Intel Quartus Prime Pro Edition software versions 23.1 and later. Currently, there is no known workaround for this issue.
- On Windows, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display the GPU process launch failed error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool from the mapped network drive to your local computer and run it locally.
- Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, either use the Intel oneAPI FPGA Reports Tool or copy the FPGA optimization reports directory to a local file system and launch it using a supported browser.
- When compiling for the FPGA optimization report flow, the list of optimization flags used for a compile may be incomplete or unavailable in the FPGA Reports Summary Page under certain circumstances, such as when using the -g0 flag. As a workaround for this issue, avoid using the -g0 flag in your compilation command. Also, if you use the -ghdl flag, make it the last argument in your command.
- When generating FPGA optimization reports, the compiler might crash for any design with pipes having a capacity 0. If only a few pipes (but not all pipes) have a capacity of 0 in the design, then only those with a capacity will appear in the Area report. As a workaround for the compiler crash, assign a capacity (for example, 1) to one of the pipes with capacity 0.
- When compiling for FPGA, if you specify output target names that are pure numbers or that start with a number, the compiler errors out and might display an error message, as shown in the following example:
icpx -fsycl -fintelfpga -Xssimulation basic.cpp -o 2
aoc: Compiling for Simulator.
Error: Simulation system generation FAILED.
Refer to 2.prj/2.log for details.
llvm-foreach:
icpx: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
- Task Sequence functions with struct returns are currently unsupported due to a known issue.
- When compiling for FPGA, the compiler ignores the sycl::property::buffer::mem_channel buffer property. Irrespective of whether you specify the property or not, all buffer allocations are allocated to the first memory channel. Currently, there is no known workaround for this issue.
- When running pipelined kernels in the FPGA simulation flow, the compiler may not achieve the lowest possible II in the waveforms. Currently, there is no known workaround for this issue.
- When compiling for FPGA, the compiler might ignore non-RTL source library functions when the library archive file also contains RTL source objects and report the following error message:
Compiler Error: undefined reference to <non-RTL source library function>
As a workaround for this issue, avoid placing RTL and non-RTL source library objects in the same archive file.
- When compiling a hyper-optimized loop using the [[intel::max_reinvocation_delay]] FPGA loop attribute, you might encounter the following assert message for the loop with local memory LSUs on the loop’s II critical path due to loop-carried memory dependency:
Error: Assert failure at ../../../source/acl/llvm-project/llvm/lib/Target/FPGA/Griffin/GriffinIISearch.cpp(258)
m_specified_max_II == 0 || m_forced_II == 0 FAILED
As a workaround for this issue, remove the [[intel::max_reinvocation_delay]] attribute from your loop, as it is unlikely that your loop can achieve a reinvocation delay of 1 (the only supported value currently).
- The compiler is not constrained to the specified LSU style when requesting a particular LSU style using the FPGA LSU controls for a struct data type. Instead, it chooses the best LSU style for the access pattern. As a workaround for this issue, avoid using LSU controls with the struct data type and use simple data types instead.
- When compiling for FPGA IP Authoring flow only with Intel® Quartus® Prime Pro Edition software, the RTL library feature is not working as expected, and the compilation might fail in the late stages. As a workaround for this issue, compile RTL libraries in the simulation flow.
- With the FPGA IP Authoring flow, you can intuitively integrate your design into the Platform Designer by copying the generated .prj folder into your Intel® Quartus® Prime project directory. The Platform Designer detects the project automatically. However, there is a known issue with the generated hw.tcl file, which is not mapping the signals correctly. To work around this issue, follow these steps on both Linux and Windows systems:
  - Add python to your PATH environment variable to run python from your command line.
  - Execute the following commands to run the <kernel-name>_di_hw_tcl_adjustment_script.py python script generated in your .prj directory before integrating your IP authoring kernel into the Platform Designer:
    $ cd <kernel_name>.prj
    $ python <kernel-name>_di_hw_tcl_adjustment_script.py
- The FPGA IP authoring encryption flow is not fully supported on Windows systems.
- In the FPGA IP Authoring flow, the compiler reports the following error when you use mmhost macros on kernel arguments that are used inside a lambda within the kernel function:
Compiler Error: Could not generate the requested kernel argument interfaces.
Error: Optimizer FAILED.
As a workaround for this issue, replace MyLambda with an equivalent function call.
Consider the following example where you would observe this issue:
class FeederKernel {
  mmhost(
    kBL2,  // buffer_location or space
    28,    // address width
    256,   // data width
    0,     // latency
    0,     // read_write_mode, 0: ReadWrite, 1: Read, 2: Write
    1,     // maxburst
    0,     // align, 0 defaults to alignment of the type
    1      // waitrequest, 0: false, 1: true
  ) int *MB;

 public:
  FeederKernel(int *MB_in) : MB(MB_in) {}

  void operator()() const {
    int reg[N];
    MyLambda([&](auto i) { reg[i] = MB[i]; });
  }
};
Based on the issue with using mmhost macros on kernel arguments, the fpga_tools::UnrolledLoop utility defined in the unrolled_loop.hpp code sample header file does not support the kernel argument interface macros (mmhost, conduit_mmhost, and register_map_mmhost).
For example:
fpga_tools::UnrolledLoop<ROWS>([&](auto row) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
});
As a workaround, use the #pragma unroll before a for loop, as shown in the following example:
#pragma unroll
for (int row = 0; row < ROWS; row++) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
}
- When compiling for FPGA, the compiler might crash if any global memory in the board_spec.xml file does not have a name field. Ensure that all global memories in the board_spec.xml file have a name field. For example, <global_mem name="DDR" ... >
- When compiling for FPGA and linking multiple fat static libraries containing the device code (produced using the -fsycl-link=image flag), only the device code from the first library is included in the fat executable, and the following error message is returned:
> what(): native api failed. native api returns: -46 (pi_error_invalid_kernel_name)
> terminate called after throwing an instance of 'sycl::_v1::exception
As a workaround for this issue, dynamically link the host code instead of linking statically.
Example compile commands:
icpx -fsycl main.cpp -c -o main.o
icpx -fsycl -fintelfpga -fpic -shared add_kernel.cpp -o libadd_kernel.so
icpx -fsycl -fintelfpga -fpic -shared sub_kernel.cpp -o libsub_kernel.so
icpx -fsycl -fintelfpga main.o -L. -ladd_kernel -lsub_kernel -o hot_swapper
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:. ./hot_swapper
- If your design includes a device_global memory greater than 1024 bits in size and you have not initialized it in the kernel, then you might see incorrect behavior when compiling for the simulator. Memory size greater than 1024 bits can happen due to the following reasons:
- device_global is an array with a size greater than 1024 bits.
- device_global is a scalar that uses ac_ints (or other large types) with a size greater than 1024 bits.
- device_global is used in a coalesced load where the size of the load is greater than 1024 bits.
This is caused by a bug in the Intel Quartus Prime Pro Edition software that occurs when initializing memory using a MIF file. As a workaround for this issue, zero-initialize the contents of the device_global memory before accessing the memory.
- When you use the device_global variable in a single_task kernel and set the [[intel::max_global_work_dim()]] FPGA kernel attribute to 0, you might see an intermittent issue with the device_global variable initialization on the Intel® FPGA PAC D5005 or Intel Stratix 10 reference boards, resulting in an error in the hardware run. This is now the default behavior when using the latest SYCL resource version included with the oneAPI 2023.2 release. To avoid encountering this issue, do not use the [[intel::max_global_work_dim(0)]] FPGA kernel attribute.
Example code where you will hit this issue:
device_global<bool, decltype(properties(device_image_scope, host_access_none))> global_bool;

int main() {
  queue q;
  q.submit([&](handler& h) {
    h.single_task<Task>([=]() [[intel::max_global_work_dim(0)]] {
      // use of global_bool
      // ...
    });
  });
}
- You might encounter functional failures in the FPGA emulation flow when resetting a device_global and a new device_image is loaded without the device_image scope property. Currently, there is no known workaround for this issue.
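The misaligned-struct workaround in the list above relies on writing the padding out by hand so that the layout no longer depends on what each platform inserts implicitly. A minimal sketch of how such a layout could be checked at compile time follows. It assumes a typical ABI where bool is 1 byte and int is 4 bytes with 4-byte alignment; the names ItemImplicit, ItemExplicit, pad0, pad1, and layouts_match are hypothetical and chosen only for this illustration.

```cpp
#include <cstddef>

// Layout from the known issue: the compiler decides where padding goes.
struct ItemImplicit {
  bool valid;
  int value1;
  unsigned char value2;
};

// Workaround layout: the same padding written out explicitly.
// (Member names are illustrative and avoid reserved double-underscore identifiers.)
struct ItemExplicit {
  bool valid;
  bool pad0[3];
  int value1;
  unsigned char value2;
  unsigned char pad1[3];
};

// True when both structs place their data members at the same offsets and
// have the same total size on the platform being compiled for.
constexpr bool layouts_match() {
  return sizeof(ItemImplicit) == sizeof(ItemExplicit) &&
         offsetof(ItemImplicit, value1) == offsetof(ItemExplicit, value1) &&
         offsetof(ItemImplicit, value2) == offsetof(ItemExplicit, value2);
}

// Assumption: 4-byte int with 4-byte alignment, so the padded struct is 12 bytes.
static_assert(sizeof(ItemExplicit) == 12, "explicitly padded struct should be 12 bytes");
```

A check like this can be compiled on each target platform; if it fails on one of them, the explicit padding needs adjusting for that platform.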
oneAPI 2023.1, Compiler Release 2023.1
New Features and Improvements
- Intel® changed the default graphics compute binary format to a new portable ELF-based ZE Binary format beginning with the oneAPI 2023.1 release. To ensure a smooth transition to the new format, the -ze-disable-zebin and -cl-disable-zebin graphics backend options are provided to fall back to the legacy binary format for either offline or online compilation. The new ZE Binary format is generated by default, and the deprecated legacy format will no longer be available in a future release.
Examples to create the legacy format:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend=spir64_gen "-device * -options -cl-disable-zebin" file.cpp
icpx -qopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend=spir64_gen "-device * -options -cl-disable-zebin" file.cpp
- Added support for per-object device code compilation under the option -fno-sycl-rdc. This improves compiler performance and reduces memory usage, but can only be used if there are no cross-object dependencies.
- Added support for per-aspect device code split mode.
- Extended support for the large GRF mode to non-ESIMD kernels.
- Implemented the sycl_ext_intel_device_architecture extension.
- Implemented the sycl_ext_intel_device_architecture experimental extension.
- Implemented accessor member functions swap, byte_size, max_size, and empty.
- Implemented SYCL 2020 default accessor constructor.
- Implemented SYCL 2020 accessor iterators.
- Changed value_type of read-only accessors to const in accordance with SYCL 2020.
- Implemented SYCL 2020 multi_ptr and address_space_cast.
- Implemented SYCL 2020 has_extension free functions.
- Implemented SYCL 2020 aspect_selector.
- Implemented new SYCL 2020 style FPGA selectors.
- Implemented SYCL 2020 default async_handler behavior.
- Implemented SYCL 2020 is_compatible free function.
- Implemented queue shortcut functions with placeholder accessors.
- Added support for creating a kernel bundle with descendent devices of the passed context's members.
- Implemented non-blocking destruction and deferred release of memory objects without attached host memory.
- Implemented the sycl_ext_oneapi_queue_priority extension.
- Implemented the sycl_ext_oneapi_user_defined_reductions extension.
- Implemented the sycl_ext_oneapi_queue_empty extension proposal.
- Implemented the sycl_ext_oneapi_weak_object extension.
- Implemented the sycl_ext_intel_cslice extension. The old behavior that exposed compute slices as sub-sub-devices is now deprecated. For compatibility purposes, it can be brought back via the SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING environment variable.
- Implemented the sycl_ext_intel_queue_index extension.
- Implemented the sycl_ext_oneapi_memcpy2d extension.
- Implemented device ID, memory clock rate, and bus width information queries from the sycl_ext_intel_device_info extension.
- Implemented ext::oneapi::experimental::radix_sorter from the sycl_ext_oneapi_group_sort extension proposal.
- Added support for sorting over sub-groups.
- Added C++ API wrappers for the Intel math functions ceil, floor, rint, sqrt, rsqrt, and trunc.
- Implemented a SYCL device library for bfloat16 Intel math function utilities.
- Added support for range reductions with any number of reduction variables.
- Added support for reductions with kernels accepting item.
- Enabled sub-group masks for 64-bit subgroups.
- Implemented the new non-experimental API for DPAS.
- Added 8/16-bit type support to lsc_block_load and lsc_block_store ESIMD API.
- Implemented atomic operation support in the ESIMD emulator.
- Added various trivial utility functions for the half type.
- Added type cast functions between half and float/integer types to libdevice.
- Implemented the ONEAPI_DEVICE_SELECTOR environment variable, which supports the SYCL_DEVICE_FILTER syntax and additionally allows exposing GPU sub-devices as SYCL root devices and using negative filters. SYCL_DEVICE_FILTER is now deprecated.
- Added the SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION environment variable.
- Added the `InferAddressSpaces` pass to the SPIR/SPIR-V compilation pipeline, reducing the size of the generated device code.
- Redesigned pointer handling so that it no longer decomposes kernel argument types containing pointers. The kernel lambda operator is now always inlined in the device code entry point unless -O0 is used.
- Improved entry point handling in the sycl-post-link tool.
- The reqd_work_group_size attribute now works with 1, 2, or 3 operands.
- Enabled using -fcf-protection option with -fsycl, which results in it being applied only to host code compilation and producing a warning.
- Linux-based compiler driver on Windows now pulls in the `sycld` debug library when `msvcrtd` is specified as a dependent library.
- Added /Zc:__cplusplus as a default option during host compilation with MSVC.
- Improved the ESIMDOptimizeVecArgCallConv optimization pass to cover more IR patterns.
- Added support for more types in ESIMD lsc functions.
- Added error diagnostics for using sycl::ext::oneapi::experimental::annotated_arg/ptr as a nested type.
- The status of bfloat16 support was changed from experimental to supported.
- Updated online_compiler with Gen12 GPU support.
- get_kernel_bundle and has_kernel_bundle now check that the kernels are compatible with the devices.
- Waiting for an event associated with a kernel that uses a stream now also waits for the stream to be flushed.
- Added the requested device type to the message of the exception thrown when no such devices are found.
- Optimized operator[] of host_accessor.
- Improved reduction performance on discrete GPUs.
- Added invoke_simd support for functions with void return type.
- The Level Zero plugin now creates every event as host-visible by default.
- Added Level Zero plugin support for global work sizes greater than UINT32_MAX as long as they are divisible by some legal work-group size and the resulting quotient does not exceed UINT32_MAX.
- Improved native Level Zero event handling in the immediate command list mode by removing excessive status queries.
- Removed an uninitialized buffer migration copy in the Level Zero plugin.
- Implemented an optimization that reuses discarded Level Zero events in the plugin.
- The host device is now inaccessible.
- Removed deprecated make_queue API.
- Deprecated group::get_global_range().
- Added support for FPGA IP authoring flow. It allows you to target your SYCL* code to generate standalone IP components on different targets and integrate them into a custom Intel® Quartus® Prime project. You can target your compilation to a supported Intel® FPGA device family or part number instead of a specific acceleration platform.
- Updated the FPGA product family name to “Intel Agilex 7.”
- Added support for minimum latency flow to gauge the FPGA performance.
- Added support for max_reinvocation_delay attribute in FPGA.
- Added support for the on value of the -Xshyper-optimized-handshaking compiler option for FPGA, in addition to the existing values.
- Added support for the device_globals FPGA extension.
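The Level Zero global work size rule in the list above (sizes greater than UINT32_MAX are accepted when some legal work-group size divides them and the quotient does not exceed UINT32_MAX) can be sketched in plain C++. This is a minimal illustration, not the plugin's code: the helper names are hypothetical, and the sketch assumes that every size from 1 to max_wg_size is a legal work-group size.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a plugin API): true if wg_size divides global_size
// and the resulting number of work-groups fits in 32 bits.
constexpr bool fits_in_group_count(std::uint64_t global_size, std::uint32_t wg_size) {
    return wg_size != 0 &&
           global_size % wg_size == 0 &&
           global_size / wg_size <= std::numeric_limits<std::uint32_t>::max();
}

// Hypothetical helper: searches work-group sizes from max_wg_size down to 1.
// Assumption: every size in [1, max_wg_size] is legal on the device.
constexpr bool is_supported_global_size(std::uint64_t global_size, std::uint32_t max_wg_size) {
    for (std::uint32_t wg = max_wg_size; wg >= 1; --wg) {
        if (fits_in_group_count(global_size, wg)) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a global size of 2^33 with work-groups of up to 1024 items passes the check, because 2^33 / 1024 = 2^23 work-groups fit in 32 bits.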
Bug Fixes
- Fixed a crash when attempting to compile code that uses a function object without a defined call operator as a kernel.
- Fixed a crash that occurred during the compilation of device code with a captured structured binding.
- Fixed the work_group_size_hint attribute not being applicable to lambda functions using non-conforming syntax.
- Fixed integration header parameter kind information for annotated types.
- Fixed an issue with offload dependencies when using -fsycl-force-target.
- Fixed debug information generation when an integration footer is present.
- Fixed a __builtin_printf related error when compiling device code with _GLIBCXX_ASSERTIONS=1.
- Fixed a compiler error that occurred during archive generation when using -fsycl-link for FPGA.
- Fixed memory corruption caused by the ESIMDOptimizeVecArgCallConv pass.
- Fixed a crash during ESIMD intrinsic generation.
- Fixed libclc function mangling.
- Fixed an issue where the in-order queue property was not respected when submitting USM commands and host tasks.
- Fixed a memory leak when enqueueing a barrier to a discard_events queue.
- Fixed a memory leak related to submitting host tasks without memory object dependencies.
- Fixed an invalid event error when handling cross-queue no-op dependencies.
- Fixed an error when setting a specialization constant in a command group with no kernel.
- Fixed an issue where submitting a kernel that explicitly depends on a host task was a blocking call that waited for the host task.
- Removed noexcept from some of usm_allocator member functions to align with the specification.
- Fixed ext::intel::experimental::atomic_update with the fcmpwr operation.
- Fixed memory leak issues when constructing a SYCL kernel/kernel_bundle using interoperability.
- Fixed an error where the native handle returned by get_native from a default constructed event was unusable.
- Fixed an issue where reinterpreting a buffer to a const type changed the corresponding buffer_allocator type to const.
- Fixed handler::set_arg with local_accessor.
- Added the missing default template argument for sycl::info::device::max_work_item_sizes.
- Fixed an issue where some aspects could be incorrectly reported as unsupported by a device.
- Fixed return type of scalar versions of relational functions. The fix requires defining SYCL2020_CONFORMANT_APIS macro.
- Fixed an issue where the device code cache was not used if the compilation was triggered by different paths.
- Fixed a use-after-move bug when caching device code built for multiple devices.
- Removed the unintended requirement of fp64 support from stream and ESIMD float fmod implementations.
- Fixed several complex math operations failing on devices that don't support fp64.
- Aligned host side float-to-half mantissa rounding with device side.
- Fixed float-to-half conversion of the half minimum subnormal value on the host.
- Fixed marray math function implementation.
- Fixed an out-of-bounds write in the group operations implementation.
- Fixed a reduction performance regression caused by using the wrong implementation for the float type.
- Fixed header deprecation warnings to work properly on Linux.
- Fixed deprecation of SYCL 1.2.1 device selectors.
- Fixed multiple issues in GDB xmethods scripts.
- Fixed an issue with sycl-prof JSON output.
- Fixed compilation errors on Windows when using the ESIMD API.
- Fixed invalid calculation in the ESIMD tanh function.
- Fixed kernel_bundle errors when using ESIMD emulator devices.
- Fixed an issue where the ESIMD emulator was picked by the default selector even in the presence of other devices.
- Fixed an error when querying an ESIMD emulator device for sub-group sizes.
- Fixed invalid behavior of the maximum sub-group size query on some OpenCL systems.
- Fixed an issue where the OpenCL plugin checked whether a program is supported on a device by looking up platform version/extensions rather than device ones.
- Fixed the result of the free device memory query with the Level Zero backend.
- Fixed an issue with ext_oneapi_barrier not working when using the Level Zero backend.
- Fixed a hang after submitting a barrier to a Level Zero in-order queue.
- Fixed an issue that occurred when submitting a barrier to a Level Zero queue with no prior submissions.
- Fixed a memory leak when tracking indirect access in the Level Zero plugin.
- Fixed an invalid read issue that occurred during the Level Zero event release.
- Fixed a synchronization issue when using device scope Level Zero events.
- Fixed an issue that occurred when using get_native on a newly constructed Level Zero queue.
- Fixed a segmentation fault related to events recycling in immediate command list mode in the Level Zero plugin.
- Fixed an issue where an invalid maximum of compute units was reported for Level Zero sub-sub-devices.
- Fixed a segmentation fault when using Level Zero sub-sub-devices with the immediate command lists mode.
- Reverted the Level Zero plugin change that preferred using copy engine for memory read/write operations due to functional regressions.
- Added the missing fp16 case of FMulKHR libclc function.
- Fixed an FPGA issue where simulating FPGA designs with a host channel led to two signal mismatch errors (dataBitsPerSymbol and firstSymbolInHighOrderBits).
- Fixed an FPGA emulator issue where it was not recognizing different Avalon interfaces when defining a host pipe.
- Fixed an issue with hyper-optimized loops in FPGA where the compiler could not implement the loop with an appropriate II for the given max_reinvocation_delay value.
- Fixed an FPGA compilation issue where SYCL code containing the std::popcount function inside a fixed-size loop (bit-widths not in 8, 16, 32, or 64) would get mapped directly into llvm.ctpop.
- Fixed an FPGA compilation issue where the compiler used a read-only accessor for a very wide struct, and the compile time was significantly high.
Known Issues and Limitations
- Runtime Out of Memory Error Using GPU
- There is a known issue with the cleanup of resources at queue synchronization points in longer-running jobs (most likely to show up in multi-tile or multi-device setups). It can exhaust resources on the device and cause out-of-memory errors.
- Workaround
- Where this issue is identified, users can apply the compiler hotfix available at the links below. With the hotfix in place, these resources are cleaned up more often, that is, once the number of allocated resources reaches a certain threshold. A default value is provided, and users can exercise finer-grained control through the SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD environment variable. If the value is non-negative, the threshold is set to that value. If the value is negative, the threshold is set to INT_MAX. Whenever the number of command lists in a queue exceeds this threshold, an attempt is made to clean up completed command lists for subsequent reuse. The default is 20.
-
-
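The threshold semantics described for SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD can be sketched as follows. This is an illustration of the documented rule only, not the runtime's implementation: the function names are hypothetical, and the handling of unset, non-numeric, or out-of-range values is an assumption made here.

```cpp
#include <climits>
#include <cstdlib>

// Default threshold stated in the release note.
constexpr int kDefaultCleanupThreshold = 20;

// Maps a requested value to the effective threshold: non-negative values are
// used as-is, negative values mean INT_MAX.
// Assumption: values above INT_MAX are clamped to INT_MAX.
constexpr int effective_cleanup_threshold(long requested) {
    return requested < 0 ? INT_MAX
         : requested > INT_MAX ? INT_MAX
         : static_cast<int>(requested);
}

// Hypothetical reader for the environment variable.
// Assumption: an unset or non-numeric value falls back to the default;
// the actual runtime may parse the variable differently.
inline int cleanup_threshold_from_env() {
    const char* text = std::getenv("SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD");
    if (text == nullptr) {
        return kDefaultCleanupThreshold;
    }
    char* end = nullptr;
    const long requested = std::strtol(text, &end, 10);
    if (end == text) {
        return kDefaultCleanupThreshold;  // nothing could be parsed
    }
    return effective_cleanup_threshold(requested);
}
```

Setting the variable to a negative value therefore effectively disables the threshold-triggered cleanup, while a small non-negative value makes cleanup attempts more frequent.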
- Users may encounter a segmentation fault when building ESIMD kernels using AOT (ahead-of-time) compilation. The workarounds are:
- Use JIT compilation, which has logic to group kernels of the same GRF mode into the same L0 module.
- Use IGC Compiler options: -doubleGRF/-ze-opt-large-register-file even for an AOT build, but this will apply the option to all kernels in the given L0 module.
- A change between oneAPI 2023.0 and oneAPI 2023.1 prevents GDB* 10.0 and earlier versions from properly debugging SYCL and OpenMP* CPU offload code produced by the Intel® C, C++, and Fortran compilers. These older versions of GDB are present on RedHat EL8, Ubuntu 20.04, and Rocky 8. Two workarounds are available:
- Use the Intel® Distribution of GDB*.
- Download an open-source version of GDB newer than version 10.0.
- An error might occur when certain nested conditional clauses are vectorized without prior optimization passes. Consider a loop like the following example, which is marked with the omp simd directive:
void foo(long *restrict lp1, long *restrict lp2, long *restrict lp3) {
  long l1;
  #pragma omp simd simdlen(8)
  for (l1 = 0; l1 < 8; l1++) {
    lp1[l1] = l1;
    if (lp1[l1] != l1) { /* if-1 */
      if (lp2[l1 != 0]) /* if-2 */
        lp3[2*l1] = 1;
    }
  }
}
- When running code that requires fp64/double on hardware that does not support it, users see the error message "PI_ERROR_INVALID_ARG_VALUE" instead of a message about an unsupported hardware feature. This is being resolved for the next release.
- When compiling AOT for GPU targets, the wrong OCLOC may be called to perform the device compilation. If the requested device name contains a dash ('-'), the offline compilation may fail with the following diagnostic (-device ats-m75 used here):
Unknown device range : ats
Failed to parse target devices from : ats-m75
Primary workaround:
Use the equivalent version value instead, -device 12.56.5. This information can be acquired by using the OCLOC tool directly:
> ocloc ids ats-m150
Matched ids: 12.55.8
> ocloc ids ats-m75
Matched ids: 12.56.5
Secondary workaround:
1. Set up your environment to use the desired OCLOC directly.
2. Add ocloc.exe to your PATH.
3. Unset any environment variables that help the compiler find the OCLOC binary: OCLOCROOT and OCLOCVER. The OCLOC can potentially also be discovered via the LIB environment variable.
- When compiling for spir64_gen targets on Windows, there is a possibility of hitting memory limitations due to the number of ‘-device <arg>’ targets that are used and/or the size of the device libraries/objects being used during the link. If the error "out of memory; Allocation failed; Exception Code: 0xC000001D .." is encountered, a potential workaround is to reduce the number of target devices for the given compilation.
- A MESA OpenCL implementation that provides no devices may cause incorrect device discovery on a system. As a workaround, such an OpenCL implementation can be disabled by removing /etc/OpenCL/vendor/mesa.icd.
- Compilation may fail on Windows in debug mode if a kernel uses std::array. This happens because the debug version of std::array in the Microsoft STL C++ headers calls functions that are illegal in device code. As a workaround, the following can be done:
- Dump compiler pipeline execution strings by passing the -### option to the compiler. The compiler will print the internal execution strings of compilation tools. The actual compilation will not happen.
- Modify the (usually) first execution string (it should have -fsycl-is-device option) by adding -D_CONTAINER_DEBUG_LEVEL=0 -D_ITERATOR_DEBUG_LEVEL=0 options to the end of the string. Execute all strings one by one.
- -fsycl-dead-args-optimization can't help eliminate the offset of the accessor even though it's created with no offset specified.
- SYCL 2020 barriers show worse performance than SYCL 1.2.1.
- When using fallback assert in a separate compilation flow, it requires explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.
- The alignment of allocation requests is limited to 64KB, which is the only alignment supported by Level Zero.
- User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e. an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
- A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with the context and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open up the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
- The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
- Using sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
- Linkage errors with the following message: error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and the user specifies -std=c++14 or /std:c++14.
- Printing internal defines is not supported on Windows.
- When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
icpx -fsycl -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
icpx -fsycl -fintelfpga <other arguments> -Xshardware kernel.o
- When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- In the FPGA optimization report, the Loop Viewer (Alpha) can only handle loops with 100 iterations or less currently. For designs with loops greater than 100 iterations, the optimization reports hang. There is no known workaround for this issue.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: ‘…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- When compiling for FPGA and trying to reduce the II of the II-critical path, the scheduler may return an incorrect II-critical path. This means the compiler reduces the II of the wrong path, and the II goal is not achieved. You might observe this issue only when multiple negative cycles are in the LSU's critical path. There is no known workaround for this issue. However, your design’s functionality stays unaffected. Performance (QoR) might get degraded slightly.
- When compiling for FPGA, the compiler might produce a different intermediate representation (IR) on Windows than on Linux. Misaligned structs cause this issue. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows. As a workaround, force an alignment on the misaligned structs, as shown in the following example:
//Code with misaligned struct
struct Item {
  bool valid;
  int value1;
  unsigned char value2;
};
//Forced alignment of the struct
struct Item {
  bool valid;
  bool __empty__[3];
  int value1;
  unsigned char value2;
  unsigned char __empty2__[3];
};
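One way to confirm that a manually padded struct leaves no compiler-inserted padding is to pin its layout with static_assert, so that a layout difference between platforms becomes a compile-time error. The following is a minimal sketch, not part of the product documentation: PaddedItem, pad0, and pad1 are illustrative names chosen here, and the expected offsets assume a 1-byte bool and a 4-byte, 4-byte-aligned int, as on typical x86-64 ABIs.

```cpp
#include <cstddef>

// Explicitly padded variant of the struct from the workaround above.
// pad0 and pad1 are illustrative names used in place of __empty__ and __empty2__.
struct PaddedItem {
    bool valid;
    bool pad0[3];            // fills the gap before the 4-byte int
    int value1;
    unsigned char value2;
    unsigned char pad1[3];   // fills the tail up to a multiple of 4 bytes
};

// Assumption: 1-byte bool and 4-byte, 4-byte-aligned int.
static_assert(offsetof(PaddedItem, value1) == 4, "value1 should start at byte 4");
static_assert(offsetof(PaddedItem, value2) == 8, "value2 should start at byte 8");
static_assert(sizeof(PaddedItem) == 12, "no implicit padding should remain");
```

If the same header is compiled on both Windows and Linux, these assertions fail at build time on any platform where the layout differs from the one assumed.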
- With the FPGA IP Authoring flow, you can intuitively integrate your design into the Platform Designer by copying the generated .prj folder into your Intel® Quartus® Prime project directory. The Platform Designer detects the project automatically. However, there is a known issue with the generated hw.tcl file, which does not map the signals correctly. To work around this issue, follow these steps on both Linux and Windows systems:
- Add python to your PATH environment variable to run python from your command line.
- Execute the following commands to run the <kernel-name>_di_hw_tcl_adjustment_script.py python script generated in your .prj directory before integrating your IP authoring kernel into the Platform Designer:
$ cd <kernel_name>.prj
$ python <kernel-name>_di_hw_tcl_adjustment_script.py
- When compiling for the FPGA IP authoring flow, if you apply LSU controls, the compiler issues the "Cannot customize LSUs that access fixed latency MM host interfaces" error message. Currently, there is no known workaround to obtain the requested LSU styles in the IP Authoring flow.
- When using the FPGA IP Authoring flow, you might see the following error message:
Use still stuck around after Def is destroyed:i8* getelementptr inbounds ([XX x i8], [XX x i8]* <badref>, i32 0, i32 0)
aocl-opt: ../../../source/acl/llvm-project/llvm/lib/IR/Value.cpp:103: llvm::Value::~Value(): Assertion `materialized_use_empty() && "Uses remain when a value is destroyed!"' failed.
As a workaround for this issue, avoid splitting the IP implementation between header files and source files (for example, myip.h and myip.c). The entire definition and implementation must be in the same file (entirely either in the header file or the source file).
- When compiling for FPGA IP Authoring flow only with Intel® Quartus® Prime Pro Edition software, the RTL library feature is not working as expected, and the compilation might fail in the late stages. As a workaround for this issue, compile RTL libraries in the simulation flow.
- When compiling for FPGA, if you specify output target names that are pure numbers or that start with a number, the compiler errors out and might display an error message, as shown in the following example:
icpx -fsycl -fintelfpga -Xssimulation basic.cpp -o 2
aoc: Compiling for Simulator.
Error: Simulation system generation FAILED.
Refer to 2.prj/2.log for details.
llvm-foreach:
icpx: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
- When compiling for FPGA, the output might lose the sign after conversion when calling to_double() on an ac_int variable of size (8*N + 1) inside a kernel—for example, calling to_double() on an ac_int<33> with a value of -1 outputs 4294967295.0. The only workaround for this issue is to avoid this use case in your design.
- When compiling an FPGA kernel that calls the sycl::ext::oneapi::experimental::printf() function, the compiler issues the following warning message:
compiler warning: argument 'llvm_fpga_printf_buffer_start' on component '<your kernel name>' is never used by the component. Note that the compiler may optimize it away.
There is no known workaround for this issue. However, you can ignore this warning since it does not impact the kernel’s functionality.
- On Windows systems, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display the "GPU process launch failed" error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool application from the mapped network drive to your local computer and run it locally.
- Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, either use the Intel oneAPI FPGA Reports Tool or copy the FPGA optimization reports directory to a local file system and launch it using a supported browser.
- In the FPGA optimization reports, user-defined loop labels are not working as expected. Currently, there is no known workaround for this issue.
- Due to an FPGA hardware run hang issue, Windows support has been removed from the Shannonization code sample.
- The Intel FPGA IP authoring encryption flow is not fully supported on Windows systems.
- In the Intel FPGA IP authoring flow, the fpga_tools::UnrolledLoop utility defined in the unrolled_loop.hpp code sample header file does not support the kernel argument interface macros (mmhost, conduit_mmhost, and register_map_mmhost). For example:
fpga_tools::UnrolledLoop<ROWS>([&](auto row) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
});
As a workaround, use the #pragma unroll before a for loop, as shown in the following example:
#pragma unroll
for (int row = 0; row < ROWS; row++) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
}
- Task Sequence functions with struct returns are currently unsupported due to a known issue.
- When emulating an FPGA design involving task_sequence functions with multiple outstanding async() calls before they start calling the get() function, the order of results that the get() calls receive may not match the order in which the async() calls were made. The workaround is to compile and verify in simulation that the order is as expected.
- When compiling for FPGA on Windows, sporadically you might see a compiler crash with the “Running pass 'ConvertKernelArgAnnToMetadata'” error message. Apply one of the following workarounds for this issue:
- For full system FPGA flow: Recompiling your design should resolve the issue. If recompiling once does not work and the compiler continues to crash, use the -Xsskip-convert-ptr-ann-to-metadata=true option in your icpx command. This compiler command option disables the feature that is leading to the compiler crash.
- For FPGA IP Authoring flow: Recompiling your design should resolve the issue. There is no known workaround if you continue to see the compiler crash. However, do not use the -Xsskip-convert-ptr-ann-to-metadata=true option.
- When emulating your FPGA design, you might encounter a segmentation fault when the task_sequence function has no arguments. As a workaround for this issue, add a dummy argument to the task_sequence function and related async() function calls, as shown in the following example:
void task_a(bool dummy) {
  for (int i = 0; i < N; ++i) {
    int x = t2_channel::read();
    data_out1::write(x);
  }
}
…
bool dummy = false;
ts_a.async(dummy);
- Metadata for FPGA loop attributes is not generated when the attributes are applied to a do-while(1) loop and the kernel is submitted to the sycl::queue directly in a lambda expression.
For example, the intel::ivdep attribute is missing in the following code:
cgh.single_task<class Kernel>([=]() {
  int i = 0;
  [[intel::ivdep]] do {
    if (i >= m) {
      break;
    } else {
      accessorA[i] = accessorA[i + k] * c;
    }
    ++i;
  } while (1);
});
Apply one of the following workarounds for this issue:
- Workaround 1: Write the loop exit condition explicitly in the while loop, as shown in the following example:
cgh.single_task<class Kernel>([=]() {
  int i = 0;
  [[intel::ivdep]] do {
    if (i >= m) {
      break;
    } else {
      accessorA[i] = accessorA[i + k] * c;
    }
    ++i;
  } while (i < m);
});
- Workaround 2: Write the kernel body in a separate function, and call the function from the lambda expression that submits the kernel to the sycl::queue, as shown in the following example:
template <typename T>
void test(T& accessorA, int k, int c, int m) {
  int i = 0;
  [[intel::ivdep]] do {
    if (i >= m) {
      break;
    } else {
      accessorA[i] = accessorA[i + k] * c;
    }
    ++i;
  } while (1);
}
cgh.single_task<class Kernel>([=]() {
  test(accessorA, k, c, m);
});
oneAPI 2023.0, Compiler Release 2023.0
New Features and Improvements
- The compiler now uses C++17 as the default C++ language standard. To use an older standard, specify it with a compiler option. For example, to use C++14, specify -std=c++14.
- Added support for FPGA IP authoring flow. It allows you to target your SYCL* code to generate standalone IP components on different targets and integrate them into a custom Intel® Quartus® Prime project. You can target your compilation to a supported Intel® FPGA device family or part number instead of a specific acceleration platform.
- FPGA optimization reports now support user-defined loop labels replacing the system-generated loop labels. For example:
LOOP1: for( int i = 0; i < 12; i++ ) { ... }
- Added support for the standalone Intel® oneAPI FPGA Reports tool.
- Added support for using latency controls with a stall-free loop in FPGA.
- Added support to view simulation waveforms in the simulators supported by FPGA.
- Added ability to enforce stateless memory accesses for ESIMD.
- Added support for -fsycl-force-target compiler option.
- Added support for -fsycl-link-huge-device-code compiler option, which allows linking object files larger than 2GB.
- Implemented group collective built-in functions for more integral types.
- Implemented SYCL 2020 callable device selectors.
- Implemented SYCL 2020 standalone device selectors.
- Added SYCL 2020 property interfaces for local_accessor, usm_allocator, accessor and host_accessor classes.
- Added support for fpga_simulator_selector.
- Added support for local_accessor. Deprecated target::local.
- Added support for querying free device memory on Level Zero backend.
- Implemented bfloat16 conversions from/to float for host.
- Added support for ext::oneapi::property::queue::discard_events to Level Zero PI plugin.
- Added lsc_atomic support on ESIMD emulator.
- Added dpas support on ESIMD emulator.
- Added C++ API for imf libdevice built-ins.
- Introduced predicates for ESIMD lsc_block_store/load.
- Added experimental set_kernel_properties API and use_double_grf property for ESIMD.
- Added "eager initialization" mode to Level Zero PI plugin. It might result in unnecessary work done by the plugin, but it ensures the fastest possible execution on hot and reportable paths.
- Implemented group::get_linear_id(int) method.
- Ensured that the correct errc is thrown for an unassociated placeholder accessor.
- Removed dependency on OpenCL ICD Loader from the runtime.
- Added support for ZEBIN format to persistent caching mechanism.
- Added identification mechanism for binaries in the newer ZEBIN format.
- Switched to use struct information descriptors in accordance with SYCL 2020. Removed some deprecated information queries.
- Updated kernel_device_specific::max_sub_group_size query to match SYCL 2020 spec. Deprecated the old variant.
- Deprecated SYCL 1.2.1 device selectors.
- Improved error messages reported for unsupported device partitioning.
- Made device and platform default to default_selector_v.
- Deprecated address_space::constant_space.
- Marked sycl::exception::has_context as noexcept.
- Improved range reduction performance on CPU.
- Made sycl::exception nothrow copy constructible.
- Marked has_property methods as noexcept.
- Improved sycl::event::get_profiling_info exception message when event is default constructed.
- Added a diagnostic (in the form of static_assert) about kernel lambda size mismatch between host and device.
- Updated pipes class to throw exceptions if used on the host.
- Updated ESIMD Emulator PI plugin to report support for cl_khr_fp64 extension.
- Updated Level Zero plugin to prefer copy engine for memory read/write operations.
- Optimized some memory transfers.
- Enabled event caching in the Level Zero PI plugin.
- Optimized some reductions for parallel_for accepting sycl::range for discrete GPUs.
- Added ability to use descendent devices of context members within that context. Not supported with the OpenCL backend yet.
- Limited allowed argument types for rol/ror ESIMD functions to better represent HW capabilities.
- Implemented lazy mechanism of setting the context for default-constructed events.
- Improved performance for multi-dimensional accessors with multiple accesses in a kernel.
- Increased max _BitInt size to 4096 for FPGA target.
- Removed deprecation message for [[intel::disable_loop_pipelining]] attribute.
- Allowed __builtin_assume_aligned to be called from device code.
- Improved link step performance when per_kernel device code split is used.
- Added support for SYCL_EXTERNAL on device_global variables.
- Updated __builtin_intel_fpga_mem to accept more parameters.
- Updated ivdep attribute to allow safelen = 0.
- Improved linking with sycl.lib on Windows.
- Implemented more diagnostics for incorrect device_global usages.
- Improved library resolution for libsycl.so.
- Improved diagnostics when linking with mismatched objects.
- Added a warning for floating-point size changes after implicit conversions.
- Made invoke_simd convert its argument to appropriate types.
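The list above notes that __builtin_assume_aligned can now be called from device code. As a rough illustration of what the builtin does, here is a minimal host-side sketch, assuming a GCC- or Clang-compatible compiler; the helper name sum_aligned and the 64-byte alignment are hypothetical choices, not taken from these notes, and device-side usage is not shown.

```cpp
#include <cstddef>

// Hypothetical helper: sums n floats while promising the compiler that
// `data` is 64-byte aligned. The promise is the caller's responsibility;
// passing a pointer that is not 64-byte aligned is undefined behavior.
float sum_aligned(const float* data, std::size_t n) {
  // __builtin_assume_aligned returns its first argument and lets the
  // optimizer assume the returned pointer has the given alignment.
  const float* p =
      static_cast<const float*>(__builtin_assume_aligned(data, 64));
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    total += p[i];
  }
  return total;
}
```

A caller would typically pass storage declared with alignas(64), or memory obtained from an aligned allocator, so that the alignment promise actually holds.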
Bug Fixes
- Removed deprecated kernel::get_work_group_info.
- Removed deprecated get_native class method.
- Removed support for intel::fpga_pipeline attribute.
- Added MAJOR_VERSION to the name of the SYCL library on Windows.
- Removed sycl::program class.
- Removed ext::oneapi::reduction.
- Removed deprecated address_space enum values.
- Removed event::get method.
- Removed using namespace experimental inside ext::intel.
- Made intel-specific device info descriptors namespace-qualified.
- Removed deprecated make_queue API.
- Aligned return types of sycl::get_native and interop::get_native_mem functions to be in conformance with SYCL 2020 spec.
- Aligned sycl::buffer_allocator interface with SYCL 2020 spec.
- Removed cl namespace from sycl/sycl.hpp header.
- Dropped support for compiling SYCL in less than C++17 mode.
- Many other ABI-breaking changes resulting from internal refactoring.
- When compiling for FPGA, you can now use a system installed with Intel® FPGA PAC D5005 to compile a SYCL application that targets Intel® PAC with Intel® Arria® 10 FX FPGA.
- When compiling for FPGA emulator flow on Windows system, an issue leading to the failure to launch device kernels has been fixed.
- Fixed a compilation issue where it wasn't possible to pass an initializer list for dependency events vector in queue shortcuts with offset parameter.
- Fixed sycl::get_pointer_device throwing an exception when passed a descendent device (sub-device) instead of a root device.
- Fixed memory leak happening when kernel bundles are linked.
- Fixed USM free throwing an exception when passed a context created for a descendent device.
- Fixed a compilation issue when using multi-dimensional accessor's subscript operator.
- Fixed "definition with the same mangled name" error happening when using multiple buffer reductions in a kernel.
- Fixed a compilation issue with SYCL math built-ins when GCC < 11.1 is used as a host compiler.
- Fixed a compilation issue with SYCL math built-ins (such as sycl::modf, for example) not accepting pointers to half.
- Fixed an issue with reductions when MSVC is used as the host compiler.
- Fixed a compilation issue when fully specialized sycl::span is initialized from an array.
- Fixed a crash in Level Zero PI plugins caused by specialization constants not being used on the device side, but present in a program.
- Fixed event leak in the Level Zero plugin.
- Fixed an issue with sub-sub-devices in the Level Zero plugin.
- Fixed an issue with incorrect half conversion on ESIMD emulator.
- Fixed a compilation issue with abs ESIMD function.
- Fixed some warnings coming out of SYCL headers when compiled in C++20 mode.
- Fixed a compilation issue when using multiple bitwise shift operations in ESIMD.
- Fixed a crash in Level Zero PI plugin, which occurs when the runtime tries to reset a command list that does not have a synchronization fence associated with it.
- Fixed a compilation issue with sycl::get_native<sycl::backend::ext_oneapi_cuda>(sycl::device) free function (#6653).
- Fixed synchronization issue for explicit dependencies (depends_on usage) which is blocked by the host task or host accessor.
- Fixed an issue in the Level Zero plugin, which could cause barriers not to be correctly applied for an entire queue.
- Fixed accessor so gdb can parse its template parameters correctly.
- Fixed uses of common macro names in the implementation's header files.
- Fixed a performance regression related to the command list in the Level Zero backend.
- Fixed cleanup of temporary files produced by unbundling archives.
- Fixed optimizing out device_global variables with internal linkage.
- Fixed an issue when compiling and linking with different optimization levels that could cause runtime errors.
- Fixed description of -f[no-]sycl-unnamed-lambda compiler option.
- Fixed an issue when building SYCL programs in Debug mode with Windows-Clang.cmake.
- Fixed an issue causing incorrect conversions involving unsigned types in ESIMD.
- Fixed a crash in applications containing a mix of unnamed ESIMD and non-ESIMD kernels.
- Fixed an issue when op[] was called with a typedef argument under gdb.
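The list above notes that SYCL can no longer be compiled in less than C++17 mode. A project that wants to surface this early could add its own compile-time check, sketched below. The macro name PROJECT_CPLUSPLUS and the function built_as_cpp17_or_newer are hypothetical; _MSVC_LANG is used because MSVC-compatible compilers may leave __cplusplus at an older value unless told otherwise.

```cpp
// Pick the macro that reflects the active language standard.
// PROJECT_CPLUSPLUS is a hypothetical, project-local name.
#if defined(_MSVC_LANG)
#define PROJECT_CPLUSPLUS _MSVC_LANG
#else
#define PROJECT_CPLUSPLUS __cplusplus
#endif

// Fail the build with a readable message when the standard is too old.
static_assert(PROJECT_CPLUSPLUS >= 201703L,
              "This project must be compiled in C++17 mode or newer.");

// Hypothetical helper exposing the same condition to ordinary code.
constexpr bool built_as_cpp17_or_newer() {
  return PROJECT_CPLUSPLUS >= 201703L;
}
```

Placing such a check in a commonly included header reports the problem once, at the top of the build log, instead of as a series of unrelated header errors.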
Known Issues and Limitations
- [Fixed in 2023.2.0 release] When compiling with the following options, i.e. Ahead of Time (AOT), and the offload kernel contains print statements, the program will stop with a runtime failure.
-fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device xxx" -fopenmp-device-code-split=per_kernel
- Customers might see "fatal error: 'iostream' file not found" when trying to compile a simple program with Intel® oneAPI DPC++/C++ Compiler on a Linux* machine if matching GNU g++ package is not installed. For further details, please check: fatal error: <C++ header> file not found with Intel® oneAPI DPC++/C++ Compiler.
- This release is not backward compatible with previous releases, which means that existing SYCL applications won't work with the newer runtime without re-compilation.
- There is a potential for incorrect results using OpenMP pragmas to offload to Intel GPUs where a parallel loop nested inside a TEAM construct is using a variable in a REDUCTION clause and the TEAM construct does not have the same REDUCTION clause. To avoid incorrect results, compile with -mllvm -vpo-paropt-atomic-free-reduction-slm=true to disable global memory buffers.
- There is a known issue with using opt-reports with programs containing OpenMP loop constructs with "schedule(dynamic)", which may cause the compiler to emit an error. In this case, it is recommended that the user remove -qopt-report from their compilation.
- Intel® oneAPI DPC++ Compiler 2023.0.0 may not include all the latest functional and security updates. A new version of Intel® oneAPI DPC++/C++ Compiler is targeted to be released by March 2023 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
- If your design has nested loops and data is carried across the loops, you should run simulation to verify that the output is correct. In very rare circumstances, when you have nested loops and data is carried across the loops, the RTL generated by the compiler is functionally incorrect. If there are any errors in the simulation output, you might be affected by this issue. You can work around the issue by removing the loop nest, either by using the loop-coalesce attribute or by manually changing the code. This issue is scheduled to be fixed in a future version of oneAPI.
- If you use SUSE15 U3 and include the <complex.h> header, you might run into an error: "expanded from macro 'I'". The <complex.h> header defines the macro 'I' (https://en.cppreference.com/w/c/numeric/complex/I), but the identifier 'I' is widely used in SYCL headers. The issue appears on SUSE15 U3 but not on other operating systems because the provided C/C++ headers vary between operating systems.
- SYCL built-in group algorithms may produce wrong results on CPU or FPGA emulator devices if all of the following conditions are met:
- The work-group size on the highest dimension is larger than the sub-group size
- The group algorithm is applied to the work-group
- The group algorithm produces the same result for all work items in the work group (e.g. all_of_group, any_of_group, group_broadcast, reduce_over_group)
- The group algorithm is used in a loop, and the result may change due to input changes. For example, the following kernel code would produce wrong results (the while loop may not exit or acc[gid] may not be set for all work items due to the known issue):
cgh.parallel_for(
    sycl::nd_range<1>(8, 8),
    [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] {
      // work-group size > sub-group size
      bool predicate = true;
      int gid = item.get_global_id(0);
      while (sycl::all_of_group(item.get_group(), predicate)) {
        // applying all_of_group to the work-group
        // and all_of_group is expected to produce same result for all work-items in the group
        // and is used inside a loop
        acc[gid] = 1;
        // the result of all_of_group would change on the second loop iteration because predicate is changing
        predicate = false;
      }
    });
The workaround is to set the work-group size equal to the sub-group size.
- SYCL 2020 barriers show worse performance than SYCL 1.2.1 do.
- Using fallback assert in a separate compilation flow requires explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.
- Limit alignment of allocation requests at 64KB, which is the only alignment supported by Level Zero.
- In the following scenario on the Level Zero backend:
1. Kernel A, which uses buffer A, is submitted to queue A.
2. Kernel B, which uses buffer B, is submitted to queue B.
3. queueA.wait().
4. queueB.wait().
The DPC++ runtime used to treat unmap/write commands for buffer A/B as host dependencies (i.e., they were waited for before enqueueing any command that's dependent on them). This allowed the Level Zero plugin to detect that each queue is idle on steps 1/2 and submit the command list immediately. This is no longer the case since we started passing these dependencies in an event waitlist, and the Level Zero plugin attempts to batch these commands, so the execution of kernel B starts only on step 4. The workaround restores the old behavior in this case until this is resolved.
- User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e., an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
- A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with the context and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open up the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
- The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
- Using sycl::program/sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
- Linkage errors with the following message:
error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined
can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and the user specifies -std=c++14 or /std:c++14.
- Printing internal defines is not supported on Windows.
- The usage of new -ax (auto cpu dispatch) is not currently supported when building libraries with -fpic option.
- The /Fo<file or dir/> flag no longer accepts directory arguments. Using this flag with a directory argument results in an error message: clang-offload-bundler command failed with exit code 1. A fix is not available in this release.
- Having MESA OpenCL implementation, which provides no devices on a system, may cause incorrect device discovery. As a workaround, such an OpenCL implementation can be disabled by removing /etc/OpenCL/vendor/mesa.icd.
- Compilation may fail on Windows in debug mode if a kernel uses std::array. This happens because the debug version of std::array in Microsoft STL C++ headers calls functions that are illegal for the device code. As a workaround, the following can be done:
- Dump compiler pipeline execution strings by passing the -### option to the compiler. The compiler will print the internal execution strings of compilation tools. The actual compilation will not happen.
- Modify the (usually) first execution string (it should have the -fsycl-is-device option) by adding the -D_CONTAINER_DEBUG_LEVEL=0 -D_ITERATOR_DEBUG_LEVEL=0 options to the end of the string. Execute all strings one by one.
- The -fsycl-dead-args-optimization option cannot eliminate the offset of the accessor even though it is created with no offset specified.
- The compile times can be significant when compiling for FPGA and using a read-only accessor for a very wide struct. As a workaround, use a read-write accessor instead to address long compile times.
- When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
icpx -fsycl -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
icpx -fsycl -fintelfpga <other arguments> -Xshardware kernel.o
- When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
- The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.
- On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
- When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.
- When compiling for FPGA, the compiler might produce a different intermediate representation (IR) on Windows than on Linux. Misaligned structs cause this issue. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows. As a workaround, force an alignment on the misaligned structs, as shown in the following example:
// Code with misaligned struct
struct Item {
  bool valid;
  int value1;
  unsigned char value2;
};
// Forced alignment of the struct
struct Item {
  bool valid;
  bool __empty__[3];
  int value1;
  unsigned char value2;
  unsigned char __empty2__[3];
};
- The FPGA emulator does not recognize different Avalon interfaces when defining a host pipe. This can lead to unexpected behavior when specifying the Avalon interface type. There is no known workaround for this issue.
- When compiling for FPGA and trying to reduce the II of the II-critical path, the scheduler may return an incorrect II-critical path. This means the compiler reduces the II of the wrong path, and the II goal is not achieved. You might observe this issue only when there are multiple negative cycles in the LSU's critical path. There is no known workaround for this issue. However, your design’s functionality stays unaffected. Performance (QoR) might get degraded slightly.
- When simulating FPGA designs, a design with a host channel might produce two signal mismatch errors, dataBitsPerSymbol and firstSymbolInHighOrderBits:
- The dataBitsPerSymbol error can occur in the FPGA IP authoring flow when you specify a dataBitsPerSymbol value that is not equal to 8. As a workaround, set dataBitsPerSymbol to 8.
- The firstSymbolInHighOrderBits error can occur in the FPGA IP authoring flow when you set firstSymbolInHighOrderBits to false. As a workaround, set firstSymbolInHighOrderBits to true.
- With the FPGA IP Authoring flow, you can intuitively integrate your design into the Platform Designer by copying the generated .prj folder into your Intel® Quartus® Prime project directory. The Platform Designer detects the project automatically. However, there is a known issue with the generated hw.tcl file, which is not mapping the signals correctly. To work around this issue, follow these steps on both Linux and Windows systems:
- Add python to your PATH environment variable to run python from your command line.
- Execute the following commands to run the <kernel-name>_di_hw_tcl_adjustment_script.py python script generated in your .prj directory before integrating your IP authoring kernel into the Platform Designer:
$ cd <kernel_name>.prj
$ python <kernel-name>_di_hw_tcl_adjustment_script.py
- When compiling an FPGA kernel that calls the sycl::ext::oneapi::experimental::printf() function, the compiler issues the following warning message:
compiler warning: argument 'llvm_fpga_printf_buffer_start' on component '<your kernel name>' is never used by the component. Note that the compiler may optimize it away.
There is no known workaround for this issue. However, you can ignore this warning since it does not impact the kernel’s functionality.
- When compiling for FPGA, if your SYCL code contains the std::popcount function inside a fixed-size loop (bit-widths not in 8, 16, 32, or 64), it gets mapped directly into llvm.ctpop, and the compilation fails with an error message. There is no known workaround for this issue. However, Intel recommends avoiding the use of the std::popcount function inside loops.
- On Windows systems, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display a "GPU process launch failed" error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool application from the mapped network drive to your local computer and run it locally.
- The Intel FPGA IP authoring encryption flow is not fully supported on Windows systems.
- In the Intel FPGA IP authoring flow, the fpga_tools::UnrolledLoop utility defined in the unrolled_loop.hpp code sample header file does not support the kernel argument interface macros (mmhost, conduit_mmhost, and register_map_mmhost). For example:
fpga_tools::UnrolledLoop<ROWS>([&](auto row) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
});
As a workaround, use the #pragma unroll before a for loop, as shown in the following example:
#pragma unroll
for (int row = 0; row < ROWS; row++) {
  #pragma unroll
  for (int i = COLS - 1; i > 0; i--) {
    shift_reg[row][i] = shift_reg[row][i - 1];
  }
  shift_reg[row][0] = MA[col * ROWS + row];
}
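For the forced-alignment workaround for misaligned structs described in the known issues above, it can help to pin the intended layout with compile-time checks so that both host compilers agree on it. The following is a minimal sketch under stated assumptions: the name PaddedItem and the padding member names pad0 and pad1 are hypothetical (the notes use Item, __empty__, and __empty2__), and the checks assume sizeof(bool) == 1 and sizeof(int) == 4, which holds on common Linux and Windows x86-64 toolchains but is not guaranteed by the C++ standard.

```cpp
#include <cstddef>

// Explicitly padded variant of the struct from the workaround.
// Names are hypothetical; only the member order and sizes matter here.
struct PaddedItem {
  bool valid;
  bool pad0[3];           // fills the gap before the 4-byte int
  int value1;
  unsigned char value2;
  unsigned char pad1[3];  // rounds the struct up to a multiple of 4 bytes
};

// Assumption: 1-byte bool and 4-byte int. With explicit padding, the
// compiler has no implicit gaps left to place differently.
static_assert(offsetof(PaddedItem, value1) == 4,
              "value1 is expected at byte offset 4");
static_assert(offsetof(PaddedItem, value2) == 8,
              "value2 is expected at byte offset 8");
static_assert(sizeof(PaddedItem) == 12,
              "PaddedItem is expected to occupy 12 bytes");

// Hypothetical helper so ordinary code can query the checked size.
constexpr std::size_t padded_item_size() { return sizeof(PaddedItem); }
```

If any of these assertions fails on one platform, the build stops there, which is usually easier to diagnose than an unexpected II difference in the FPGA reports.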
System Requirements
Additional Documentation
- Get Started with the Intel® oneAPI Toolkits for Linux*
- Get Started with the Intel® oneAPI Toolkits for Windows*
- OneAPI Versioning Schema based on Semantic Versioning
- Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Intel® oneAPI Programming Guide
- SYCL* 2020 Specification Features and DPC++ Language Extensions Supported
- OpenMP* Features and Extensions Supported in Intel® oneAPI DPC++/C++ Compiler
Previous oneAPI Releases
Notices and Disclaimers
Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.
Intel technologies may require enabled hardware, software, or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.