FPGA Support Package for the Intel® oneAPI DCP++/C++ Compiler 2024 Release Notes

ID 831544
Updated 7/31/2024
Version 2024
Public

author-image

By

Where To Find the FPGA Support Package

The FPGA Support Package for the Intel oneAPI DPC++/C++ Compiler requires the Intel® oneAPI DPC++/C++ Compiler as provided by the Intel® oneAPI Base Toolkit (Base Kit). Visit the FPGA Support Package for Intel® oneAPI DPC++/C++ Compiler website to download the toolkit and the FPGA support package.

Both the base toolkit and the FPGA support package are required for FPGA design flows. 

Supported Hardware and Operating System

See Intel® oneAPI DPC++/C++ Compiler System Requirements.

FPGA Support Package Release Notes

2024.2.1 Bug Fixes

The FPGA Support Package for the Intel® oneAPI DPC++/C++ Compiler 2024.2.1 fixes the following issues:

  • Fixed the known issue where the compiler might crash when you use any variant of the ivdep annotation (for example, [[intel::ivdep]], [[intel::ivdep(array)]], [[intel::ivdep(safelen)]], or [[intel::ivdep(array, safelen)]]) on a loop.
  • Fixed the known issue where code that contains aggregate struct data types (such as fpga_tools::NTuple) can sometimes cause the compiler to crash.

These issues were also fixed earlier in a separate patch available at https://www.intel.com/content/www/us/en/support/programmable/articles/000099101.html.

2024.2.1 Known Issues and Limitations

Refer to the Known Issues and Limitations list for oneAPI 2024.2.

2024.2 New and Changed Features

  • Using the Intel® oneAPI DPC++/C++ Compiler to compile for FPGA targets now requires you download and install an additional software package: FPGA Support Package for the Intel® oneAPI DPC++/C++ Compiler (often referred to as "FPGA Support Package"). The compiler, as provided with the Intel® oneAPI Base Toolkit, does not support FPGA compilation.
  • You can now apply the [[intel::fpga_register]] attribute to device_globals variables.
  • You can now use the fpga_datapath<> template class for local (kernel) scoped memories.
  • USM allocation functions can now read the buffer_location from the annotated_ptr defined in the kernel code.
  • Support for Quartus® Prime software versions 21.4 and earlier is deprecated and will be removed in a future release. Migrate your designs to use Quartus® Prime software versions 22.3 or later.
  • When compiling for FPGA devices, the -fsycl compiler command option is no longer required. The option is included when you specify the -fintelfpga compiler command option.
  • If you use Quartus® Prime software versions 21.2 or earlier, the FPGA simulation flow is not supported. If you need the simulation flow, use Quartus® Prime software versions 22.3 or later.
  • Added the -Xsallow-wide-device-globals compiler option to allow wider internal word widths when compiling your kernel.
  • The fpga_crossgen command has the following changes:
    • The --emulation_model command option is removed. Use the --cpp_model command option instead.
    • The --source command option is deprecated and will be removed in a future release.
    • The --target command option is deprecated and will be removed in a future release.
  • The --target option of the fpga_libtool command is deprecated and will be removed in a future release.

2024.2 Bug Fixes

  • The <Kernel>_REGISTER_MAP_ARG_<arg>_REG macro provided in <Kernel>_register_map.h. now correctly provides the byte address of the argument in the CSR. Previously, the address given was the byte address of the register containing the argument (that is, its byte address rounded down to the word).

2024.2 Known Issues and Limitations

  • On Red Hat Linux Enterprise Linux systems, you must install the libnsl library before you run the compiler. To install the libnsl library, run the sudo yum install libnsl command.

  • The -Xsallow-wide-device-globals compiler option is not supported in emulation.

  • On Microsoft* Windows* systems, using the %ld format specifier with the printf function generates show incorrect results. As a workaround, use the %lld format specifier.

  • When targeting the BSP for the Intel® FPGA SmartNIC N6000-PL Platform (formerly code-named Arrow Creek), the following issues apply:

    • Compiling using -Xsfast-compile compiler option fails. Do not use the -Xsfast-compile compiler option when compiling with the N6000-PL Platform BSP.

    • When compiling a N6000-PL Platform FPGA hardware image, the Quartus® Summary section of the FPGA Optimization Report is not populated. You can view the Quartus® compilation results directly in Quartus® Prime software.

    • The compiler does not automatically detect and warn about timing violations occurring when compiling with BSPs or custom platforms based on the Open FPGA Stack.  If you target one of these platforms,  validate that your timing passed before executing the compiled design on hardware. When targetting the N6000 PL Platform BSP, you can validate your timing by reviewing the clock report in fim_platform/build/syn/board/n6001/syn_top/output_files/timing_report subdirectory inside of your compiler output directory (that is, your .prj folder).

  • An FPGA hardware compilation that targets the default Stratix® 10 device (-Xstarget=Stratix10) might fail with the following error message:

    Error (22730): RAM Primitive "foo_di_inst|DotProductIP_std_ic_inst|DotProductIP_inst_0|kernel|theDotProductIP_function|thebb_DotProductIP_B1|<other names>|ram_block2a63" parameter operation_mode value QUAD_PORT is no longer supported for the target device. File: <filename>/altera_syncram_impl_klrp.tdf Line: 37

    You can use one of the following workarounds for this issue:

    • Specify a specific Stratix* 10 OPN with the -Xsdevice= compiler command option. Do not use the 1SG280LU3F50I2VG OPN.

    • If you must use the 1SG280LU3F50I2VG Stratix® 10 OPN, run the following commands when compiling:

      echo "skip_nd_sqp_power_temp_check=on" > /tmp/file.ini
      icpx -Xsadd-ini=/tmp/file.ini <other compiler command options>
  • If your design uses ac_int variables larger than 256 bits, your host program might crash at run time. As a workaround, compile your design with the -O0 compiler option.

  • Writing an argument to a CRA interface one cycle after the start signal is sent incorrectly causes the arguments after the start signal to be used.

  • In some cases, applying the [[intel::fpga_register]] attribute to a variable can cause the compiler to crash with the following error message:

    Assertion `isAllocaPromotable(AI) && "Cannot promote non-promotable alloca!"' failed

    In this case, remove the [[intel::fpga_register]] attribute from the variable. If you have applied the [[intel::fpga_register]] attribute to multiple variables, you need to experiment to determine the problematic variable. The compiler crash message does not indicate which variable is problematic.

  • If your code contains an assert macro that contains error message text, you might experience a compiler crash that includes an Error: Optimizer FAILED message as part of the crash message. As a workaround, avoid using text strings in your assert macros.

  • If your code reads a host-access device_global variable but never uses the result, your program will crash with an Error: Verilog generator FAILED error message as part of the crash message. To prevent this error, ensure that your code uses the result of the host-access device_global variable read elsewhere in the program.

  • This issue is fixed in oneAPI 2024.2.1:
    Code that contains aggregate struct data types (such as fpga_tools::NTuple) can sometimes cause the compiler to crash. The compiler gives you an error that contains the following text:

    Use still stuck around after Def is destroyed

    If you encounter such a crash, try the following techniques:

    • Instead of using wrapper types for memory operations, try using native implementations. For example, instead of using fpga_tools::NTuple try using std::array.

    • If using a native types fails, try splitting up aggregate types manually in source code. For example, if you have a struct whose member variables are aggregate types (such as arrays or structs), try splitting those member variables up so that they are on their own.

            For a fix for this issue, refer to https://www.intel.com/content/www/us/en/support/programmable/articles/000099101.html.

  • When a USM pointer is allocated but not used as a kernel argument, it is legal to free it without waiting for kernel finish. However, currently, freeing such pointers without waiting for kernel completion results in a segmentation fault error with the following backtrace:

    #0  0x0000xxxxxxxxxxxx in clMemBlockingFreeINTEL () from libalteracl.so
    #1  0x0000xxxxxxxxxxxx in urUSMFree () from libpi_opencl.so
    #2  0x0000xxxxxxxxxxxx in piextUSMFree () from libpi_opencl.so
    #3  0x0000xxxxxxxxxxxx in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)97, _pi_context*, void*>(_pi_context*, void*) const () from libsycl.so.7
    #4  0x0000xxxxxxxxxxxx in sycl::_V1::detail::usm::free(void*, sycl::_V1::context const&, sycl::_V1::detail::code_location const&) () from libsycl.so.7
    #5  0x0000xxxxxxxxxxxx in sycl::_V1::free(void*, sycl::_V1::queue const&, sycl::_V1::detail::code_location const&) () from libsycl.so.7
    #6  0x0000xxxxxxxxxxxx in main () at xxx.cpp:xx

    To work around this issue, always wait for kernel completion by calling <kernel_event>.wait() before freeing any USM pointers, whether they are used in the kernel or not.

  • In the FPGA acceleration flow, if you use host pipes, you might receive the following warning message from the compiler JIT engine:

    libunwind: __unw_add_dynamic_fde: bad fde: FDE is really a CIE

    This message can be safely ignored. This spurious warning is planned to be addressed in a future release.

  • This issue is fixed in oneAPI 2024.2.1:
    In some rare cases, the compiler might crash when you use any variant of the ivdep annotation (for example, [[intel::ivdep]], [[intel::ivdep(array)]], [[intel::ivdep(safelen)]], or [[intel::ivdep(array, safelen)]]) on a loop. A back trace of this crash contains the following line:

    llvm::AccessGroupInfo::hasIndependentAGsOrSafelen(llvm::Loop*, llvm::Instruction*, llvm::Instruction*, int&) + 195

    For a fix for this issue, refer to https://www.intel.com/content/www/us/en/support/programmable/articles/000099101.html.

  • If your design uses sycl::ext::intel::prototype::pipe, you might receive the following compiler error even the pipe is specified with a protocol that includes a ready signal:

    compiler error: invalid pipe property on pipe. host-to-device <name> cannot omit a ready signal.

    To avoid this error, convert your use of the protype pipes to the equivalent sycl::ext::intel::experimental:pipe.

  • The IO pipe classes sycl::ext::intel::kernel_readable_io_pipe and sycl::ext::intel::kernel_writable_io_pipe are not compatible with the pipe properties defined in the sycl::ext::intel::experimental namespace. This is planned to be addressed in a future release.

  • The SYCL* ext::oneapi::experimental::printf class is subject to the following limitations:

    • In the Windows emulator, output might be reordered. If you see different order of output between printing to the console and redirecting output to a file, recompile your program with the -O0 compiler option.

      For example, the following code generates different output order:

      #include <sycl/sycl.hpp>
      #include <sycl/ext/intel/fpga_extensions.hpp>
      
      #ifdef __SYCL_DEVICE_ONLY__
      #define CL_CONSTANT __attribute__((opencl_constant))
      #else
      #define CL_CONSTANT
      #endif
      
      using namespace sycl;
      #define PRINTF(format, ...)                          \
        {                                                  \
         static const CL_CONSTANT char _format[] = format; \
         ext::oneapi::experimental::printf(_format,##__VA_ARGS__); \
        }
      
      class BasicKernel;
      
      int main(int argc, char* argv[]) {
        queue q;
          q.submit([&](handler& h) {
             h.single_task<BasicKernel>([=]() {
               PRINTF("Result1: Hello, World!\n");
               PRINTF("Result2: %%\n");
             });
           }).wait();
        return 0;
      }

      On Windows, this program prints the following output to the console:

      Hello, World!
      Result2: %

      If you redirect the output to a file, the program creates the following results:

      Result2: %
      Hello, World!
    • Calling ext::oneapi::experimental::printf with a float, char or short value might not print the correct value. To work around this issue, compile your program with the  ‑D__SYCL_USE_VARIADIC_SPIRV_OCL_PRINTF__ compiler option.

    • On Microsoft* Windows* systems, using the %ld and %lu format specifiers with the ext::oneapi::experimental::printf function with a long or unsigned long value may show incorrect results because long and unsigned long values are 32-bit values on Windows* systems. As a workaround, when printing long and unsigned long values, use the %d or %u format specifier, or cast the value to int64_t or uint64_t.

  • The FPGA runtime can hang when multiple invocations of the same kernels are enqueued with explicit event dependences between them. As a workaround, remove the explicit event dependences. This workaround is safe for FPGA devices, but generally is not safe for CPU/GPU compiler targets. 

  • If your design uses macros defined in #include <sycl/ext/intel/prototype/interfaces.hpp> and has a device_global variable with the [[intel::fpga_register]] attribute applied, you can experience a crash with the following error shown:

    aocl-llvm-spirv: ../../../source/acl/llvm-project/llvm/lib/IR/Constants.cpp:1256: static llvm::Constant* llvm::ConstantArray::getImpl(llvm::ArrayType*, llvm::ArrayRef<llvm::Constant*>): Assertion `C->getType() == Ty->getElementType() && "Wrong type in array element initializer"' failed.

    The macros in #include <sycl/ext/intel/prototype/interfaces.hpp> are deprecated as follows:

    • Instead of the mmhost macro, use the annotated_arg class.

    • The register_map_interface macro is not required as a memory-mapped interface is now the default.

    • Instead of the streaming_interface macro, use the streaming_interface kernel property.

  • On Windows, if you link a static library (.a file) containing your main function produced via the -fsycl-link=image flag you might see linker errors such as the following errors:
    • error LNK2001: unresolved external symbol __start_omp_offloading_entries
      error LNK2001: unresolved external symbol __stop_omp_offloading_entries
    • error LNK1561: entry point must be defined
      
    • As a workaround split your source code across multiple files so that the translation unit you compile with -fsycl-link=image does not contain your main function.
  • Programs that use the fpga_datapath template class may experience the following crash:
    aocl-opt: ../../../source/acl/llvm-project/llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp:840: void {anonymous}::PromoteMem2Reg::run(): Assertion `isAllocaPromotable(AI) && "Cannot promote non-promotable alloca!"' failed
    This crash typically occurs when you user read from an external memory and stores the result directly to an internal variable marked with the fpga_datapath template class. For example:
    sycl::ext::intel::experimental::fpga_datapath<int[1]> a_val;
    a_val[0] = vec_a_in[idx];
    A potential workaround for this is to load the value from external memory into a temporary variable first before storing to the fpga_datapath variable. Alternatively, use the [[intel::fpga_register]] attribute for the same performance and functionality without having to create a temporary storage variable.
    Workaround:
    sycl::ext::intel::experimental::fpga_datapath<int[1]> a_val;
    int temp = vec_a_in[idx];
    a_val[0] = temp
  • Designs that access internal memory in a series of nested loops might experience inefficient memory accesses (that is, stallable loads and stores) regardless of the memory attributes specified to set the memory bank configuration. A potential workaround for this is to transpose the memory system so that the lowest dimension is accessed in parallel.

    The following example experience inefficient memory accesses:

    [[intel::fpga_memory("BLOCK_RAM")]]  // memory
    unsigned int line_buffer[8][COLS];
    ...
    for (int num_col = 0; num_col < COLS ; num_col++) {
        fpga_tools::UnrolledLoop<0, 4>([&](auto l) {  // loop
            line_buffer[l][num_col] = line_buffer[l + 1][num_col];
        });
        line_buffer[4][num_col] = pixel_a_traiter;
        fpga_tools::UnrolledLoop<0, 5>([&](auto li) {
            fpga_tools::UnrolledLoop<0, 4>([&](auto co) {  // loop
                fenetre[li][4] = line_buffer[li][num_col];
            });
        });
    }

    Implement the workaround changes the code into the following example:

    [[intel::fpga_memory("BLOCK_RAM")]]  // memory
    unsigned int line_buffer[8][COLS];
    ...
    for (int num_col = 0; num_col < COLS ; num_col++) {
        fpga_tools::UnrolledLoop<0, 4>([&](auto l) {  // loop
            line_buffer[num_col][1] = line_buffer[l + 1][num_col];
        });
        line_buffer[4][num_col] = pixel_a_traiter;
        fpga_tools::UnrolledLoop<0, 5>([&](auto li) {
            fpga_tools::UnrolledLoop<0, 4>([&](auto co) {  // loop
                fenetre[li][4] = line_buffer[li][num_col];
            });
        });
    }

     

  • For FPGA devices, you might run into performance issues when using switch statements instead of if statements. If the cases of the switch statement access external memories at different buffer_locations then the compiler might not be able to resolve the address space to the loads and stores, which results in the creation of extra loads and stores to dynamically resolve the address space at run time.

    If you encounter these issues, use if statements instead of switch statements.

  • The atomic_ref class is not supported for FPGA devices.

  • In the FPGA SYCL* HLS flow, the compiler might generate a wider than requested address bus for the Avalon MM Host interfaces when the ring interconnect is used to connect the LSUs. You can ignore the extra MSBs on the bus by leaving them unconnected.

  • Designs with host pipe reads and writes in an unrolled loop cause a compiler error message that contains text similar to the following text:

    …pipe 'acl_c_MyID_pipe_channel' must be accessed from both endpoints…

    If you receive this error message, unroll the loop manually to resolve this error.
    When compiling for emulation, you might not receive an error message for this issue.

  • When you use the -fsycl-device-code-split=per_kernel compiler command option for a design that launches and collects multiple kernels, the first kernel that is returned provides correct results. However, subsequent kernels may intermittently return incorrect results. 

  • In the FPGA SYCL* HLS flow, when you compile to hardware with Quartus® Prime Version 22.2 or earlier and your design has more than 32 kernels per device image, you will hit a failure in qsys-generate.
    In later versions of Quartus® Prime software, this limit is increased to 2048 kernels per device image.

  • For FPGA pipelined kernels in simulation, the reported II may not reflect the lowest II achievable by the hardware because the runtime cannot feed data to the simulator fast enough. One possible workaround, which allows lower II to be achieved, is to use pipelined kernels with streaming arguments only. If wall clock time is not a restriction, using the ‑Xsghdl=0 compiler command option should slow down the simulator sufficiently for the runtime to feed it data at the lowest achievable II.

  • When applying memory attributes such as the [[intel::fpga_register]] attribute to member variables of structs, you might get an error message similar to the following example error message:

    Unable to implement variable "var" in registers as requested since the compiler was not able to break up the struct in which this member variable with the register attribute resides.

    You can avoid this error message in the following ways:

    • Break up the struct or move the member variables that you want to apply the attribute to outside of the struct definition.

    • If you want all member variables of the struct implemented in registers, apply the [[intel::fpga_register]] attribute to the instantiation of the struct instead of the struct definition.

    • Avoid default initializing struct or class member variable arrays (for example, int arr[10] = {}). Instead, manually initialize the array in a struct or class member variable (in internal memory) that is initialized., and always access the array with square bracket notation (that is, use arr[i] instead of *(arr + i)).

  • Converting an ap_float number to an ac_fixed data type in SYCL device code in the form of ApFloatT x = (AcFixedT) y; may produce incorrect results in the FPGA emulation flow. This type of conversion works correctly in FPGA simulation and hardware compilation flows. 

  • A DPC++ system that has FPGAs installed does not support multiprocess execution.

  • A DPC++ program that runs kernels on one or more FPGA devices does not support multithreaded execution.

    This lack of support can be particularly problematic when you create host code to test a streaming kernel (that is, a kernel that continually reads input from a pipe, does some computation, and writes output to another pipe).

    The typical way to express such a testbench for such a streaming kernel is to use one thread to write to the kernel input pipe while another thread reads from the kernel output pipe. However, such multithreaded execution of the host program is not supported.

    If you use the same thread to write to the kernel input pipe and read the kernel output pipe, your SYCL program might hang if the capacity of the pipes is exceeded (for example, if you writer more data to the kernel input pipe than the pipe capacity). You can avoid such a hang without needing to increase pipe capacity by applying the following idiom:

    struct my_kernel {
      void operator()() const {
        while(1) {
          auto in = in_pipe::read();
    
          out_pipe::write(...);
        }
      }
    };
    
      // Host code
      int in_count= 0;
      int out_count = 0;
      q.single_task(my_kernel{});
      while (out_count < N) {
        bool success;
        if (in_count < N) {
          in_pipe::write(q, 1, success);
          in_count += success;
        }
        out_pipe::read(q, success);
        out_count += success;
      }
    

     

  • For Stratix® 10 FPGA reference boards, a rare failure can occur when initializing internal memory where the memory is initialized into an unknown state that can cause unexpected behavior.

    As a workaround, compile your design with the -Xsbsp-flow=flat compiler option to avoid this issue.

  • For large FPGA simulations, such as those that target Agilex™ 7 boards, you might receive a linker error that contains a PC-relative offset overflow message.

    If you receive this message, compile your simulation with the -fsycl-link-huge-device-code compiler command option.

  • For FPGA devices, channel widths are limited to 4096 bits in Quartus® Prime Version 21.4 and earlier. If you exceed the channel width limitation, you receive an error like the following message:
     

    <name>_pipe_channel_read: dataBitsPerSymbol 5120 is out of range: 1-4096

    To avoid this error, ensure that you use Quartus® Prime Version 22.1 or later.

  • In the FPGA Optimization Report, designs with multiple lambda kernels will report inaccurate results unless the lambda kernels are all given unique names.
    For information about how to name lambda kernels, refer to "Suggested Coding Styles" in the Intel oneAPI FPGA Handbook.
    For lambda kernels generated in a loop, use templated classes to give the kernels procedurally generated names.

  • For ap_float data types, the ‑fp-model=fast compiler command option does not enable dot product inference. There is currently no workaround for this issue.

  • For FPGA, counting the leading zeros of an unsigned native integer type using a loop like in the following example can lead to a compiler error such as Compiler Error: undefined reference to 'llvm.ctlz.iN'

    unsigned int leading_zeros = 0;
    while (number) {
      leading_zeros += 1;
      number >>= 1;
    }

    You can workaround this issue by using the built-in function to count the leading zeros: __builtin_clz(unsigned) or __builtin_clzll(unsigned long long). When counting the leading zeros of unsigned char or unsigned short using the built-in functions, deduct the number of bits extended during type conversion from the return.

  • On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error: 

    dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
    NMAKE : fatal error U1077: ‘…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'

    As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).

  • When compiling for FPGA, the compiler might pack structs differently on Windows than on Linux. This difference can result in structs with members that might not be well-aligned for optimal memory accesses. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows.

    As a workaround, force an alignment on the misaligned structs, as shown in the following example: 

    //Code with misaligned struct
    struct Item {
      bool valid;
      int value1;
      unsigned char value2;
    };
    
    //Forced alignment of struct
    struct Item {
      bool valid;
      bool __empty__[3];
      int value1;
      unsigned char value2;
      unsigned char __empty2__[3];
    }
  • Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, copy the FPGA optimization reports directory to a local file system and launch it using a supported browser. 

  • You might encounter functional failures in the FPGA emulation flow when resetting a device_global and a new device_image is loaded without the device_image scope property. Currently, there is no known workaround for this issue.

Intel® oneAPI DPC++/C++ Compiler Known Issues

For Intel® oneAPI DPC++/C++ Compiler related issues, refer to Intel® oneAPI DPC++/C++ Compiler Release Notes.

Code Samples

Download the oneAPI samples for FPGAs available on GitHub at oneAPI Samples for FPGA.

Notices and Disclaimers

Intel® technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Currently, characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.