Visible to Intel only — GUID: GUID-5F31E5C3-7C85-4AED-8840-5D71347E6362
Visible to Intel only — GUID: GUID-5F31E5C3-7C85-4AED-8840-5D71347E6362
Just-In-Time Compilation
The Intel® oneAPI SYCL Compiler converts a SYCL program into an intermediate language called SPIR-V and stores that in the binary produced by the compilation process. The advantage of producing this intermediate file instead of the binary is that this code can be run on any hardware platform by translating the SPIR-V code into the assembly code of the platform at runtime. This process of translating the intermediate code present in the binary is called JIT compilation (Just-In-Time compilation). JIT compilation can happen on demand at runtime. There are multiple ways in which this JIT compilation can be controlled. By default, all the SPIR-V code present in the binary is translated upfront at the beginning of the execution of the first offloaded kernel.
#include <CL/sycl.hpp> #include <array> #include <chrono> #include <iostream> // Array type and data size for this example. constexpr size_t array_size = (1 << 16); typedef std::array<int, array_size> IntArray; void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b, IntArray &sum) { sycl::range num_items{a.size()}; sycl::buffer a_buf(a); sycl::buffer b_buf(b); sycl::buffer sum_buf(sum.data(), num_items); auto e = q.submit([&](auto &h) { // Input accessors sycl::accessor a_acc(a_buf, h, sycl::read_only); sycl::accessor b_acc(b_buf, h, sycl::read_only); // Output accessor sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init); h.parallel_for(num_items, [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; }); }); q.wait(); } void VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b, IntArray &sum) { sycl::range num_items{a.size()}; sycl::buffer a_buf(a); sycl::buffer b_buf(b); sycl::buffer sum_buf(sum.data(), num_items); auto e = q.submit([&](auto &h) { // Input accessors sycl::accessor a_acc(a_buf, h, sycl::read_only); sycl::accessor b_acc(b_buf, h, sycl::read_only); // Output accessor sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init); h.parallel_for(num_items, [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; }); }); q.wait(); } void InitializeArray(IntArray &a) { for (size_t i = 0; i < a.size(); i++) a[i] = i; } int main() { IntArray a, b, sum; InitializeArray(a); InitializeArray(b); sycl::queue q(sycl::default_selector_v, sycl::property::queue::enable_profiling{}); std::cout << "Running on device: " << q.get_device().get_info<sycl::info::device::name>() << "\n"; std::cout << "Vector size: " << a.size() << "\n"; auto start = std::chrono::steady_clock::now(); VectorAdd1(q, a, b, sum); auto end = std::chrono::steady_clock::now(); std::cout << "Initial Vector add1 successfully completed on device - took " << (end - start).count() << " nano-secs\n"; start = std::chrono::steady_clock::now(); VectorAdd1(q, a, b, sum); end = std::chrono::steady_clock::now(); std::cout << "Second Vector add1 successfully completed on device - took " << (end - start).count() << " nano-secs\n"; start = std::chrono::steady_clock::now(); VectorAdd2(q, a, b, sum); end = std::chrono::steady_clock::now(); std::cout << "Initial Vector add2 successfully completed on device - took " << (end - start).count() << " nano-secs\n"; start = std::chrono::steady_clock::now(); VectorAdd2(q, a, b, sum); end = std::chrono::steady_clock::now(); std::cout << "Second Vector add2 successfully completed on device - took " << (end - start).count() << " nano-secs\n"; return 0; }
When the program above is compiled using the command below (assuming that the name of the source file is example.cpp):
icpx -fsycl -O3 -o example example.cpp
and run, the output generated will show that the first call to VectorAdd1 takes much longer than the calls to other kernels in the program due to the cost of JIT compilation, which gets invoked when vectorAdd1 is executed for the first time.
If the application contains multiple kernels, one can force eager JIT compilation or lazy JIT compilation using compile-time switches. Eager JIT compilation will invoke JIT compilation on all the kernels in the binary at the beginning of execution, while lazy JIT compilation will enable JIT complication only when the kernel is actually called during execution. In situations where certain kernels are not called, this has the advantage of not translating code that is never actually executed, which avoids unnecessary JIT compilation. This mode can be enabled during compilation using the following option:
-fsycl-device-code-split=<value>
where <value> is
per_kernel: generates code to do JIT compilation of a kernel only when it is called
per_source: generates code to do JIT compilation of all kernels in the source file when any of the kernels in the source file are called
off: performs eager JIT compilation of all kernels in the application
auto: the default, the compiler will use its heuristic to select the best way of splitting device code for JIT compilation
If the above program is compiled with this option:
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_kernel
and run, then from the timings of the kernel executions it can be seen that the first invocations of VectorAdd1 and VectorAdd2 take longer, while the second invocations will take less time because they do not pay the cost of JIT compilation.
In the example above, we can put VectorAdd1 and VectorAdd2 in separate files and compile them with and without the per_source option to see the impact on the execution times of the kernels. When compiled with
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_source
and run, the execution times of the kernels will show that the JIT compilation cost is paid at the first kernel invocation, while the subsequent kernel invocations do not pay the JIT compilation cost. But when the program is compiled with
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp
and run, the execution times of the kernels will show that the JIT compilation cost is paid upfront at the first invocation of the kernel, and all subsequent kernels do not pay the cost of JIT compilation.