Advanced OpenMP* Device Offload with Intel® Compilers

October 30, 2024

OpenMP Offload offers a comprehensive and actively evolving accelerator programming framework with multi-vendor, multi-platform, and multi-language support across all the major market players. As a result, OpenMP threading and parallel offload also serve as the backend for several recent and growing performance portability ecosystems, such as Kokkos* and RAJA*. Without modifying the underlying source code, adding just a few lines of OpenMP directives can produce an executable that takes advantage of parallel execution on GPUs while still permitting CPU-only execution if desired.

The OpenMP community has actively advanced the specification since version 5.2 was delivered in November 2021. The next specification, OpenMP 6.0, is targeted for release in November 2024 and provides much more mature support for GPU offload.

We published the article “Solving Heterogeneous Programming Challenges with Fortran and OpenMP” in April 2023 in Parallel Universe Magazine (PUM), which introduced the fundamentals of OpenMP’s GPU offload programming framework.

At Intel, we have a long history of actively contributing to the open source community and specifically to OpenMP. In the latest releases of the Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler, we are implementing many new features introduced as part of OpenMP 5.2 enhancements and the latest Technical Report (TR) 12 and TR 13 proposals for OpenMP 6.0. This includes support for the latest generation Intel® Arc™ Graphics GPU, Intel® Data Center GPU, and the integrated Intel® Arc™ Xe2 Graphics GPU.

With all the advances in accelerated compute and device offload support in OpenMP, it is time for us to revisit the topic and expand on the previous article to discuss the recent evolution of device offload with OpenMP.

Device Offload Support Continues to Evolve

This article aims to update and extend the previous article, covering advanced features in the latest Intel compilers. These compilers will go public in November 2024 as part of the Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit 2025.0 releases.

A comprehensive and complete GPU programming framework typically includes, but is not limited to, the following 5 aspects:

  1. Data management in heterogeneous memory architecture.
  2. Execution policy/configuration of GPU thread management.
  3. Leveraging existing APIs to GPU-optimized libraries or other compilation units.
  4. GPU instruction selection/optimization.
  5. Control flow/branch control of concurrent thread execution.

In our previous PUM article, we covered the basic OpenMP features for items 1, 2, and 3 (items 4 and 5 are out of scope here), plus the Intel OpenMP offload implementation of Fortran standard language parallelism (the do concurrent feature) without explicit directive assistance. This article provides updates on six additional advanced OpenMP features in our upcoming compiler release.

These features are:

  1. GROUPPRIVATE directive (in OpenMP 6.0) for data management in GPU shared local memory.
  2. LOOP directive,
  3. REDUCTION clause on the TEAMS directive, and
  4. NOWAIT clause on the TARGET directive (in OpenMP 5.1) for execution policy.
  5. INTEROP clause on the DISPATCH directive (in OpenMP 6.0) for leveraging existing APIs and enabling SYCL* and OpenMP interoperability.
  6. Enhanced support of “do concurrent” auto-offload for Fortran standard language parallelism: reduce operation, implicit map-type modifier “present”, and an improved thread-launching heuristic.

1. GROUPPRIVATE Directive 

Data management in heterogeneous memory architectures is handled by the OpenMP data environment directives and clauses. As we covered in “Solving Heterogeneous Programming Challenges with Fortran and OpenMP”, the data map directive and its variants handle host-device data movement. For memory usage on the device itself, there is a set of directives and clauses related to privatization.

As part of the OpenMP specification update from 5.2 to 6.0, the groupprivate directive was added to specify that variables should be privatized with respect to a contention group. The groupprivate directive specifies that the listed variables are replicated such that each contention group receives its own copy. Each copy is uninitialized upon creation, and its lifetime is limited to the team's lifetime. If a variable is specified in a groupprivate directive in one compilation unit, it must be specified in a groupprivate directive in every compilation unit in which it is declared. A groupprivate variable must have static storage duration in C/C++ or the "save" attribute in Fortran.

The following is an example:

program groupprivate_example
  use omp_lib
  integer, save :: x
  !$omp groupprivate(x)   ! each team (contention group) gets its own uninitialized copy of x
  !$omp target teams num_teams(4)
  x = omp_get_team_num()
  print *, x
  !$omp end target teams
end program
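
A possible way to build and run this example with GPU offload, following the same command-line style used for the other Fortran examples in this article (the source file name groupprivate_example.f90 is assumed), is:

ifx -qopenmp -fopenmp-targets=spir64 groupprivate_example.f90

OMP_TARGET_OFFLOAD=MANDATORY ./a.out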

2. LOOP Directive

The loop directive was introduced in OpenMP 5.1. It specifies that the collapsed iterations execute in the context of the binding thread set. When the loop directive is used in a target region, the compiler transforms it to “distribute” and “parallel do / parallel for” instead. The loop directive can also form various combined directives with target, teams, and parallel.

As defined in the latest OpenMP specification, LOOP is a worksharing directive that pairs with a parallelism-generating directive in an implementation-dependent way. In the Intel implementation, LOOP is an abstracted directive that is transformed into an appropriate combined directive during compilation. Programmers can check the actual transformed combined directive in the optimization report. In the simple nested-loop example below, the outer “!$omp loop” should transform to “!$omp distribute”, and the inner “!$omp loop” should transform to “!$omp parallel do”.

program loop_directive
        implicit none
        integer :: i,j,bo,bi
        real, allocatable :: A(:,:), B(:,:)
        bo=100 ; bi=200
        allocate(A(bo,bi))
        allocate(B(bo,bi))
        A=1.0d0 ; B=0.0d0
        !$omp target teams map(tofrom:A,B)
        !$omp loop
        DO i=1,bo
          !$omp loop order(concurrent)
          DO j=1,bi
             B(i,j) = A(i,j)
          END DO
          !$omp end loop
        END DO
        !$omp end loop
        !$omp end target teams
        print *,"sum(B)=",sum(B)
end program loop_directive 
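
For reference, a hand-written version of the transformation described above might look like the following sketch (an illustration of the described mapping, not actual compiler output):

!$omp target teams map(tofrom:A,B)
!$omp distribute
DO i=1,bo
  !$omp parallel do
  DO j=1,bi
     B(i,j) = A(i,j)
  END DO
  !$omp end parallel do
END DO
!$omp end distribute
!$omp end target teams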

3. REDUCTION Clause on TEAMS Directive

The REDUCTION clause can be used on a loop construct. Previously, REDUCTION was mainly used on the parallel for / parallel do directives, where the reduction is performed across the threads of a single parallel region. In the newer OpenMP specifications, the REDUCTION clause can also be used on the TEAMS directive, so the reduction can be performed across multiple teams.

The reduction clause is a reduction scoping clause and a reduction participating clause. For each list item, a private copy is created for each implicit task or SIMD lane and is initialized with the initializer value of the reduction identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction identifier.

For the TEAMS directive, one or more private copies of each list item are created for the initial task of each team in the league, leading to the same behavior as if the private clause had been used.

The following example shows how the REDUCTION clause is used on both the parallel and teams directive levels. The libomptarget debug output shows the reduction variables properly handled.

program reduction_clause
  implicit none
  integer ic,ib,ia,i1,i2
  integer :: E2_t, E2, E3
  i2 = 100;  i1 = 16; E2 = 0; E3 = 0
  !$omp target teams distribute reduction(+:E2,E3) private(E2_t) map(tofrom:E2)
  DO ic=1,i1
     E2_t = 0
     !$omp parallel do reduction(+:E2_t) collapse(2)
     DO ib=1,i2   ! use a separate index for the collapsed inner loops
        DO ia=1,i2
           E2_t=E2_t + 1
        ENDDO
     ENDDO
     !$omp end parallel do
     E2 = E2 + E2_t
     E3 = E3 + E2_t
  ENDDO
  !$omp end target teams distribute
  print *,"E2, E3=",E2,E3
end program reduction_clause 
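
To inspect the libomptarget handling of the reduction variables, one possible compile-and-run sequence, following the command-line style used elsewhere in this article (the source file name reduction_clause.f90 is assumed), is:

ifx -qopenmp -fopenmp-targets=spir64 reduction_clause.f90

LIBOMPTARGET_DEBUG=1 OMP_TARGET_OFFLOAD=MANDATORY ./a.out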

4. NOWAIT Clause for Target Directive

The NOWAIT clause on a target construct provides better capabilities for execution that is asynchronous with respect to the host. When the encountering thread reaches the target directive and passes control flow to the device, adding the nowait clause to the omp target directive lets the control flow branch back to the host immediately.

Both host and device then run their own workflows until a predetermined synchronization point. One practical way to set the synchronization point is to insert “!$omp taskwait” before synchronizing host and device data. When both the host and the device workflows reach that point, the two branched control flows merge back on the host.

The following example shows how to execute code asynchronously on a device and host. The nowait clause on a target directive allows the thread of the target task to perform other work while waiting for the target region execution to complete. Hence, the target region can execute asynchronously on the device (without requiring a host thread to idle while waiting for the target task execution to complete).

In this example, two arrays, A and C, are calculated (the calculations are repeated many times to amplify the cost of the computation relative to the host-device data transfer time). Array A is calculated on the device, and array C is calculated on the host concurrently.

The explicit synchronization with “!$omp taskwait” guarantees the completion of the target task (the asynchronous target execution) before host and device data are synchronized. For details, see the task synchronization constructs in the OpenMP specification.

$ cat nowait_clause.f90

program nowait_clause
        implicit none
        integer :: i,j,k, bo,bi
        real(8), dimension(:,:), allocatable :: A, B, C, D
        bo=1000 ; bi=1000
        allocate(A(bo,bi), B(bo,bi), C(bo,bi), D(bo,bi))
        A=2.0 ; B=1.0 ; C=2.0 ; D=1.0
        !$omp target enter data map(to: A, B)
        ! Launch the device computation as a deferred target task (runs asynchronously)
        !$omp target nowait
        !$omp teams distribute parallel do
        do i=1,bo ; do j=1,bi ; do k=1,300000
            A(i,j)=A(i,j)*B(i,j)
        enddo ;  enddo ; enddo
        !$omp end teams distribute parallel do
        !$omp end target
        do i=1,bo ; do j=1,bi ; do k=1,20000
            C(i,j)=C(i,j)*D(i,j)
        enddo ; enddo ; enddo
        ! Wait for the deferred target task to complete before copying data back
        !$omp taskwait
        !$omp target exit data map(from: A, B)
        print *,"sum(A)=",sum(A)," sum(C)=",sum(C)
end program nowait_clause 

5. INTEROP Clause on DISPATCH for Function Variant

The DISPATCH directive was extended with the INTEROP clause to support appending arguments specific to a call site. The interop clause specifies additional arguments to pass to the function variant when variant substitution occurs for the target call in a dispatch directive. The variables in the interop variable list are passed in the same order in which they are specified in the interop clause. If the interop clause is specified on a dispatch directive, the matching “declare variant” directive for the target call must have an “append_args” clause whose number of list items equals or exceeds the number of list items in the interop clause.

Example: OpenMP and SYCL Interoperability

The example below consists of Fortran code that invokes a variant function with an interop object and the variant function implemented in OpenMP and SYCL.

The first part of the Fortran code defines a module that contains an interface for the base routine “vnxc” and a variant routine “vnxc_gpu”. “vnxc” declares a replacement routine “vnxc_gpu” when it is called within a “dispatch” directive by using the “declare variant” directive with a “match” clause. The directive also uses the “append_args” clause to pass the interop object specified in the “dispatch” directive to the variant routine “vnxc_gpu”.

The main program in the Fortran code

  • initializes data array v1 on the host,
  • creates a target data region to map the data array v1 to and from the device,
  • creates an interop object “iop1” with SYCL queue access, and
  • invokes the routine “vnxc” using the “dispatch” directive with “iop1” specified in the “interop” clause; “iop1” is then passed as an additional argument when the variant routine “vnxc_gpu” is invoked.

The SYCL part of the example implements the functions that may be called across languages from the Fortran program.

  • The Fortran program is supposed to call the replacement routine “vnxc_gpu”, so the routine “vnxc” is defined with a print stating it should not be called.
  • The routine “vnxc_gpu” performs simple vector-scalar multiplication using the SYCL queue retrieved from the interop object passed as the last argument to the routine. It accesses the SYCL queue and invokes SYCL code computing the data v1 already residing on the device memory.

 

$ cat interop_clause-fortran_interop.f90

module subs
  interface
    subroutine vnxc_gpu(c, v1, n, iop1)  !! variant function
      use iso_c_binding
      integer, intent(in)  :: c, n
      integer, intent(out) :: v1(10)
      type(c_ptr), intent(in):: iop1
    end subroutine vnxc_gpu
    subroutine vnxc(c, v1, n)  !! base function
      import vnxc_gpu          ! Need to add this statement
      integer, intent(in)  :: c, n
      integer, intent(out) :: v1(10)
      !$omp declare variant(vnxc:vnxc_gpu) &
      !$omp& match(construct={dispatch},device={arch(gen)})  &
      !$omp& append_args(interop(targetsync))
    end subroutine vnxc
  end interface
end module subs

program interop_clause
  use subs
  use omp_lib
  integer v1(10)
  integer i, n, d, c
  integer (kind=omp_interop_kind) :: iop1
  c = 2 ; n = 10 ; do i = 1, n ; v1(i) = i ; enddo
  d = omp_get_default_device()
  !$omp target data map(tofrom: v1(1:10)) use_device_addr(v1)
  !$omp interop init(prefer_type(omp_ifr_sycl), targetsync:iop1) device(d)
  !$omp dispatch device(d) interop(iop1)
  call vnxc(c, v1, n)
  !$omp interop destroy(iop1)
  !$omp end target data
  print *, "v1(1) = ", v1(1), " (2), v1(10) = ", v1(10), " (20)"
end program interop_clause

$ cat interop_clause-sycl_kernel.cpp

#include <omp.h>
#include <stdio.h>
#include <sycl/sycl.hpp>
#define EXTERN_C extern "C"
EXTERN_C void vnxc_(int *c, int *v1, int *n) {
  printf("ERROR: Base function foo should not be called\n");
}
EXTERN_C void vnxc_gpu_(int *c, int *v1, int *n, omp_interop_t obj) {
  int c_val = *c;
  int n_val = *n;
  if (omp_ifr_sycl != omp_get_interop_int(obj, omp_ipr_fr_id, nullptr)) {
    printf("ERROR: Failed to create interop with SYCL queue access\n");
    return;
  }
  auto *q = static_cast<sycl::queue *>(
      omp_get_interop_ptr(obj, omp_ipr_targetsync, nullptr));
  printf("Compute on device\n");
  q->parallel_for(n_val, [=](auto i) { v1[i] = c_val * v1[i]; });
  q->wait();
}

To compile the example, we need the following command lines:

icpx -qopenmp -fopenmp-targets=spir64 -fsycl -c interop_clause-sycl_kernel.cpp

ifx -qopenmp -fopenmp-targets=spir64 -fsycl interop_clause-fortran_interop.f90 interop_clause-sycl_kernel.o

Then we can run the executable as follows:

LIBOMPTARGET_DEBUG=2 OMP_TARGET_OFFLOAD=MANDATORY ./a.out 2>/dev/null

6. Base Language Parallelism Auto-Offload Support

As language specifications for C/C++ and Fortran continue to evolve in the heterogeneous computing era, new language features are being added to express key behaviors and concepts of accelerated computing: concurrency, parallelism, memory hierarchy, and more.

Additional language features are under development to express memory hierarchy. Some key language features expressing concurrency/parallelism are already available, such as the parallel algorithm execution policy “std::par” for C++ and “do concurrent” for Fortran. The two represent similar ideas, so for brevity, let us focus on “do concurrent”.

The following is an example of how to use “do concurrent” to auto-offload a nested loop to the device. The device kernel generation is implementation- and vendor-dependent; for example, the NVIDIA CUDA* compiler generates auto-offload kernels through proprietary PTX instructions, whereas Intel generates auto-offload kernels using the open-standard OpenMP implementation instead.

program doconcurrent_autooffload
    implicit none
    integer :: i, j
    real(8) :: fn
    real(8) , allocatable :: F(:,:), A(:)
    integer :: bo=200, bi=50
    !integer, parameter :: bo=200, bi=50
    allocate(F(bo,bi))
    allocate(A(bo))
    F=0.0001 ; A=0.0
    !$omp target enter data map(to: F, A)
    do concurrent (i=1:bo) local(fn)   ! LOCAL gives each iteration its own copy of fn
        fn = 0.0
        !$omp parallel loop
        do concurrent(j=1:bi) reduce(+:fn)
            fn = fn + F(i,j)
        enddo
        A(i) = fn
        !$omp parallel loop
        do concurrent(j=1:bi)
            A(i) = A(i) + F(i,j)
        enddo
    enddo
    !$omp target exit data map(from: F, A)
    print *, "sum(A)=",sum(A)
end program doconcurrent_autooffload

We need the following compiler flags to generate device kernels:

ifx -g -O3 -xHost -fp-model precise -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent -fopenmp-do-concurrent-maptype-modifier=present -qopt-report=3 doconcurrent_autooffload.f90

As we can see in the optimization report, the array data is assumed to already reside in the device's global memory, and the auto-offloaded kernels therefore do not perform host-device data movements.

We can set:

LIBOMPTARGET_DEBUG=1 LIBOMPTARGET_INFO=7

and, at runtime, the libomptarget debug facility will print information showing the execution policy of the auto-offloaded kernels.
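
For example, assuming the executable produced above is named a.out, a run with these settings could look like:

LIBOMPTARGET_DEBUG=1 LIBOMPTARGET_INFO=7 OMP_TARGET_OFFLOAD=MANDATORY ./a.out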

Ever More Parallel Compute with OpenMP

OpenMP continues to be a robust, multi-vendor, multi-platform, multi-language programming framework for accelerated computing. OpenMP 5.x and the upcoming OpenMP 6.0 specification provide many new features that enable programmers to express the heterogeneous parallelism of an application more efficiently on GPUs. OpenMP’s approach is open and standards-based, so vendors can offer their own implementations with optimized performance for the underlying hardware. At the same time, hardware details remain transparent and can be exposed to programmers and researchers through the runtimes. By applying these high-level OpenMP programming concepts and techniques, application developers can write highly portable and performant code that comes close to the full potential of the GPU hardware.

The OpenMP user community remains very active and keeps growing. At Intel, we are happy to be at the forefront of these discussions and to integrate these capabilities into our own OpenMP runtime libraries, the Intel® Fortran Compiler, and the Intel® oneAPI DPC++/C++ Compiler. We embrace the spirit of open standards, contributing to many open source initiatives for accelerated compute frameworks such as OpenCL*, SYCL, and OpenMP, and to compiler initiatives such as LLVM* and the intermediate SPIR-V* abstraction layer.

Download the Compilers Now

You can download the Intel Fortran Compiler and Intel oneAPI DPC++/C++ Compiler from Intel’s oneAPI Developer Tools product page.

They are also available as part of the Intel® oneAPI HPC Toolkit and the Intel® oneAPI Base Toolkit, respectively, which include an advanced set of foundational tools, libraries, and analysis, debug, and code-migration tools.

You may also want to check out our contributions to the LLVM compiler project on GitHub.

Additional Resources