Intel® Fortran Compiler Classic and Intel® Fortran Compiler Developer Guide and Reference

ID 767251
Date 3/22/2024
Public

Parallel Processing Model

A program containing OpenMP* directives begins execution as a single thread, called the initial thread of execution. The initial thread executes sequentially until the first parallel construct is encountered.

The PARALLEL and END PARALLEL directives define the extent of the parallel construct. When the initial thread encounters a parallel construct, it creates a team of threads, with the initial thread becoming the primary thread of the team. All program statements enclosed by the parallel construct are executed in parallel by each thread in the team, including all routines called from within the enclosed statements.

The TARGET and END TARGET directives define a block of code that is to be offloaded for execution on a GPU device. The DECLARE TARGET directive identifies a procedure that has a GPU device version that is to be called from a TARGET region.
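A minimal sketch of how these directives combine; the module, procedure, and array names are illustrative, not part of any API:

```fortran
! Sketch: DECLARE TARGET marks a procedure for device compilation so it
! can be called from inside a TARGET region.
module device_math
contains
  real function scale_val(x)
  !$omp declare target        ! SCALE_VAL gets a device version callable from TARGET regions.
    real, intent(in) :: x
    scale_val = 2.0 * x
  end function scale_val
end module device_math

program offload_example
  use device_math
  implicit none
  integer :: i
  real :: a(1000)

  a = 1.0
  !$omp target map(tofrom: a)  ! Offload the enclosed block to the GPU device.
  !$omp parallel do
  do i = 1, 1000
     a(i) = scale_val(a(i))    ! Calls the device version of SCALE_VAL.
  end do
  !$omp end target
end program offload_example
```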

The statements enclosed lexically within a construct define the static extent of the construct. The dynamic extent includes all statements encountered during the execution of a construct by a thread, including all called routines.

When a thread encounters the end of a structured block enclosed by a parallel construct, the thread waits until all threads in the team have arrived. When that happens the team is dissolved, and only the primary thread continues execution of the code following the parallel construct. The other threads in the team enter a wait state until they are needed to form another team. You can specify any number of parallel constructs in a single program. As a result, thread teams can be created and dissolved many times during program execution.

The following example illustrates, from a high level, the execution model for the OpenMP constructs. The comments in the code explain the structure of each construct or section.

PROGRAM MAIN           ! Begin serial execution.
  ...                  ! Only the initial thread executes.
 !$OMP PARALLEL        ! Begin a Parallel construct, form a team.
   ...                 ! This code is executed by each team member.
  !$OMP SECTIONS       ! Begin a worksharing construct.
    !$OMP SECTION      ! One unit of work.
     ...               !
    !$OMP SECTION      ! Another unit of work.
     ...               !
   !$OMP END SECTIONS  ! Wait until both units of work complete.
   ...                 ! More Replicated Code.
  !$OMP DO             ! Begin a worksharing construct,
     DO                !   each iteration is a unit of work.
     ...               ! Work is distributed among the team.
     END DO            !
  !$OMP END DO NOWAIT  ! End of worksharing construct, NOWAIT
                       !   is specified (threads need not wait).
                       ! This code is executed by each team member.
  !$OMP CRITICAL       ! Begin critical construct.
     ...               ! One thread executes at a time.
  !$OMP END CRITICAL   ! End the critical construct.
   ...                 ! This code is executed by each team member.
  !$OMP BARRIER        ! Wait for all team members to arrive.
   ...                 ! This code is executed by each team member.
 !$OMP END PARALLEL    ! End of parallel construct, disband team
                       !   and continue with serial execution.
 !$OMP TARGET          ! This code is compiled and executed on a GPU device
  ...
 !$OMP END TARGET
  ...                  ! Possibly more parallel or offload constructs  
END PROGRAM MAIN       ! End serial execution.

Use Orphaned Directives

In routines called from within parallel constructs, you can also use directives. Directives that are not in the static extent of the parallel construct, but are in the dynamic extent, are called orphaned directives. Orphaned directives allow you to execute portions of your program in parallel with only minimal changes to the sequential version of the program. Using this functionality, you can code parallel constructs at the top levels of your program call tree and use directives to control execution in any of the called routines. For example:

subroutine F
  ...
  !$OMP PARALLEL
    call G
  !$OMP END PARALLEL
  ...
end subroutine F

subroutine G
  !$OMP DO           ! This is an orphaned directive.
  ...
  !$OMP END DO
end subroutine G

This is an orphaned DO directive because the PARALLEL construct is not lexically present in subroutine G.

Data Environment

You can control the data environment of OpenMP constructs by using data environment clauses supported by the construct. You can also privatize named global-lifetime objects by using the THREADPRIVATE directive.
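As a brief sketch of the THREADPRIVATE directive, the following gives each thread its own copy of a module variable; the module and variable names are illustrative:

```fortran
! Sketch: THREADPRIVATE privatizes a named global-lifetime object.
module counters
  use omp_lib
  implicit none
  integer :: my_count = 0
  !$omp threadprivate(my_count)  ! Each thread gets its own copy of MY_COUNT.
contains
  subroutine record_thread()
    !$omp parallel
    my_count = omp_get_thread_num()  ! No race: each thread writes its private copy.
    !$omp end parallel
  end subroutine record_thread
end module counters
```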

Refer to the OpenMP specification for the full list of data environment clauses. Some commonly used ones include:

  • DEFAULT

  • SHARED

  • PRIVATE

  • FIRSTPRIVATE

  • LASTPRIVATE

  • REDUCTION

  • LINEAR

  • MAP

You can use directive clauses to control the data-scoping attributes of variables for the duration of the construct in which you specify them. If you do not specify a data-scoping clause for a variable on a directive, the variable's behavior is determined by the default scoping rules described in the OpenMP specification.
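A short loop combining several of the clauses listed above (the variable names are illustrative); DEFAULT(NONE) forces every variable to be scoped explicitly:

```fortran
program scoping_example
  implicit none
  integer :: i, n, total
  real :: x(100)

  n = 100
  total = 0
  !$omp parallel do default(none) shared(x, n) private(i) reduction(+:total)
  do i = 1, n
     x(i) = real(i)        ! X is shared; each iteration writes a distinct element.
     total = total + i     ! Each thread accumulates a private partial sum;
  end do                   !   the partial sums are combined when the loop ends.
  !$omp end parallel do
  print *, total           ! 5050
end program scoping_example
```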

Determine How Many Threads to Use

For applications where the workload depends on application input that can vary widely, delay the decision about the number of threads to employ until runtime, when the input sizes can be examined. Examples of workload input parameters that affect the thread count include matrix size, database size, image/video size and resolution, depth, breadth, and bushiness of tree-based structures, and size of list-based structures. Similarly, for applications designed to run on systems where the processor count can vary widely, defer choosing the number of threads until application runtime, when the machine size can be examined.

For applications where the amount of work is unpredictable from the input data, consider using a calibration step to understand the workload and system characteristics to aid in choosing an appropriate number of threads. If the calibration step is expensive, the calibration results can be made persistent by storing the results in a permanent place like the file system.

Avoid simultaneously using more threads than the number of processing units on the system. This situation causes the operating system to multiplex threads on the processors and typically yields sub-optimal performance.

When developing a library as opposed to an entire application, provide a mechanism whereby the user of the library can conveniently select the number of threads used by the library, because it is possible that the user has outer-level parallelism that renders the parallelism in the library unnecessary or even disruptive.

Use the NUM_THREADS clause on parallel regions to control the number of threads employed, and use the IF clause on parallel regions to decide whether to employ multiple threads at all. The OMP_SET_NUM_THREADS() routine can also be used, but it affects all subsequent parallel regions encountered by the calling thread, whereas the NUM_THREADS clause is local in its effect and does not impact other parallel regions. The disadvantages of explicitly setting the number of threads are:

  1. In a system with a large number of processors, your application will use some but not all of the processors.

  2. In a system with a small number of processors, your application may force oversubscription that results in poor performance.
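The NUM_THREADS and IF clauses can be combined on a single directive, as in this sketch; the array size and the threshold of 1000 are illustrative values, not recommendations:

```fortran
program control_threads
  implicit none
  integer :: i, n
  real, allocatable :: a(:)

  n = 2000                  ! In a real program this would come from input.
  allocate(a(n))

  ! Run in parallel with 4 threads only when the problem is big enough
  ! to amortize the threading overhead; otherwise run serially.
  !$omp parallel do if(n > 1000) num_threads(4)
  do i = 1, n
     a(i) = sqrt(real(i))
  end do
  !$omp end parallel do
end program control_threads
```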

The Intel OpenMP runtime creates as many threads as there are available logical processors unless you specify a different number, for example with the OMP_SET_NUM_THREADS() routine. To determine the actual limits, use OMP_GET_THREAD_LIMIT() and OMP_GET_MAX_ACTIVE_LEVELS(). Carefully consider your thread usage and nesting of parallelism to avoid overloading the system. The OMP_THREAD_LIMIT environment variable limits the number of OpenMP threads used for the whole OpenMP program. The OMP_MAX_ACTIVE_LEVELS environment variable limits the number of active nested parallel regions.
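The limits described above can be queried at runtime through the OMP_LIB module, for example:

```fortran
program query_limits
  use omp_lib
  implicit none

  ! These API routines report the limits set by OMP_THREAD_LIMIT and
  ! OMP_MAX_ACTIVE_LEVELS (or the implementation defaults).
  print *, 'Thread limit:      ', omp_get_thread_limit()
  print *, 'Max active levels: ', omp_get_max_active_levels()
  print *, 'Max threads:       ', omp_get_max_threads()
end program query_limits
```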

Binding Sets and Binding Regions

The binding task set for an OpenMP construct is the set of tasks that are affected by, or provide the context for, the execution of its region. It can be all tasks, the current team tasks, all tasks of the current team that are generated in the region, the binding implicit task, or the generating task.

The binding thread set for an OpenMP construct is the set of threads that are affected by, or provide the context for, the execution of its region. It can be all threads on a device, all threads in a contention group, all primary threads executing an enclosing teams region, the current team, or the encountering thread.

The binding region for an OpenMP construct is the enclosing region that determines the execution context and the scope of the effects of the directive:

  • The binding region for an ORDERED construct is the innermost enclosing DO loop region.

  • The binding region for a TASKWAIT construct is the innermost enclosing TASK region.

  • For all other constructs for which the binding thread set is the current team or the binding task set is the current team tasks, the binding region is the innermost enclosing PARALLEL region.

  • For constructs for which the binding task set is the generating task, the binding region is the region of the generating task.

  • A PARALLEL construct need not be active to be a binding region.

  • A TASK construct need not be explicit to be a binding region.

  • A region never binds to any region outside of the innermost enclosing parallel region.

  • See specific directive pages for binding information for additional directives.
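To illustrate the first rule above, this sketch shows an ORDERED construct binding to its innermost enclosing DO loop region (the loop bounds and output are illustrative):

```fortran
! Sketch: ORDERED binds to the innermost enclosing DO loop region.
! The ORDERED clause on the loop directive is required for the
! ORDERED construct inside the loop to be legal.
program ordered_binding
  implicit none
  integer :: i

  !$omp parallel do ordered
  do i = 1, 8
     ! ... work here may execute out of order across threads ...
     !$omp ordered
     print *, 'iteration', i   ! Executed in loop order, 1 through 8.
     !$omp end ordered
  end do
  !$omp end parallel do
end program ordered_binding
```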