Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 3/31/2023
Public


Automatic Parallelization

The auto-parallelization feature of the Intel® C++ Compiler automatically translates serial portions of the input program into equivalent multithreaded code. Automatic parallelization determines which loops are good worksharing candidates, performs the dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation, as is otherwise needed when programming with OpenMP directives. The OpenMP and auto-parallelization functionality provides performance gains from shared memory on multiprocessor and multicore systems.

The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops that can safely and efficiently be executed in parallel.

This behavior enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems.

The guided auto-parallelization feature of the Intel® C++ Compiler helps you locate portions of your serial code that can be parallelized further. You can invoke guidance for parallelization, vectorization, or data transformation using the [Q]guide series of compiler options.

Automatic parallelization frees developers from having to:

  • Find loops that are good worksharing candidates.
  • Perform the dataflow analysis to verify correct parallel execution.
  • Partition the data for threaded code generation as is needed in programming with OpenMP directives.

Although OpenMP directives enable serial applications to be transformed into parallel applications quickly, you must explicitly identify specific portions of your application code that contain parallelism and add the appropriate compiler directives. Auto-parallelization, which is triggered by the [Q]parallel option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to decompose the code sequences into separate threads for parallel processing. No other effort by the programmer is needed.

NOTE:
To execute a program that uses auto-parallelization on Linux or macOS systems, you must include the -parallel compiler option when you compile and link your program.

NOTE:

Using this option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may gain additional performance on Intel® microprocessors compared to non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch (Windows), -m (Linux and macOS), or [Q]x.

Serial code can be divided so that the code can execute concurrently on multiple threads. For example, consider the following serial code example:

void ser(int *a, int *b, int *c) {
  for (int i=0; i<100; i++)
    a[i] = a[i] + b[i] * c[i]; 
}

The following example illustrates one way the loop iteration space shown in the previous example might be divided to execute on two threads:

void par(int *a, int *b, int *c) {
  int i;
  // Thread 1
  for (i=0; i<50; i++)
    a[i] = a[i] + b[i] * c[i];
  // Thread 2
  for (i=50; i<100; i++)
    a[i] = a[i] + b[i] * c[i]; 
}

Auto-Vectorization and Parallelization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8, or (up to) 16 elements in one operation, depending on the data type. In some cases, auto-parallelization and vectorization can be combined for better performance results.

The following example demonstrates how code can be designed to explicitly benefit from parallelization and vectorization. Assuming you compile the code shown below using the [Q]parallel option, the compiler will parallelize the outer loop and vectorize the innermost loop:

#include <stdio.h>
#define ARR_SIZE 500 // Array dimension

int main() {
  int matrix[ARR_SIZE][ARR_SIZE];
  int arrA[ARR_SIZE] = {10};
  int arrB[ARR_SIZE] = {30};
  int i, j;
  for (i = 0; i < ARR_SIZE; i++) {
    for (j = 0; j < ARR_SIZE; j++) {
      matrix[i][j] = arrB[i] * (arrA[i] % 2 + 10);
    }
  }
  printf("%d\n", matrix[0][0]);
  return 0;
}

When you compile the example code with the appropriate options, the compiler should report results similar to the following:

vectorization.c(18) : (col. 6) remark: LOOP WAS VECTORIZED.
vectorization.c(16) : (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.

With the relatively small effort of adding OpenMP directives to existing code, you can transform a sequential program into a parallel program. The [Q]openmp option must be specified to enable the OpenMP directives.

The following example demonstrates one method of using the OpenMP pragmas within code:

#include <stdio.h>
#define ARR_SIZE 100 // Array dimension

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c);

int main() {
  int arr_a[ARR_SIZE];
  int arr_b[ARR_SIZE];
  int arr_c[ARR_SIZE];
  int i, j;
  int matrix_a[ARR_SIZE][ARR_SIZE];
  int matrix_b[ARR_SIZE][ARR_SIZE];
  // Initialize the arrays and matrices.
  #pragma omp parallel for
  for (i = 0; i < ARR_SIZE; i++) {
    arr_a[i] = i;
    arr_b[i] = i;
    arr_c[i] = ARR_SIZE - i - 1; // Keep values in the valid index range 0..ARR_SIZE-1.
    for (j = 0; j < ARR_SIZE; j++) {
      matrix_a[i][j] = j;
      matrix_b[i][j] = i;
    }
  }
  foo(matrix_a, matrix_b, arr_a, arr_b, arr_c);
}

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c) {
  int i, num, arr_x[ARR_SIZE];
  // Expresses the parallelism using the OpenMP pragma: parallel for.
  // The pragma guides the compiler in generating multithreaded code.
  // Arrays arr_x, ma, mb, a, b, and c are shared among threads based on
  // OpenMP data-sharing rules. Scalar num is specified as private
  // for each thread.
  #pragma omp parallel for private(num)
  for (i = 0; i < ARR_SIZE; i++) {
    num = ma[b[i]][c[i]];
    arr_x[i] = mb[a[i]][num];
    printf("Values: %d\n", arr_x[i]); // Prints values 0..ARR_SIZE-1.
  }
}

NOTE:

Options that use OpenMP are available for both Intel® and non-Intel microprocessors, but these options may perform more optimizations on Intel® microprocessors than on non-Intel microprocessors. The list of major, user-visible OpenMP constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.

Using Parallelism Reports

To generate a parallelism report, use the -qopt-report-phase=par option (Linux and macOS) or the /Qopt-report-phase:par option (Windows), along with the -qopt-report=n (Linux and macOS) or /Qopt-report:n (Windows) option. By default, the auto-parallelism report generates a medium level of detail (n=2). Use the [q or Q]opt-report option along with the [q or Q]opt-report-phase option if you want a greater or lesser level of detail. Specifying a value of 5 generates the maximum diagnostic detail.

Run the report by entering commands similar to the following:

Linux

icpc -c -parallel -qopt-report-phase=par -qopt-report=5 sample.cpp

macOS

icpc -c -parallel -qopt-report-phase=par -qopt-report=5 sample.cpp

Windows

icl /c /Qparallel /Qopt-report-phase:par /Qopt-report:5 sample.cpp

NOTE:
The -c (Linux and macOS) or /c (Windows) option prevents linking and instructs the compiler to stop compilation after the object file is generated, so the example is compiled without generating an executable.

By default, the report is written to a file with the same name as the object file and a .optrpt extension, in the same directory as the object file. Using the above command-line entries, you will obtain an output file called sample.optrpt. Use the [q or Q]opt-report-file option to specify a different name for the file that captures the report results. Use the arguments stdout or stderr to send the optimization report to stdout or stderr.

For example, assume you want a full diagnostic report on the following example code:

#include <math.h> // Needed for sqrt.

void no_par(void) {
  int i;
  int a[1000];
  for (i=1; i<1000; i++) {
    a[i] = (i * 2) % i * 1 + sqrt(i);
    a[i] = a[i-1] + i; // Loop-carried dependence: reads the previous iteration's result.
  }
}

The following example output illustrates the diagnostic report generated by the compiler for the example code shown above. In most cases, the comment listed next to the line is self-explanatory:

procedure: no_par
sample.c(13):(3) remark #15048: DISTRIBUTED LOOP WAS AUTO-PARALLELIZED 
sample.c(13):(3) remark #15050: loop was not parallelized: existence of parallel dependence 
sample.c(19):(5) remark #15051: parallel dependence: proven FLOW dependence between a line 19, and a line 19 

For more information on options to generate reports, see the Optimization Report Options topic.