Enable Platform-Appropriate Vectorization

Intel® VTune™ Profiler

Performance Analysis Tutorial for Linux* OS

ID 762029

Date 12/20/2024

Enable the use of vector registers as appropriate for your platform. Then check whether the vectorization efficiency improves.

In this step, you compile the matrix application with the -xCOMMON-AVX512 option so that the compiler targets the AVX-512 instruction set, which this tutorial assumes is the best extension your processor supports. To generate multiple code paths that let your software run on several microarchitectures, refer to the ax, Qax option (-ax on Linux, /Qax on Windows) of the Intel® oneAPI DPC++/C++ Compiler.

Enable Full Vectorization

To use a vector instruction set that is appropriate for your platform, instruct the compiler to target the most capable vector extension that your processor supports.
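Before choosing a target option, it can help to confirm which vector extensions the processor actually reports. The following is a minimal sketch, not part of the tutorial sample: the function names are hypothetical, and it assumes an x86 target and a GCC-compatible compiler that provides the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the matrix sample.
 * Returns a label for the widest vector extension the running CPU reports.
 * Assumes an x86 target and GCC-compatible CPU feature builtins. */
const char *best_vector_extension(void)
{
    __builtin_cpu_init(); /* initialize the CPU feature data used below */

    /* -xCOMMON-AVX512 relies on the AVX-512 Foundation (F) and
     * Conflict Detection (CD) subsets, so check both. */
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512cd"))
        return "AVX-512";
    if (__builtin_cpu_supports("avx2"))
        return "AVX2";
    if (__builtin_cpu_supports("avx"))
        return "AVX";
    if (__builtin_cpu_supports("sse4.2"))
        return "SSE4.2";
    return "baseline";
}

/* Prints a one-line report; call it from main() in a scratch program. */
void report_vector_extension(void)
{
    printf("Widest reported vector extension: %s\n", best_vector_extension());
}
```

If the report is anything other than AVX-512, a binary built with -xCOMMON-AVX512 is not expected to run on that machine, and a lower target or -xHost is the safer choice.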

Follow these steps to enable platform-appropriate vectorization:

1. Open the Makefile located in ../matrix/linux with a text editor.
2. Change line 43 from:
```
OPTFLAGS = 
```
to
```
OPTFLAGS = -xCOMMON-AVX512
```
This option directs the compiler to generate code for the AVX-512 instruction set, so the resulting binary requires a processor that supports AVX-512. Alternatively, use the -xHost option to target the best instruction set extension available on the processor of the machine where you compile.
3. Save and close the Makefile, then recompile the application with this command:
```
make icc
```
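To make the effect of the option more concrete, here is a sketch of the kind of loop a compiler can vectorize. It is a hypothetical kernel written for illustration, not the source of the matrix sample, and the name `multiply_ikj` is an assumption. The innermost loop walks both arrays with unit stride, so a compiler targeting 512-bit registers can process 8 doubles (512 / 64 bits) per vector instruction.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration; not taken from the matrix sample.
 * Computes c = a * b for n x n row-major matrices of doubles. */
void multiply_ikj(size_t n, const double *restrict a,
                  const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        /* Clear row i of the result before accumulating into it. */
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;

        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            /* Unit-stride accesses to b and c with no dependence between
             * iterations of j: a good candidate for vectorization. */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

The `restrict` qualifiers tell the compiler that the three arrays do not overlap, which removes one common obstacle to vectorization.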

Check Vectorization with HPC Performance Characterization Analysis

Repeat the HPC Performance Characterization analysis to ensure that the matrix application is properly vectorized.

Once the analysis is finished, see the result in the Summary window.

Notice that:

The Elapsed Time for the application has decreased slightly.
The Vectorization metric is 100%, so the code was fully vectorized.
100.0% of Packed DP FLOP instructions were executed using the 512-bit registers. The vector instruction set used is primarily AVX-512.
Although the overall speed of the application improved, some performance indicators like CPI Rate and Memory Bound worsened with the vectorization change. This can happen because the compiler's loop optimizations may include transformations such as unrolling, which affect the memory access pattern. The matrix sample contains advanced techniques like cache blocking to improve this condition.
The matrix application runs very quickly. A short run time makes its measurements more susceptible to interference from other tasks and overhead on your system, so minor optimizations may not consistently show the performance improvement you expect. Increasing the size of the matrices lengthens the run and can make the results more stable.
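The cache blocking mentioned above can be sketched as follows. This is a generic illustration of the technique under stated assumptions, not the implementation in the matrix sample: the function name `multiply_blocked` and the `block` parameter are hypothetical, and the best tile size depends on the cache sizes of your processor.

```c
#include <stddef.h>

/* Returns the smaller of two sizes. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Hypothetical cache-blocked kernel; not taken from the matrix sample.
 * Computes c = a * b for n x n row-major matrices of doubles, working on
 * block x block tiles so that each tile is reused while it is in cache. */
void multiply_blocked(size_t n, size_t block, const double *restrict a,
                      const double *restrict b, double *restrict c)
{
    if (block == 0)
        block = 1; /* guard against a zero step in the tile loops */

    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0.0;

    for (size_t ii = 0; ii < n; ii += block) {
        const size_t i_end = min_size(ii + block, n);
        for (size_t kk = 0; kk < n; kk += block) {
            const size_t k_end = min_size(kk + block, n);
            for (size_t jj = 0; jj < n; jj += block) {
                const size_t j_end = min_size(jj + block, n);
                /* Multiply one tile of a by one tile of b. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        const double aik = a[i * n + k];
                        /* The unit-stride inner loop stays vectorizable. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Passing the tile size as a parameter makes it easy to try several values and compare the Memory Bound metric between runs.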

Performance optimization is an iterative process. The matrix sample contains more techniques you can consider for performance improvement.

