Parallelize Data - Intel® oneAPI Threading Building Blocks (oneTBB)...

Intel® Advisor User Guide

ID 766448

Date 6/24/2024

Public

Document Table of Contents

Intel® Advisor User Guide

Introduction

Design and Optimization Methodology

Tutorials and Samples

Get Help and Support

Install and Launch Intel® Advisor

Install Intel® Advisor

Set Up Environment Variables

Set Up System to Analyze GPU Kernels

Set Up Environment to Offload SYCL, OpenMP* target, and OpenCL™ Applications to CPU

Launch Intel® Advisor

GUI Navigation Quick Start

Set Up Project

Configure Target Application

Limit the Number of Threads Used by Parallel Frameworks

Choose a Small, Representative Data Set

Build Target Application

Create Project

Configure Project

Configure Binary/Symbol Search Directories

Configure Source Search Directory

Binary/Symbol Search and Source Search Locations

Analyze Vectorization Perspective

Run Vectorization and Code Insights Perspective from GUI

Vectorization Accuracy Presets

Customize Vectorization and Code Insights Perspective

Run Vectorization and Code Insights Perspective from Command Line

Vectorization Accuracy Levels in Command Line

Explore Vectorization and Code Insights Results

Vectorization Report Overview

Examine Not-Vectorized and Under-Vectorized Loops

Analyze Loop Call Count

Investigate Memory Usage and Traffic

Find Data Dependencies

Analyze CPU Roofline

Run CPU / Memory Roofline Insights Perspective from GUI

CPU Roofline Accuracy Presets

Customize CPU / Memory Roofline Insights Perspective

Run CPU / Memory Roofline Insights Perspective from Command Line

CPU Roofline Accuracy Levels in Command Line

Explore CPU/Memory Roofline Results

CPU Roofline Report Overview

Examine Bottlenecks on CPU Roofline Chart

Examine Relationships Between Memory Levels

Compare CPU Roofline Results

Model Threading Designs

Run Threading Perspective from GUI

Customize Threading Perspective

Run Threading Perspective from Command Line

Threading Accuracy Levels in Command Line

Annotate Code for Deeper Analysis

Annotate Code to Model Parallelism

Before Annotating Code for Deeper Analysis

Use Amdahl's Law and Measure the Program

Task Organization and Annotations

Annotate Parallel Sites and Tasks

Task Patterns

Multiple Parallel Sites

Data and Task Parallelism

Mix and Match Tasks

Choose the Tasks

Task Interactions and Suitability

How Big Should a Task Be?

Use Partially Parallel Programs with Intel® Advisor

Annotations

Annotation Types

Annotation Types Summary

Annotation General Characteristics

Site and Task Annotations for Simple Loops With One Task

Site and Task Annotations for Parallel Sites with Multiple Tasks

Lock Annotations

Pause Collection and Resume Collection Annotations

Special-purpose Annotations

Annotation Definitions Files

Reference the Annotations Definitions Directory

Include the Annotations Header File in C/C++ Sources

Add Annotations into Your Source Code

Copy Annotations and Build Settings Using the Annotation Assistant Pane

Insert Annotations in a Text Editor

Tips for Annotation Use with C/C++ Programs

Control the Expansion of advisor-annotate.h

Handle Compilation Issues that Appear After Adding advisor-annotate.h

Annotation Report

Annotation Report, Clear Description of Storage Row

Annotation Report, Disable Observations in Region Row

Annotation Report, Pause Collection Row

Annotation Report, Inductive Expression Row

Annotation Report, Lock Row

Annotation Report, Observe Uses Row

Annotation Report, Reduction Row

Annotation Report, Re-enable Observations at End of Region Row

Annotation Report, Resume Collection Row

Annotation Report, Site Row

Annotation Report, Task Row

Annotation Report, User Memory Allocator Use Row

Annotation Report, User Memory Deallocator Use Row

Explore Threading Results

Model Threading Parallelism

Suitability Report Overview

Choose Modeling Parameters in the Suitability Report

Fix Annotation-related Errors Detected by the Suitability Tool

Advanced Modeling Options

Reduce Parallel Overhead, Lock Contention, and Enable Chunking

Reduce Site Overhead

Reduce Task Overhead

Reduce Lock Overhead

Reduce Lock Contention

Enable Task Chunking

Check for Dependencies Issues

Code Locations Pane

Filter Pane (Dependencies Report)

Problems and Messages Pane

Dependencies Source Window

Code Locations Pane (Dependencies Source Window)

Focus Code Location Pane

Focus Code Location Call Stack Pane

Related Code Locations Pane

Related Code Location Call Stack Pane

Relationship Diagram Pane

Add Parallelism to Your Program

Before You Add Parallelism: Choose a Parallel Framework

Parallel Frameworks

Intel® oneAPI Threading Building Blocks (oneTBB)

OpenMP*

Microsoft Task Parallel Library* (TPL)

Other Parallel Frameworks

Add the Parallel Framework to Your Build Environment

Enable Intel® oneAPI Threading Building Blocks (oneTBB) in your Build Environment

Define the TBBROOT Environment Variable

Enable C++11 Lambda Expression Support with Intel® oneAPI Threading Building Blocks (oneTBB)

Enable OpenMP* in your Build Environment

Annotation Report

Annotation Report Overview

Locate Annotations with the Annotation Report

Replace Annotations with Intel® oneAPI Threading Building Blocks (oneTBB) Code

Intel® oneAPI Threading Building Blocks (oneTBB) Mutexes

Intel® oneAPI Threading Building Blocks (oneTBB) Simple Mutex - Example

Test the Intel® oneAPI Threading Building Blocks (oneTBB) Synchronization Code

Parallelize Functions - Intel® oneAPI Threading Building Blocks (oneTBB) Tasks

Parallelize Data - Intel® oneAPI Threading Building Blocks (oneTBB) Counted Loops

See Also

Parallelize Data - Intel® oneAPI Threading Building Blocks (oneTBB) Loops with Complex Iteration Control

Replace Annotations with OpenMP* Code

Add OpenMP Code to Synchronize the Shared Resources

OpenMP Critical Sections

Basic OpenMP Atomic Operations

Advanced OpenMP Atomic Operations

OpenMP Reduction Operations

OpenMP Locks

Test the OpenMP Synchronization Code

Parallelize Functions - OpenMP Tasks

Parallelize Data - OpenMP Counted Loops

Parallelize Data - OpenMP Loops with Complex Iteration Control

Next Steps for the Parallel Program

Use Intel® Inspector and Intel® VTune™ Profiler

Debug Parallel Programs

Model Offloading to a GPU

Run Offload Modeling Perspective from GUI

Offload Modeling Accuracy Presets

Customize Offload Modeling Perspective

Run Offload Modeling Perspective from Command Line

Offload Modeling Accuracy Levels in Command Line

Run GPU-to-GPU Performance Modeling from Command Line

Explore Offload Modeling Results

Offload Modeling Report Overview

Examine Regions Recommended for Offloading

Examine Data Transfers for Modeled Regions

Check for Dependency Issues

Explore Performance Gain from GPU-to-GPU Modeling

Investigate Non-Offloaded Code Regions

Advanced Modeling Configuration

Model Application Performance on a Custom Target GPU Device

Check How Assumed Dependencies Affect Modeling

Manage Invocation Taxes

Enforce Offloading for Specific Loops

Analyze GPU Roofline

Test Topic for Embedded Help

Run GPU Roofline Insights Perspective from GUI

GPU Roofline Accuracy Presets

Customize GPU Roofline Insights Perspective

Run GPU Roofline Insights Perspective from Command Line

GPU Roofline Accuracy Levels in Command Line

Explore GPU Roofline Results

Examine GPU Roofline Summary

Examine Bottlenecks on GPU Roofline Chart

Examine Kernel Details

Compare GPU Roofline Results

Design and Analyze Flow Graphs

Where to Find the Flow Graph Analyzer

Launching the Flow Graph Analyzer

Flow Graph Analyzer GUI Overview

Menus

Toolbars

Tabs

Main Canvas

Charts

Flow Graph Analyzer Workflows

Designer Workflow

Adding Nodes, Edges, and Ports

Modifying Node Properties

Viewing Edge Properties

Validating a Graph

Saving a Graph to a File

Generating C++ Stubs

Preferences

Scalability Analysis

Activating the Graph

Scalability Analysis Prerequisites

Setting Concurrency Specification

Setting Data Count

Setting Node Weight

Running the Scalability Analysis

Exploring the Parallelism in a Concurrent Node

Showing Non-Parallel Nature of a Serial Node

Explore Parallelism Provided by the Topology of a Graph

Understanding Analysis Color Codes

Collecting Traces from Applications

Building an Application for Trace Collection

Building an Application on Windows* OS

Building an Application on Linux* OS

Building an Application on macOS*

Collecting Trace Files

Collect Traces In the Flow Graph Analyzer GUI

Collect Traces Outside the Flow Graph Analyzer GUI

Collecting Trace Files with fgtrun Script

Collecting Trace Files without fgtrun Script

Nested Parallelism in Flow Graph Analyzer

Analyzer Workflow

Find Time Regions of Low Concurrency and Their Cause

Finding a Critical Path

Finding Tasks with Small Durations

Reduce Scheduler Overhead using Lightweight Policy

Identifying Tasks that Operate on Common Input

Support for SYCL

Collect SYCL Application Traces

Examine a SYCL Application Graph

Hotspot View

View Performance Inefficiencies of Data-parallel Constructs

Find Issues Using Static Rule-check Engine

Issue: Const Reference to a Host Pointer Used to Initialize a Buffer

Issue: Host Pointer Accessor Used in a Loop

Issue: Data Parallel Construct Inefficiency

Experimental Support for OpenMP* Applications

Collecting Traces for OpenMP* Applications

OpenMP* Constructs in the Per-Thread Task View

OpenMP* Constructs in the Graph Canvas

Sample Trace Files

code_generation Samples

performance_analysis Samples

Additional Resources

Minimize Analysis Overhead

Collection Controls to Minimize Analysis Overhead

Loop Markup to Minimize Analysis Overhead

Filtering to Minimize Analysis Overhead

Execution Speed/Duration/Scope Properties to Minimize Analysis Overhead

Miscellaneous Techniques to Minimize Analysis Overhead

Analyze MPI Applications

Model MPI Application Performance on GPU

Control Collection with an MPI_Pcontrol Function

Manage Results

Open a Result

Rename an Existing Result

Delete a Result

Save Results to a Custom Location

Work with Standalone HTML Reports

Create a Read-only Result Snapshot

Create a Result Snapshot Dialog Box

Command Line Interface

advisor Command Line Interface Reference

advisor Command Action Reference

collect

command

create-project

help

import-dir

mark-up-loops

report

snapshot

version

workflow

advisor Command Option Reference

accuracy

append

app-working-dir

assume-dependencies

assume-hide-taxes

assume-ndim-dependency

assume-single-data-transfer

auto-finalize

batching

benchmarks-sync

bottom-up

cache-binaries

cache-binaries-mode

cache-config

cache-simulation

cache-sources

cachesim


Parallelize Data - Intel® oneAPI Threading Building Blocks (oneTBB) Counted Loops

When tasks are loop iterations, and the iterations run over a range of values that is known before the loop starts, the loop is easy to express with Intel® oneAPI Threading Building Blocks (oneTBB).

Consider the following annotated serial code, in which the loop is the candidate for parallelism:

    ANNOTATE_SITE_BEGIN(sitename);
        for (int i = lo; i < hi; ++i) {
            ANNOTATE_ITERATION_TASK(taskname);
                statement;
        }
    ANNOTATE_SITE_END();

Here is the serial example converted to use oneTBB, after you remove the Intel Advisor annotations:

    #include <tbb/tbb.h>
    ...
    tbb::parallel_for(lo, hi,
        [&](int i) { statement; }
    );

The first two parameters are the loop bounds. As is typical in C++ (especially STL) programming, the lower bound is inclusive and the upper bound is exclusive, so the body runs once for each i with lo <= i < hi. The third parameter is the loop body, wrapped in a lambda expression. The loop body is called in parallel by threads created by oneTBB, so iterations can run in any order and must not depend on one another. As described earlier in Create the Tasks, Using C++ structs Instead of Lambda Expressions, the lambda expression can be replaced with an instance of an explicitly defined class.

Parent topic: Replace Annotations with Intel® oneAPI Threading Building Blocks (oneTBB) Code
