Intel® VTune™ Profiler Performance Analysis Cookbook

ID 766316
Date 9/05/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Scheduling Overhead in Intel® Threading Building Blocks (Intel® TBB) Apps

This recipe shows how to detect and fix scheduling overhead for an Intel TBB application.

NOTE:
Intel® Threading Building Blocks library (previously a part of Intel® Parallel Studio XE and Intel® System Studio packages) has been replaced by Intel® oneAPI Threading Building Blocks (oneTBB). Download oneTBB from the Intel® oneAPI Base Toolkit.

Scheduling overhead is a typical problem of code threading when fine-grain chunks of work between threads are distributed dynamically. In the case of scheduling overhead, a significant amount of time spent by the scheduler to assign the work for working threads and the time worker threads spend waiting for a new portion of work make parallelism inefficient and, in ultimate cases, a threaded version of a program can be even slower than the sequential one. The majority of Intel TBB constructs use a default auto-partitioner that tailors the number of chunks larger than the default grain size to avoid overhead. If you use a simple partitioner either intentionally or with constructs like parallel_deterministic_reduce, you need to take care of a grain size since a simple partitioner divides the work into chunk sizes up to the default grain size of one iteration. Intel® VTune™ Amplifier helps detect scheduling overhead for Intel TBB applications and provides some advice on increasing the grain size to avoid a slowdown connected with it.

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

  • Application: a sample application that calculates the sum of vector elements using Intel TBB parallel_deterministic_reduce template function.

  • Compiler : Intel® Compiler or GNU* compiler with compiler/linker options, for example:

    -I <tbb_install_dir>/include -g -O2 -std=c++11 -o vector-reduce vector-reduce.cpp -L <tbb_install_dir>/lib/intel64/gcc4.7 -ltbb

  • Performance analysis tools: Intel® VTune™ Amplifier 2019: Threading analysis

    NOTE:
    • Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.

    • Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.

    • Get the latest version of Intel® VTune™ Profiler:

  • Operating system: Ubuntu* 16.04 LTS

  • CPU: Intel Xeon® CPU E5-2699 v4 @ 2.20GHz

Create a Baseline

The initial version of the sample code uses parallel_deterministic_reduce with the default grain size (line 17-23):

#include <stdlib.h>
#include "tbb/tbb.h"

static const size_t SIZE = 50*1000*1000;
double v[SIZE];

using namespace tbb;

void VectorInit( double *v, size_t n )
{
    parallel_for( size_t( 0 ), n, size_t( 1 ), [=](size_t i){ v[i] = i * 2; } );
}

double VectorReduction( double *v, size_t n )
{

    return parallel_deterministic_reduce(
        blocked_range<double*>( v, v + n ),
        0.f,
        [](const blocked_range<double*>& r, double value)->double {
            return std::accumulate(r.begin(), r.end(), value);
        },
        std::plus<double>()
    );
}

int main(int argc, char *argv[])
{
    task_scheduler_init( task_scheduler_init::automatic );

    VectorInit( v, SIZE );

    double sum;

    for (int i=0; i<100; i++)
		  sum = VectorReduction( v, SIZE );

    return 0;
}

The vector sum calculation is repeated in the loop in line 35 to make compute work more significant and measurable for statistical analysis.

Running the compiled application takes about 9 seconds. This is a performance baseline that could be used for further optimizations.

Run Threading Analysis

To estimate threading parallelism of your application and time spent on scheduling overhead, run the Concurrency analysis:

  1. Click the New Project button on the toolbar and specify a name for the new project, for example: vector-reduce.

  2. Click Create Project.

    The Configure Analysis window opens.

  3. On the WHERE pane, select the Local Host target system type.

  4. On the WHAT pane, select the Launch Application target type and specify an application for analysis.

  5. On the HOW pane, click the browse button and select Parallelism > Threading analysis.

  6. Click theStart button.

VTune Amplifier launches the application, collects data, finalizes the data collection result resolving symbol information, which is required for successful source analysis.

NOTE:

Since the analysis is based on instrumentation and uses a stack stitching technology, the elapsed time of the instrumented application can be slower than the original application run because of the collection overhead.

Identify Scheduling Overhead

Start your analysis with the Summary view that displays application-level statistics.

The Effective CPU Utilization Histogram shows that, on average, the application utilized only ~3 physical cores out of available 88:

Flagged Overhead Time metric and the Scheduling sub-metric signal a threading inefficiency problem that should be explored:

The Top Hotspots section shows [TBB Dispatch Loop] as the main time-consuming function. The hint on the flag associated with the function informs about scheduling overhead that should be addressed with increasing a grain size of parallel work:

The sample application contains two Intel TBB constructs - parallel_for initialization and parallel_deterministic_reduce - to calculate the sum of vector elements. Discover which Intel TBB construct brings overhead with the help of the Caller/Callee data view. Expanding the CPU Time: Total > Overhead Time columns sorts the grid by the Scheduling column. Find the first row with the Intel TBB parallel construct in the Function list:

This row points to the parallel_deterministic_reduce construct in the VectorReduction function that has the highest scheduling overhead. Try to make work chunks more coarse-grain to eliminate the overhead in parallelization with this construct.

NOTE:

To better visualize the imbalance during the application run, explore the Timeline view. Useful work shows up in green while the black color marks the time waste.

Increase Grain Size of Parallel Work

As mentioned earlier, the fine-grain chunks of work assigned to worker threads, cannot compensate the time spent by the scheduler on the work assignment. The default chunk size for parallel_deterministic_reduce that uses a simple partitioner is 1. This means that worker threads will take just one loop iteration to execute before turning back to the scheduler for a new portion of work. Consider increasing the minimal chunk size to 10000 (line 5 in the snippet below):

double VectorReduction( double *v, size_t n )
{

    return parallel_deterministic_reduce(
        blocked_range<double*>( v, v + n, 10000 ),
        0.f,
        [](const blocked_range<double*>& r, double value)->double {
            return std::accumulate(r.begin(), r.end(), value);
        },
        std::plus<double>()
    );
}

And re-run the Threading analysis:

You see that the elapsed time of the application has significantly reduced, average effective CPU utilization is ~38 logical cores (because the metric counts a warmup phase, compute phase CPU Utilization is closer to 80 cores), and CPU time spent on Intel TBB scheduling or other parallel work arrangement is negligible. An overall speedup of this small code modification is more than 10X in comparison with the original version of the application run without collection time.

NOTE:

To discuss this recipe, visit the developer forum.