Intel® Advisor User Guide

ID 766448
Date 10/31/2024
Public
Document Table of Contents

How Big Should a Task Be?

The ideal task size is very dependent on the details of your program. Here are a few general considerations to keep in mind.

Task Overhead

In general, if your program can keep most of the cores on your system busy doing useful work, then it will be using the system about as efficiently as possible. There are two parts to this: keeping the cores busy, and doing useful work.

It takes time to start a new task. If your tasks are too small, then your program may spend more time creating tasks than it saves by running them in parallel - the cores are kept busy, but not doing useful work.

Load Balance

On the other hand, very large tasks can reduce parallelism: your parallel program cannot finish any more quickly than the longest-running task. A rule of thumb is to try to have the number of tasks in a parallel site be at least several times larger than the number of cores available, so that there will always be some work to do when a core is free.

Choosing the Right Level

You will often have the opportunity to create tasks at different loop nesting levels or function call depths. This may provide an easy way to choose your task size. For example, consider the C/C++ code:

 for (i = 0; i != N; ++i) {
    for (j = 0; j != N; ++j) {
        x[i, j] = y[i, j] * z[j, i];
    }
 }

The inner loop body is too small to be a useful task. You can view the Suitability Report for a task's Average Instance Time. The entire inner loop might be more suitable:

ANNOTATE_SITE_BEGIN(sitename);
for (i = 0; i < N; ++i) {
    ANNOTATE_ITERATION_TASK(task_process_array);
    for (j = 0; j < N; ++j) {
        x[i, j] = y[i, j] * z[j, i];
    }
}
ANNOTATE_SITE_END();

Blocking

If you have a loop which seems like an obvious place to introduce parallelism, but the loop body is too small to make a good task, consider grouping several iterations together. When you specify a loop body as a parallel construct,Intel® oneAPI Threading Building Blocks and OpenMP* will automatically group multiple loop iterations together to create tasks of an appropriate size. Therefore, given a simple loop, the question is not whether the loop body is the right size for a good task, but whether the total loop execution time can be divided up into chunks of the right size.

For example, there is only one loop level here, and its body looks too small to be a good task:

 for (i = 0; i < 100000; ++i) {
    a[i] = b[i] * c[i];
}

Go ahead and choose it, and it may run as though you had written it as:

ANNOTATE_SITE_BEGIN(sitename);
for (i = 0; i < 100000; i += 1000) {
    ANNOTATE_ITERATION_TASK(task_calculate_a);
    for (j = i; j < i + 1000; ++j) {
        a[j] = b[j] * c[j];
    }
}
ANNOTATE_SITE_END();

Sizing to Avoid Interactions

It is not uncommon for loop iterations or other potential task bodies to be almost independent at one level, but have many interactions at other levels. In this case, it may be worth accepting a less than perfect program gain in exchange for simpler programming and cleaner code.

The outer loop of the Sudoku problem generator repeatedly calls the generate() function to generate problems. There are opportunities for introducing parallelism at many different levels in the problem generation function, but the individual calls to generate() are almost perfectly independent, and each call to generate() takes less than a second. Parallelizing the outermost loop would be a trivial project. No user is likely to care if it takes 0.8 seconds instead of 0.2 seconds to generate a single problem, and the speedup for generating more than a handful of problems should be nearly perfect.

Using the Survey Report

Ultimately, choosing your tasks is more of an art than a science. Locations close to the root of the call tree tend to form larger tasks, but may have more conflicts on shared variables; locations toward the leaves of the call tree tend to be smaller, causing problems with task overhead, but typically have fewer conflicts. We can offer some rules of thumb. Start by looking at a function F that uses a significant portion of the time of the program part you are trying to improve - remember Amdahl's law!

  • If almost all of the time spent in F is spent in a block of code that is executed many times in a loop, then that block of code may be a prime candidate for a data-parallel task.

  • If F is basically just a wrapper around a call to a function G, then look at G instead.

  • If almost all of the time in F is spent in multiple calls to a function G that is too large to be a good task, then you may want to enclose the calls to G in a parallel site, but introduce the actual tasks inside G or another function that is called from G.

  • If the time spent in F is distributed across a number of distinct activities, you should consider whether it is better to apply the task parallelism pattern to F, or to use the multiple parallel sites pattern to look for parallelism in each of the activities.

Recursion

Recursive algorithms can present a special challenge. The problem occurs when you have a large amount of time spent in a function that only does a small amount of work in any one invocation, but that is called recursively a great many times. The actual work may be data-parallel, but the function body is too small to be a useful task by itself, and the blocking strategy (see Blocking above) is harder to apply to a recursive algorithm.

The general solution is to use a threshold to control recursive parallelism. For example, a recursive sort might solve sub-problems in parallel only if they are above a certain threshold size.