Elementwise

Intel® oneAPI Threading Building Blocks Developer Guide and API Reference

ID 772616

Date 11/07/2023

Problem

Initiate similar independent computations across items in a data set, and wait until all complete.

Context

Many serial algorithms sweep over a set of items and do an independent computation on each item. However, if some kind of summary information is collected, use the Reduction pattern instead.

Forces

No information is carried or merged between the computations.

Solution

If the number of items is known in advance, use oneapi::tbb::parallel_for. If not, consider using oneapi::tbb::parallel_for_each.

Use agglomeration if the individual computations are small relative to scheduler overheads.

If the pattern is followed by a reduction on the same data, consider doing the element-wise operation as part of the reduction, so that the combination of the two patterns is accomplished in a single sweep instead of two sweeps. Doing so may improve performance by reducing traffic through the memory hierarchy.

Example

Convolution is often used in signal processing. The convolution of a filter c and signal x is computed as:

$$y_i = \sum_{j=0}^{\mathit{clen}-1} c_j \, x_{i-j}$$

Serial code for this computation might look like:

// Assumes c[0..clen-1] and x[1-clen..xlen+clen-2] are defined
for( int i=0; i<xlen+clen-1; ++i ) {
   float tmp = 0;
   for( int j=0; j<clen; ++j )
       tmp += c[j]*x[i-j];
   y[i] = tmp;
}

For simplicity, the fragment assumes that x is a pointer into an array padded with zeros such that x[k] returns zero when k<0 or k≥xlen.
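One way to satisfy this padding assumption is to allocate a buffer with clen-1 zeros on each side of the signal and point x at the first real sample. The sketch below is a serial illustration only; the helper name `convolve_serial` and the use of `std::vector` are assumptions made for the example and do not appear in the original fragment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: serial convolution of filter c with a signal.
// Returns y with signal.size() + c.size() - 1 elements.
std::vector<float> convolve_serial(const std::vector<float>& c,
                                   const std::vector<float>& signal) {
    assert(!c.empty() && !signal.empty());
    const int clen = static_cast<int>(c.size());
    const int xlen = static_cast<int>(signal.size());

    // Zero-padded copy: clen-1 zeros, then the signal, then clen-1 zeros.
    std::vector<float> padded(xlen + 2 * (clen - 1), 0.0f);
    for (int k = 0; k < xlen; ++k)
        padded[clen - 1 + k] = signal[k];

    // x[k] is valid for k in [1-clen, xlen+clen-2] and zero outside [0, xlen).
    const float* x = padded.data() + (clen - 1);

    std::vector<float> y(xlen + clen - 1);
    for (int i = 0; i < xlen + clen - 1; ++i) {
        float tmp = 0;
        for (int j = 0; j < clen; ++j)
            tmp += c[j] * x[i - j];
        y[i] = tmp;
    }
    return y;
}
```

With this layout the smallest index touched, x[1-clen], maps to the first element of the padded buffer and the largest, x[xlen+clen-2], maps to its last element.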

The inner loop does not fit the elementwise pattern, because each iteration adds to the running sum tmp left by the previous iteration. However, the outer loop fits the elementwise pattern: each y[i] is computed independently of the others. It is straightforward to render it using oneapi::tbb::parallel_for as shown:

oneapi::tbb::parallel_for( 0, xlen+clen-1, [=]( int i ) {
   float tmp = 0;
   for( int j=0; j<clen; ++j )
       tmp += c[j]*x[i-j];
   y[i] = tmp;
});

oneapi::tbb::parallel_for does automatic agglomeration by implicitly using oneapi::tbb::auto_partitioner in its underlying implementation. If there is reason to agglomerate explicitly, use the overload of oneapi::tbb::parallel_for that takes an explicit range argument. The following shows the example transformed to use the overload.

oneapi::tbb::parallel_for(
   oneapi::tbb::blocked_range<int>(0,xlen+clen-1,1000),
   [=]( const oneapi::tbb::blocked_range<int>& r ) {
       int end = r.end();
       for( int i=r.begin(); i!=end; ++i ) {
           float tmp = 0;
           for( int j=0; j<clen; ++j )
               tmp += c[j]*x[i-j];
           y[i] = tmp;
       }
   }
);
