Automatic Chunking

Intel® oneAPI Threading Building Blocks Developer Guide and API Reference

ID 772616

Date 10/31/2024

A parallel loop construct incurs an overhead cost for every chunk of work that it schedules. oneAPI Threading Building Blocks (oneTBB) chooses chunk sizes automatically, depending on load-balancing needs. The heuristic attempts to limit this overhead while still providing ample opportunities for load balancing.

CAUTION:

Typically a loop needs to take at least a million clock cycles for parallel_for to be worth using. For example, a loop that takes at least 500 microseconds on a 2 GHz processor (500 microseconds × 2,000 cycles per microsecond = 1,000,000 cycles) might benefit from parallel_for.
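The arithmetic behind this rule of thumb can be sketched in a few lines. The helper names below are hypothetical and not part of oneTBB. The estimate assumes the loop's serial running time has already been measured, and it treats the million-cycle figure as a rough threshold rather than a guarantee.

```cpp
#include <cstdint>

// Rule-of-thumb threshold from the text: about one million clock cycles.
constexpr std::uint64_t kMinCyclesForParallelFor = 1000000;

// Hypothetical helper: approximate cycle count of a loop that runs for
// `loop_microseconds` on a processor clocked at `clock_mhz`.
// 1 MHz is one cycle per microsecond, so cycles = microseconds * MHz.
constexpr std::uint64_t estimated_cycles(std::uint64_t loop_microseconds,
                                         std::uint64_t clock_mhz) {
    return loop_microseconds * clock_mhz;
}

// Hypothetical helper: true when the loop meets the rule-of-thumb threshold.
constexpr bool may_benefit_from_parallel_for(std::uint64_t loop_microseconds,
                                             std::uint64_t clock_mhz) {
    return estimated_cycles(loop_microseconds, clock_mhz) >=
           kMinCyclesForParallelFor;
}
```

For the example in the text, estimated_cycles(500, 2000) gives 500 × 2,000 = 1,000,000 cycles, which sits exactly at the threshold. A loop measured at 100 microseconds on the same processor falls well below it.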

The default automatic chunking is recommended for most uses. As with most heuristics, however, there are situations where controlling the chunk size more precisely might yield better performance.
