Visible to Intel only — GUID: GUID-EF88A2B2-B966-428C-A90C-33BA7F70FA87
Package Contents
Parallelizing Simple Loops
Parallelizing Complex Loops
Parallelizing Data Flow and Dependence Graphs
Work Isolation
Exceptions and Cancellation
Containers
Mutual Exclusion
Timing
Memory Allocation
The Task Scheduler
Design Patterns
Migrating from Threading Building Blocks (TBB)
Constrained APIs
Invoke a Callable Object
Appendix A Costs of Time Slicing
Appendix B Mixing With Other Threading Packages
References
parallel_for_each Body semantics and requirements
parallel_sort ranges interface extension
TBB_malloc_replacement_log Function
Parallel Reduction for rvalues
Type-specified message keys for join_node
Scalable Memory Pools
Helper Functions for Expressing Graphs
concurrent_lru_cache
task_group extensions
The customizing mutex type for concurrent_hash_map
Waiting for Single Messages in Flow Graph
Visible to Intel only — GUID: GUID-EF88A2B2-B966-428C-A90C-33BA7F70FA87
Automatic Chunking
A parallel loop construct incurs overhead cost for every chunk of work that it schedules. oneAPI Threading Building Blocks (oneTBB) chooses chunk sizes automatically, depending upon load balancing needs. The heuristic attempts to limit overheads while still providing ample opportunities for load balancing.
CAUTION:
Typically a loop needs to take at least a million clock cycles to make it worth using parallel_for. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
The default automatic chunking is recommended for most uses. As with most heuristics, however, there are situations where controlling the chunk size more precisely might yield better performance.