Avoid Needless Synchronization
For better results, avoid explicit command synchronization primitives, such as clEnqueueMarker and clEnqueueBarrier. Explicit synchronization commands and event tracking cause cross-module round trips, which degrade performance. The fewer explicit synchronization commands you use, the better the performance.
Use the following techniques to reduce explicit synchronization:
- Merge kernels whenever possible. Merging also improves data locality.
- If you need to wait for a kernel to complete before reading its result buffer, let the host continue with other work and wait only at the point where you first need a buffer with results.
- If an in-order queue expresses the dependency chain correctly, use it to define the chain of dependent kernels. In the in-order execution model, the commands in a command queue execute in the order of submission, and each command runs to completion before the next one begins. This is the typical case for a straightforward processing pipeline. Consider the following:
  - Using the blocking OpenCL™ API calls is more effective than explicit synchronization schemes based on OS synchronization primitives.
  - If you are optimizing the kernel pipeline, first measure the kernels separately to find the most time-consuming one. In the final version of the pipeline, avoid calling clFinish or clWaitForEvents frequently, for example, after each kernel invocation. Instead, submit the whole sequence to the in-order queue, and then call clFinish once or wait on a single OpenCL event object. This reduces host-device round trips.