Comparing OpenCL™ and Native Code Performance
When comparing the performance of an OpenCL™ kernel on the CPU device with the performance of native code, make sure that both versions of the code are as similar as possible. Consider the following guidelines:
- Wrap exactly the same set of operations in the timed region of each version.
- Do not include program build time in the kernel execution time. You can amortize this step by precompiling the program and loading it with the clCreateProgramWithBinary call.
- Track data transfer costs separately.
- Use data mapping to make data transfers similar to the way data is passed in native code (by use of pointers). Refer to the Mapping Memory Objects (USE_HOST_PTR) section.
- Ensure the working set is identical for native and OpenCL code.
- Use the same memory access pattern in both versions (for example, do not compare row-wise access in one version with column-wise access in the other).
- Demand the same accuracy. For example, on the CPU device, rsqrt(x) is inherently more accurate than the _mm_rsqrt_ps SSE intrinsic. To get the same accuracy in native code and OpenCL code, do one of the following:
  - Equip _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL™ rsqrt.
  - Use native_rsqrt in your OpenCL™ kernel, which maps exactly to the rsqrtps instruction in the final assembly code.
  - Use the relaxed-math compilation flag (-cl-fast-relaxed-math) to enable similar accuracy for the whole program. As with rsqrt, you can use the relaxed versions of rcp, sqrt, and so on. Refer to the Developer Guide for Intel® SDK for OpenCL™ Applications for the full list of supported functions.
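The Newton-Raphson option for native code can be sketched as follows. This is a minimal illustration, not the SDK's own implementation: the helper names rsqrt_nr_step, rsqrt_refined, and max_rel_error are hypothetical, and the number of iterations needed to match a particular OpenCL™ rsqrt implementation is an assumption you should verify by measurement. The sketch refines the _mm_rsqrt_ps estimate with the standard update y' = y * (1.5 - 0.5 * x * y * y) and measures the error against a correctly rounded square root and divide.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ps, _mm_sqrt_ps, ... */

/* One Newton-Raphson step for y ~ 1/sqrt(x):
 *   y' = y * (1.5 - 0.5 * x * y * y)                                   */
static __m128 rsqrt_nr_step(__m128 x, __m128 y)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

/* Hypothetical helper: hardware estimate refined by `iterations` steps.
 * The raw rsqrtps estimate is documented to have a relative error of at
 * most 1.5 * 2^-12; each step roughly squares that error.              */
static __m128 rsqrt_refined(__m128 x, int iterations)
{
    assert(iterations >= 0);
    __m128 y = _mm_rsqrt_ps(x);
    for (int i = 0; i < iterations; ++i)
        y = rsqrt_nr_step(x, y);
    return y;
}

/* Hypothetical helper: largest relative error over four lanes, measured
 * against 1.0f / sqrt(x) computed with correctly rounded SSE operations. */
static float max_rel_error(const float in[4], int iterations)
{
    __m128 x = _mm_loadu_ps(in);
    float approx[4], ref[4];
    _mm_storeu_ps(approx, rsqrt_refined(x, iterations));
    _mm_storeu_ps(ref, _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x)));

    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float err = (approx[i] - ref[i]) / ref[i];
        if (err < 0.0f)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Calling max_rel_error with 0, 1, and 2 iterations on a few positive inputs shows how quickly the estimate converges. If the refined native version and the OpenCL™ kernel still disagree on accuracy, switching the kernel to native_rsqrt is the simpler way to compare like with like.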