Use Floating Point for Calculations
Intel® Xeon® processors significantly accelerate floating-point calculations on the device.
Consider the following code snippet, which performs the calculations in integer (uint) arithmetic:
__kernel void scale(__constant uchar* srcA,
                    __constant uchar* srcB,
                    const uchar nSaturation,
                    __global uchar* dst)
{
    int offset = get_global_id(0);
    uint tempSrcA = convert_uint(srcA[offset]); // Load one 8-bit value
    uint tempSrcB = convert_uint(srcB[offset]); // Load one 8-bit value
    // some processing
    uint tempDst = (tempSrcA - tempSrcB) * nSaturation;
    // store
    dst[offset] = convert_uchar(tempDst);
}
The following example uses the float equivalent:
__kernel void scale(__constant uchar* srcA,
                    __constant uchar* srcB,
                    const uchar nSaturation,
                    __global uchar* dst)
{
    int offset = get_global_id(0);
    float tempSrcA = convert_float(srcA[offset]); // Load one 8-bit value
    float tempSrcB = convert_float(srcB[offset]); // Load one 8-bit value
    // some processing
    float tempDst = (tempSrcA - tempSrcB) * nSaturation;
    // store
    dst[offset] = convert_uchar(tempDst);
}
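For inputs where srcA is at least srcB and the scaled difference fits in 8 bits, the two kernels above should produce the same result. The following host-side sketch in plain C (not OpenCL C) models the per-element arithmetic of each variant. The function names `scale_uint_ref` and `scale_float_ref` are hypothetical, and the C casts are assumed to stand in for `convert_uint`, `convert_float`, and `convert_uchar` on in-range values only.

```c
#include <stdint.h>

/* Hypothetical host model of one work-item of the integer kernel. */
uint8_t scale_uint_ref(uint8_t srcA, uint8_t srcB, uint8_t nSaturation)
{
    uint32_t tempSrcA = (uint32_t)srcA;
    uint32_t tempSrcB = (uint32_t)srcB;
    /* Unsigned arithmetic: wraps around if srcA < srcB. */
    uint32_t tempDst = (tempSrcA - tempSrcB) * nSaturation;
    /* Keeps only the low 8 bits of the result. */
    return (uint8_t)tempDst;
}

/* Hypothetical host model of one work-item of the float kernel. */
uint8_t scale_float_ref(uint8_t srcA, uint8_t srcB, uint8_t nSaturation)
{
    float tempSrcA = (float)srcA;
    float tempSrcB = (float)srcB;
    float tempDst = (tempSrcA - tempSrcB) * (float)nSaturation;
    /* Truncates toward zero; only meaningful for values in 0..255. */
    return (uint8_t)tempDst;
}
```

For example, with srcA = 200, srcB = 190, and nSaturation = 3, both functions return 30. Outside the stated range the variants can differ, so a reference like this is only a sanity check for in-range data.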
Using built-in functions improves performance. See the Use Built-In Functions section for more information.
NOTE: The compiler can automatically fuse multiplies and adds. Use the -cl-mad-enable compiler flag to enable this optimization when compiling. Still, using the explicit mad built-in ensures that the operation maps directly to the efficient instruction.
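As a rough illustration of what the mad built-in computes, the sketch below models `mad(a, b, c)` as `a * b + c` in plain C (not OpenCL C). The helper `mad_ref`, the function `scale_bias_ref`, and the `bias` term are hypothetical additions for illustration only; the kernels in this section have no additive term, and the precision of the real built-in is left to the implementation.

```c
#include <stdint.h>

/* Hypothetical stand-in for the OpenCL C built-in mad(a, b, c),
   which approximates a * b + c. */
float mad_ref(float a, float b, float c)
{
    return a * b + c;
}

/* Hypothetical variant of the float kernel body with an added bias term,
   so that there is a multiply followed by an add to fuse. */
uint8_t scale_bias_ref(uint8_t srcA, uint8_t srcB, uint8_t nSaturation, float bias)
{
    float tempSrcA = (float)srcA;
    float tempSrcB = (float)srcB;
    /* In OpenCL C this line would call the built-in directly:
       mad(tempSrcA - tempSrcB, convert_float(nSaturation), bias) */
    float tempDst = mad_ref(tempSrcA - tempSrcB, (float)nSaturation, bias);
    /* Only meaningful when the result lies in 0..255. */
    return (uint8_t)tempDst;
}
```

Writing the multiply-add as a single call states the intent in the source, whereas the -cl-mad-enable flag leaves the decision to the compiler.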