Getting Credible Performance Numbers
Performance is typically measured over a large number of invocations of the same routine. Because the first iteration is almost always significantly slower than the subsequent ones, the minimum (or the average, geometric mean, and so on) of the execution times is usually used for final projections.
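The measurement scheme described above can be sketched in a few lines. This is a minimal illustration, not code from the SDK: `run_kernel` is a hypothetical callable assumed to launch the kernel and block until it completes, and `time_invocations` and `summarize` are names invented for this example.

```python
import statistics
import time


def time_invocations(run_kernel, invocations=50):
    """Time each call separately and return the per-call durations in seconds.

    `run_kernel` is a hypothetical stand-in for "launch the kernel and wait
    for it to finish"; it must block, otherwise only the launch cost is timed.
    """
    samples = []
    for _ in range(invocations):
        start = time.perf_counter()
        run_kernel()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce per-call timings to the statistics commonly used for projections."""
    return {
        "first": samples[0],  # usually the slowest: it includes one-time costs
        "min": min(samples),
        "mean": statistics.fmean(samples),
        # geometric_mean requires strictly positive samples
        "geomean": statistics.geometric_mean(samples),
    }
```

Comparing `first` against `min` gives a quick indication of how much the one-time costs contribute to the first invocation.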
An alternative to calling the kernel several times is to use a single “warm-up” run.
The warm-up run can be helpful for kernels with a small amount of computation, as it absorbs the following potential one-time costs so that they do not distort the measured run:
- Bringing data to the cache
- Lazy object creation
- Delayed initializations
- Other costs incurred by the OpenCL™ runtime
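A warm-up run amounts to one untimed call before the timer starts. The sketch below assumes a hypothetical blocking callable `run_kernel` that represents the kernel launch plus the wait for completion; `time_with_warmup` is a name invented for this example and is not taken from the SDK sample.

```python
import time


def time_with_warmup(run_kernel, iterations=10):
    """Return the average time per call in seconds, excluding one warm-up call.

    `run_kernel` is a hypothetical blocking callable standing in for the
    kernel launch and the wait for its completion.
    """
    # Warm-up: untimed, so cache fills, lazy object creation, and delayed
    # initializations happen before the measurement starts.
    run_kernel()

    start = time.perf_counter()
    for _ in range(iterations):
        run_kernel()
    elapsed = time.perf_counter() - start
    return elapsed / iterations
```

With `iterations=1` this reduces to the single warm-up run followed by a single measured run.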
Consider the following:
- For bandwidth-limited kernels operating on data that does not fit in the last-level cache, the warm-up run does not significantly improve the stability of the measurement.
- For a kernel that executes a small number of instructions over a small data set, make sure there is a sufficient number of iterations so that the kernel run time is at least 20 milliseconds on a CPU device.
- Kernels with shorter run times might produce unreliable data, so artificially increasing the amount of computation can give you important insights into the hotspots. For example, you can add a loop to the kernel or replicate some of its pieces.
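One way to reach the 20-millisecond guideline is to keep doubling the repetition count until the timed batch is long enough. The sketch below reads "iterations" as repeated blocking calls issued from the host, which is an assumption; the same idea applies to a loop added inside the kernel. `run_kernel`, `timed_batch`, and `calibrate_iterations` are hypothetical names used only for illustration.

```python
import time

MIN_RUN_SECONDS = 0.020  # the 20 ms guideline for a CPU device


def timed_batch(run_kernel, iterations):
    """Time `iterations` back-to-back blocking calls and return total seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_kernel()
    return time.perf_counter() - start


def calibrate_iterations(run_kernel, min_seconds=MIN_RUN_SECONDS,
                         max_iterations=1 << 20):
    """Double the iteration count until the batch runs for at least min_seconds.

    Returns (iterations, elapsed_seconds). `max_iterations` is a safety cap
    (an assumption of this sketch) so that calibration always terminates.
    """
    iterations = 1
    while True:
        elapsed = timed_batch(run_kernel, iterations)
        if elapsed >= min_seconds or iterations >= max_iterations:
            return iterations, elapsed
        iterations *= 2
```

Dividing the returned elapsed time by the returned iteration count gives the per-call estimate for the calibrated batch.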
Refer to the “OpenCL™ Optimizations Tutorial” SDK sample for code examples of performing a warm-up run before starting performance measurements.