Quick step to improve the model load time on GPU
Loading an input model's Intermediate Representation (IR) to GPU takes longer than loading the same model to a CPU.
Manually create a cl_cache directory in the working directory of your application.
The driver will use this directory to store the binary representations of the compiled kernels. This works on all supported operating systems.
Refer to this article for more information on managing the cl_cache.
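As an illustration, here is a minimal sketch that creates the cl_cache directory and then loads an IR model to the GPU, assuming the OpenVINO™ Inference Engine Python API (IECore); the model.xml and model.bin file names are placeholders for your own model.

    import os
    from openvino.inference_engine import IECore

    # Create the cl_cache directory in the application's working directory.
    # The driver stores the compiled-kernel binaries here on the first load.
    os.makedirs("cl_cache", exist_ok=True)

    ie = IECore()

    # Placeholder IR file names -- substitute the paths to your own model.
    net = ie.read_network(model="model.xml", weights="model.bin")

    # Compiles the OpenCL kernels on the first run; later runs of the
    # application reuse the binaries cached in cl_cache.
    exec_net = ie.load_network(network=net, device_name="GPU")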
Loading a model in Intermediate Representation (IR) format to the GPU takes longer than loading the same model to the CPU because the GPU stack is based on OpenCL*, and the load time depends on how long it takes to compile the OpenCL* kernels.
With cl_cache enabled, the first load of the model still takes a long time because the OpenCL* kernels must be compiled and cached. Every subsequent load of the same model is much faster, because the driver reuses the cached binaries instead of recompiling the kernels.
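To verify the speedup, you can time the load step yourself. The sketch below (again assuming the IECore Python API and placeholder model.xml and model.bin files) prints how long load_network takes; run the script twice from the same working directory and compare the first run, which compiles and caches the kernels, with the second, which reuses the binaries from cl_cache.

    import os
    import time
    from openvino.inference_engine import IECore

    os.makedirs("cl_cache", exist_ok=True)  # enable kernel caching

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder paths

    # Measure how long it takes to load (compile) the model for the GPU.
    start = time.perf_counter()
    exec_net = ie.load_network(network=net, device_name="GPU")
    print(f"GPU load time: {time.perf_counter() - start:.2f} s")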