Tutorial
Find Hotspots (L3 Cache)
The following steps show how to find hotspots at the L3 cache level.
As an example, the tutorial uses the sample application, tcc_cache_allocation_sample. Although this sample is already tuned with the cache allocation library, it can simulate an untuned application when configured to allocate the buffer in DRAM.
First, you will run the sample using DRAM and observe the number of cache misses. Then you will run the sample again, using a buffer in L3 cache. You can expect to see fewer cache misses.
Run the Sample Using DRAM (Baseline)
Make sure that you can connect from your host system to the target system over SSH. To get the IP address of the target, run ifconfig on the target.
On the target system, open a terminal window and run the Linux* tool stress-ng as a noisy neighbor:
taskset -c 3 stress-ng -C 10 --cache-level 3
On the host system, launch VTune™ Profiler and create a project.
In the WHERE section, specify the target system as follows:
Click the Browse button.
Select Remote Linux (SSH).
For SSH destination, specify the address of the target system as root@<IP address> or root@<hostname>.
Click the Deploy button if required.
In the WHAT section, specify the following information to run the sample:
For Application, type tcc_cache_allocation_sample.
For Application parameters, type --latency 300 --sleep 100000000.
When the HOW section is visible, configure the analysis as follows:
Click the Browse button.
Under Microarchitecture, select Memory Access.
Click the Copy button to customize the analysis.
Optional: Select Collect stacks.
Under Events configured for CPU, select the L3 miss event for your processor: MEM_LOAD_RETIRED.L3_MISS for 11th Gen Intel® Core™ and Intel® Xeon® W-11000E Series processors, or LONGEST_LAT_CACHE.MISS for Intel Atom® x6000E Series processors. Set Sample After to 2000 and deselect all other settings, because interrupts caused by extra VTune™ Profiler counters may affect the performance of the real-time system.
Optional: Select Analyze loops to collect advanced information such as instruction set usage, and display analysis results by loops and functions.
Optional: Scroll down and select Analyze memory objects.
Click the Start button to run the analysis.
Go to the Event Count tab.
Maximize the screen if it is smaller than the full size of your monitor.
Select Grouping by Task Type/Function/Call Stack.
At the top of the screen, find the MEM_LOAD_RETIRED.L3_MISS column (or the LONGEST_LAT_CACHE.MISS column on Intel Atom® x6000E Series processors). Click the column header to sort the rows by number of cache misses. In this example, the function pointer_chase_read_workload_internal is at the top of the list with 42,000 misses. The function is therefore the hotspot for this type of event, and its buffer is a candidate for the cache allocation library.
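The miss count in the Event Count tab is an estimate derived from samples, not an exact tally. With Sample After set to 2000, one sample is recorded per 2000 occurrences of the event, and the reported event count is approximately the number of samples multiplied by that value. The following shell sketch shows the arithmetic for the 42,000 figure. The helper name samples_for_events is hypothetical and is not part of VTune™ Profiler or the sample.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many samples back a reported event count.
# Assumes: event count ~= number of samples * Sample After value.
samples_for_events() {
    event_count=$1
    sample_after=$2
    if [ "$sample_after" -le 0 ]; then
        echo "sample_after must be positive" >&2
        return 1
    fi
    # Integer division; any remainder is discarded.
    echo $(( event_count / sample_after ))
}

# 42,000 reported misses with Sample After set to 2000
samples_for_events 42000 2000   # prints 21
```

A count backed by only about 21 samples is coarse, so some run-to-run variation in the reported number is to be expected.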
Now follow the instructions below to run the sample using the cache allocation library to allocate a buffer in L3 cache. Compare the results.
Run the Sample Using L3 Cache
At the top left of the screen, click the Configure Analysis button.
In the WHAT section, change the Application parameters to --latency 110 --sleep 100000000. With these parameters, the sample allocates the buffer in L3 cache.
Click the Start button to run the analysis.
After the analysis is complete, go to the Event Count tab.
With the buffer in L3 cache, the function pointer_chase_read_workload_internal shows fewer cache misses than in the baseline run. The function might not appear in the list at all, as in the screenshot below, because no misses were recorded for it.
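To put a number on the comparison, you can compute the percentage drop in misses between the two runs. The following shell sketch uses the 42,000 baseline from the DRAM run. The second value is an assumption: 0 stands for a function that no longer appears in the list. The helper name miss_reduction_percent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: percentage drop in cache misses from the
# baseline (DRAM) run to the tuned (L3 cache buffer) run.
miss_reduction_percent() {
    baseline=$1
    tuned=$2
    if [ "$baseline" -le 0 ]; then
        echo "baseline must be positive" >&2
        return 1
    fi
    # Integer arithmetic; the result is truncated to a whole percent.
    echo $(( (baseline - tuned) * 100 / baseline ))
}

# Baseline from the DRAM run; 0 is an assumed value for "not in the list".
miss_reduction_percent 42000 0   # prints 100
```

Read the miss count for the same function and the same event column in both runs, and substitute your own numbers for the two arguments.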