Address Memory Bandwidth Bottlenecks
This topic is part of a tutorial that shows how to use the automated Roofline chart to make prioritized optimization decisions.
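Perform the following steps: open a result snapshot, focus the Roofline chart on the data of most interest, and interpret the Roofline chart data.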
Key take-aways from these steps:
Memory bandwidth bottlenecks are generally overcome with cache optimizations.
Check data in other Intel Advisor views to support your Roofline chart interpretation.
These steps use a prepackaged analysis result, both to keep the tutorial short and because a freshly collected result would vary with your hardware.
Open a Result Snapshot
Do one of the following:
If you prefer to work in the standalone GUI, from the File menu, choose Open > Result and choose the Result1.advixeexpz result.
If you prefer to work in the Visual Studio* IDE, from the File menu, choose Open > File and choose the Result1.advixeexpz result.
Focus the Roofline Chart on the Data of Most Interest
Use the display toggles to show the Roofline chart and Survey Report side by side.
On the Intel Advisor toolbar, click the Loops And Functions filter drop-down and choose Loops.
In the Roofline chart:
Select the Use Single-Threaded Loops checkbox.
Click the control, then deselect the Visibility checkbox for all SP... roofs. (All variables in this sample code are double-precision, so there is no need to clutter the chart with single-precision rooflines.)
In the Point Colorization section, choose Colors of Point Weight Ranges to differentiate dot colors by runtime (red, yellow, and green).
Save your changes.
Click the control. In the x-axis fields, backspace over the existing values and enter 0.1 and 0.4. In the y-axis fields, backspace over the existing values and enter 7.4 and 45.5. Click the button to save your changes.
Interpret Roofline Chart Data
In the Roofline chart, notice the dot representing the loop in main at roofline.cpp:295 (the lower dot). It is positioned above the (offscreen) Scalar Add Peak roofline and on the L2 Bandwidth roofline.
Why is the dot positioned there?
The probable answer: Loop performance is limited by a memory bandwidth bottleneck involving L2 cache.
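The reasoning behind that answer follows the general roofline model: attainable performance is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The following is a minimal sketch of that relationship, not Intel Advisor code. The function names and the numbers in the usage note are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: attainable performance in GFLOP/s for one compute
// roof (peak_gflops) and one memory roof (bandwidth_gbs, in GB/s), given
// the loop's arithmetic intensity in FLOP/byte.
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double intensity) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity > 0.0);
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// True when the memory roof, rather than the compute roof, caps the loop.
bool is_bandwidth_bound(double peak_gflops, double bandwidth_gbs,
                        double intensity) {
    return bandwidth_gbs * intensity < peak_gflops;
}
```

For example, with an assumed 100 GFLOP/s compute roof, an assumed 50 GB/s memory roof, and an intensity of 0.5 FLOP/byte, attainable_gflops returns 25 and is_bandwidth_bound returns true. A dot that sits on a bandwidth roofline corresponds to this bandwidth-bound case.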
How can we verify this?
Check the Survey Report:
Notice that the Vectorized Loops/Efficiency value for the loop in main at roofline.cpp:295 is 100%.
This 100% vectorization efficiency is why the dot is above the offscreen Scalar Add Peak roofline.
Click the data row for the loop in main at roofline.cpp:295 to view the associated source code in the Source tab.
In the Source tab, scroll to source code lines 89-96 to view the associated data structure definition: Structure of Arrays (SOA).
SOA is a good data layout for vectorization efficiency; however, our familiarity with the sample code tells us this data layout is preventing the tutorial dataset from fitting into L1 cache and causing many loads from L2 cache. (For details on why this is happening, check out this video: Roofline Analysis in Intel® Advisor 2017.)
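To make the SOA trade-off concrete, here is a minimal sketch of a Structure of Arrays layout. It is not the definition from roofline.cpp: the type PointsSoA, its fields, and the helper functions are hypothetical, and the 32 KiB L1 data cache size used in the usage note is an assumption that varies by processor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SOA layout: one contiguous array per field.
struct PointsSoA {
    std::vector<double> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Bytes of field data that one full pass over the structure touches.
std::size_t working_set_bytes(const PointsSoA& p) {
    return 3 * p.size() * sizeof(double);
}

// Simple capacity check against an assumed cache size in bytes.
bool fits_in_cache(std::size_t bytes, std::size_t cache_bytes) {
    assert(cache_bytes > 0);
    return bytes <= cache_bytes;
}

// Unit-stride access to each array is easy to vectorize, but every pass
// streams the entire working set through the cache hierarchy.
void scale(PointsSoA& p, double a) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p.x[i] *= a;
        p.y[i] *= a;
        p.z[i] *= a;
    }
}
```

With 4096 elements per field, working_set_bytes reports 3 * 4096 * sizeof(double) bytes, which exceeds an assumed 32 KiB L1 data cache, so repeated passes would be served from a lower cache level even though the loop vectorizes well.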
So the loop in main at roofline.cpp:295 is positioned on the L2 Bandwidth roofline because loop performance is indeed limited by a memory bandwidth bottleneck involving L2 cache.
How can we eliminate this memory bandwidth bottleneck?
One possible technique is to reorganize the code so that it makes better use of cache.
The loop in main at roofline.cpp:310 does this very thing, which is why the corresponding dot (upper dot in the Roofline chart) is positioned above the L2 Bandwidth roofline:
In the Survey Report, click the data row for the loop in main at roofline.cpp:310.
In the Source tab, scroll to source code lines 97-101 to view the data structure definition for this loop: Array of Structure of Arrays (AOSOA). Our familiarity with the sample code tells us that when the loop in main at roofline.cpp:310 uses the AOSOA data layout, the tutorial workload is split into two steps, and each step has a dataset that fits into L1 cache.
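The following sketch illustrates the general AOSOA idea: an array of small structures, each holding short fixed-length arrays, so that work can proceed one cache-sized tile at a time. It is a generic illustration and not the definition or the loop from roofline.cpp. The tile length kTile, the type names, and the two example passes are hypothetical, and the tile is sized for an assumed 32 KiB L1 data cache.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 1024;  // hypothetical tile length

// One tile keeps SOA-style unit-stride arrays, but only kTile elements each.
struct Tile {
    std::array<double, kTile> x, y, z;
};

// The outer array of tiles is what makes the layout "array of SOA".
using PointsAoSoA = std::vector<Tile>;

// Map a global element index to its tile and its position inside the tile.
double& x_at(PointsAoSoA& p, std::size_t i) {
    assert(i < p.size() * kTile);
    return p[i / kTile].x[i % kTile];
}

// Two passes per tile: the second pass reuses data the first pass just
// touched, while the tile is still small enough to stay in cache.
void two_steps(PointsAoSoA& p, double a, double b) {
    for (Tile& t : p) {
        for (std::size_t i = 0; i < kTile; ++i) {
            t.x[i] *= a;
            t.y[i] *= a;
            t.z[i] *= a;
        }
        for (std::size_t i = 0; i < kTile; ++i) {
            t.x[i] += b;
            t.y[i] += b;
            t.z[i] += b;
        }
    }
}
```

Each inner loop is still unit-stride, so it remains as easy to vectorize as the plain SOA version. The difference is that each tile's data is small enough to fit in the assumed L1 cache size, which is the kind of reorganization that can lift a loop above the L2 Bandwidth roofline.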