Performance, Portability, and Productivity
This course uses oneAPI and Data Parallel C++ (DPC++) to demonstrate how to achieve performant, portable code across the different platforms available on Intel® Developer Cloud.
Overview
Developers of high-performance computing applications face an increasingly diverse range of computing platforms featuring multiple generations of CPUs, GPUs, FPGAs, and other accelerators. Developing code that is both performant and portable across such a diverse set of platforms can be expensive and time-consuming.
Objectives
Who is this for?
This course is designed for developers who are familiar with SYCL* and who develop code that is expected to perform well in a heterogeneous environment. For a primer on SYCL, take the Essentials of SYCL course.
What will I be able to do?
You will be able to apply the following examples and techniques to your own algorithms:
- Explore general matrix multiply (GEMM) algorithm examples using DPC++.
- Use several techniques to measure the effectiveness of applications across platforms.
- Use timer functions inside applications to measure kernel and compute times.
- Use kernel and compute time measurements to calculate the relative efficiency of each implementation and identify the best one.
- Use Roofline analysis and the Intel® VTune™ Profiler to measure performance across platforms.
Modules
Introduction
- Explain how the oneAPI programming model can solve the challenges of programming in a heterogeneous world.
- Understand the SYCL programming model.
- Gain familiarity with Intel® oneAPI Math Kernel Library (oneMKL) and be able to use it for a two-dimensional GEMM algorithm.
Basic GEMM and Analysis Tools
- Use a basic GEMM application as the baseline for performance enhancements.
- Identify the value of improved software architecture to minimize code changes.
- Interpret Roofline and Intel VTune Profiler analysis results to measure the performance of the GEMM applications.
NDRange Implementation for Matrix Multiplication
- Understand how NDRange improves parallelism over the basic parallel kernel implementation.
- Explain why local memory with NDRange can be advantageous.
- Explain the differences between work-groups and work-items.
Subgroups and Subgroups Using Local Memory
- Understand the advantages of using subgroups in DPC++.
- Take advantage of subgroups in an NDRange kernel implementation.
- Take advantage of subgroups and local memory in an NDRange kernel implementation.
Local Memory Implementation for Matrix Multiplication
- Describe and articulate the advantages of local memory access over global memory access.
- Implement local memory in the matrix multiplication kernel.
Analysis of the Algorithms So Far
- Review Roofline analysis and Intel VTune Profiler summary and performance results to determine the effectiveness of each algorithm implementation across platforms.
- Understand the difference between compute and kernel measurements.
- Articulate why a single work-group size is not ideal across platforms.
Code Parameterization
- Articulate how to determine optimal work-group size based on an algorithm.
- Use SYCL to query for maximum work-group size and maximum number of compute units.
- Recognize the tradeoffs of using a library versus your own implementation.
Intel® VTune™ Profiler Analysis with Local GUI
- Install the correct version of Intel VTune Profiler locally.
- Collect Intel VTune Profiler analysis information from the remote server.
- Profile for both GPUs and CPUs.
- Understand how to move collected data to a local system for analysis.
- Know where to go for additional Intel VTune Profiler training.