Intel® oneAPI DPC++/C++ Compiler Boosts PyTorch* Inductor Performance on Windows* for CPU Devices

Stay in the Know on All Things CODE

author-image

By

Introduction

The performance of PyTorch-based AI applications and scalability across different platforms can greatly benefit from native C++ kernels that they use, taking full advantage of the underlying hardware’s capabilities. This can be achieved by building these native kernels with compilers that support the latest architecture extensions and code-path optimizations.  That way, you can ensure the desired level of performance and responsiveness of your AI application on the hardware platform of your choice.

In this brief discussion, we will discuss how to achieve this on a 12th Gen. Intel® Core™ processor-based computer running Microsft Windows*. We will build Python reference benchmark applications using the Microsoft* Visual C++ Compiler as well as the Intel oneAPI DPC++/C++ Compiler.

TorchInductor is the default compiler backend of PyTorch* 2.x.[1] compilation that compiles a PyTorch program into C++ kernels for CPU.  It offers a significant speedup over eager mode execution through graph-level optimization powered default Inductor backend. Compilers applied to C++ kernels in TorchInductor can play an important role in accelerating compute-intensive AI and ML workloads and scaling them on x86 platforms.

In this tutorial, we introduce the steps for utilizing Intel® oneAPI DPC++/C++ Compiler [2] to speed up TorchInductor (C++ kernels) on Windows for CPU devices. We also provide some performance testing results for popular PyTorch benchmarks (i.e., Torchbench*, HuggingFace*, and TIMM* models) on typical x86 client machines.

Software Installation

  1. Download and install : Microsoft* Visual Studio Community 2022
    Install runtime libraries from Microsoft Visual C++
    Download link: https://visualstudio.microsoft.com/downloads/
    Select "Desktop development with C++" and install.

     
  2. Download and install the Intel oneAPI DPC++/C++ Compiler:
    Download link: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
    Select "Intel oneAPI Base Toolkit for Windows* (64 Bit)" and install.

     
  3. Download and install Miniforge 64-bit for Windows
    Download Link: https://conda-forge.org/miniforge/
    Select "Miniforge3 Windows” and install.

Setup the Intel® oneAPI DPC++/C++ Compiler with the PyTorch Environment

  1. Launch build console:
    Launch the “Intel oneAPI command prompt for Intel 64 for Visual Studio 2022” console.

     
  2. Activate Conda-Forge:
    Conda-Forge is a powerful command line tool for package and environment management that runs on Windows.

     
  3. Create a Conda-Forge space for use with the Intel® oneAPI DPC++/C++ Compiler:
    Create a separate compiler environment to install PyTorch
    conda create -n icx python==3.11
    conda activate icx


     
  4. Install PyTorch 2.5 or later:
    Note: only version 2.5 and above support Windows PyTorch Inductor.
    pip install torch

Run PyTorch with Intel® oneAPI DPC++/C++ Compiler

  1. Set Intel® oneAPI DPC++/C++ Compiler as Windows Inductor C++ Compiler via environment variable: “CXX” as below.  The Microsoft Visual C++ Compiler is the default compiler for PyTorch, and PyTorch expects the compiler flag syntax on Windows to be the same as this default. Therefore, we need to use the “icx-cl” compiler command line driver as described in the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference.

     
  2. Using TorchInductor with Intel® oneAPI DPC++/C++ Compiler on Windows CPU: Here’s a simple example to demonstrate how to use Intel® oneAPI DPC++/C++ Compiler for TorchInductor.

Use Intel® oneAPI DPC++/C++ Compiler to Boost PyTorch Performance

The Intel® oneAPI DPC++/C++ Compiler deploys industry-leading compiler technology to generate an optimized binary for x86 and provide higher performance for PyTorch Inductor. We test popular PyTorch benchmarks: Torchbench*, HuggingFace*, and TIMM* models as below:

Speedup = Inductor_Perf/Eager_Perf

 

Default Wrapper Speedup Over Eager Mode

(Higher is Better)

Workload

Microsoft Visual C++ Compiler (default)

Intel® oneAPI DPC++/C++ Compiler

Torchbench

1.132

1.389

HuggingFace

0.911

1.351

Timm_models

1.365

1.720

 

  1. Torchbench Performance Results:

     
  2. HuggingFace Performance Results:

     
  3. Timm_models Performance Results:

Product and Performance Information

Testing Date: Performance results are based on testing by Intel as of October 11, 2024 and may not reflect all publicly available security updates.

Configuration Details:

OS version: Windows 11 Enterprise Version 22H2 (OS Build 22621.4169)

Compiler versions:

Microsoft Visual C++ Compiler: 19.41.34120

Intel oneAPI DPC++/C++ Compiler: 2024.2.1

Python version: 3.11.0

Torch version: 2.5.0+cpu

Machine configurations: 12th Gen Intel® Core™ i9-12900K Processor, Performance-core Base Frequency: 3.20 GHz, Sockets: 1, Total Cores: 16, Total Threads: 24, Virtualization: Enabled, L1 cache: 1.4 MB, L2 cache: 14.0 MB, L3 cache: 30.0 MB, Memory: 32.0 GB, Speed: 4800 MT/s, Slots used: 2 of 4, Form factor: DIMM

Conclusion

This tutorial shows how to use the Intel® oneAPI DPC++/C++ Compiler to speed up PyTorch* Inductor on Windows* for CPU devices. In addition, we share some performance testing results for popular PyTorch benchmarks (i.e., Torchbench*, HuggingFace*, and TIMM* models) on typical x86 Windows client machines with both Microsoft Visual C++ Compiler and Intel® oneAPI DPC++/C++ Compiler.

Download the Compiler Now 

You can download the Intel oneAPI DPC++/C++ Compiler on Intel’s oneAPI Developer Tools product page

This version is also in the Intel® oneAPI Base Toolkit, which includes an advanced set of foundational tools, libraries, analysis, debug and code migration tools.

You may also want to check out our contributions to the LLVM compiler project on GitHub.

Additional Resources

[1] get-started pytorch-2.0

[2] Intel® oneAPI DPC++/C++ Compiler

[3] Get the Intel® oneAPI Base Toolkit

[4] Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference

Notices & Disclaimers:

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI Disclaimer:
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC. Results may vary.