Increasing AI Performance and Efficiency with Intel® DL Boost

Intel® Deep Learning Boost (Intel® DL Boost) is a group of acceleration features introduced in our 2nd Generation Intel® Xeon® Scalable processors. It provides significant performance increases to inference applications.


By Huma Abidi

"PyTorch acceleration baked into the latest generation of Intel Xeons. That will help speed up the 200 trillion predictions and 6 billion translations Facebook does every day."

Yann LeCun (https://facebook.com/yann.lecun/pos...)

Intel® Deep Learning Boost (Intel® DL Boost) is a group of acceleration features introduced in our 2nd Generation Intel® Xeon® Scalable processors. It provides significant performance increases to inference applications built using leading deep learning frameworks such as PyTorch*, TensorFlow*, MXNet*, PaddlePaddle*, and Caffe*. [1]

Understanding Intel® Deep Learning Boost

Intel DL Boost follows a long history of Intel adding acceleration features to its hardware to increase the performance of targeted workloads. The initial Intel Xeon Scalable processors included a 512-bit-wide Fused Multiply Add (FMA) instruction as part of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set they introduced. This FMA instruction increased data parallelism and helped Intel Xeon Scalable processors deliver up to 5.7x the performance for AI and deep learning applications. [2]

With Intel DL Boost, we build upon this foundation to further accelerate AI on Intel® architecture. The first of several innovations planned for Intel DL Boost is the set of Vector Neural Network Instructions (VNNI), which offer two main benefits to deep learning applications:

  • VNNI use a single instruction for deep learning computations that formerly required three separate instructions: the fused VPDPBUSD instruction replaces the VPMADDUBSW, VPMADDWD, and VPADDD sequence. As you would expect, executing one instruction in place of three yields significant performance benefits (see the sketch after this list).
  • VNNI enable INT8 deep learning inference. INT8’s lower precision increases power efficiency by decreasing compute and memory bandwidth requirements, and INT8 inference has produced significant performance benefits with little loss of accuracy. [3] Intel DL Boost and the related tools we provide significantly ease the use of INT8 inference and are accelerating its adoption across the broader industry.
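
To make the arithmetic concrete, here is a minimal NumPy sketch (not Intel code) of the operation VNNI accelerates in hardware: multiplying unsigned 8-bit activations by signed 8-bit weights and accumulating the products into 32-bit integers. The array size matches the 64 INT8 values that fit in one 512-bit register; everything else is illustrative.

    import numpy as np

    # Illustrative INT8 dot product with 32-bit accumulation: the core operation
    # a single VNNI instruction (VPDPBUSD) performs on the 64 activation/weight
    # pairs held in one 512-bit register.
    activations = np.random.randint(0, 256, size=64, dtype=np.uint8)   # unsigned INT8
    weights = np.random.randint(-128, 128, size=64, dtype=np.int8)     # signed INT8

    # Widen to int32 before multiplying so the products and the running sum
    # cannot overflow, mirroring the instruction's 32-bit accumulator.
    acc = np.sum(activations.astype(np.int32) * weights.astype(np.int32), dtype=np.int32)
    print(acc)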

Intel DL Boost contributes to a theoretical peak speedup of 4x for INT8 inference on 2nd Gen Intel Xeon Scalable processors in comparison to FP32 inference, [4] since each 512-bit register holds four times as many INT8 values as FP32 values. Further, with 56 cores per socket, we predict that Intel Xeon Platinum 9200 processors will deliver up to twice the performance of Intel Xeon Platinum 8200 processors. [5] In fact, Intel Xeon Platinum 9282 processors recently demonstrated the ability to exceed the performance of NVIDIA* Tesla* V100 on ResNet-50 inference.

You can learn more about the technical details of VNNI from our recent intel.ai blog on the topic.

Why Intel DL Boost?

The rapid proliferation of AI inference services, the need for these services to render results quickly, and the tendency for increasingly complex deep learning applications to be processor-intensive are helping drive unprecedented demand for high-performance, low-latency compute. It is often easiest and most efficient to meet this demand with IT infrastructure already in place or already familiar – the Intel Xeon Scalable processor-based systems trusted for so many other workloads. Fortunately, as customers and researchers have shown time and again, Intel architecture makes a highly performant platform for AI inference.

With Intel DL Boost, Intel architecture and 2nd Gen Intel Xeon Scalable processors are a more capable AI inference platform than ever before. Even better, Intel DL Boost’s innovation will continue in the next generation of Intel Xeon Scalable processors, in which we will introduce support for the bfloat16 floating-point format. Look for more information on this soon.

Getting Started with Intel DL Boost

We have been working with the AI community to optimize the most popular open source deep learning frameworks for Intel DL Boost to help developers benefit from the performance and efficiency gains it provides.

Developers can use tools Intel offers to convert an FP32-trained model to an INT8 quantized model. When run on 2nd Gen Intel Xeon Scalable processors in place of the earlier FP32 model, this INT8 model benefits from Intel DL Boost acceleration during inference. As additional support, Intel also provides a Model Zoo, which includes INT8 quantized versions of many pre-trained models, such as ResNet-101, Faster R-CNN, and Wide & Deep. We hope these models and tools get you up and running with Intel DL Boost more quickly.
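
As a rough illustration of what such a conversion does under the hood, the sketch below applies simple symmetric, per-tensor post-training quantization to a weight array with NumPy. Intel's tools also calibrate activation ranges and handle per-channel scales and other details; the formula here is just the textbook version, and all names are placeholders.

    import numpy as np

    def quantize_symmetric(weights_fp32):
        """Map FP32 weights to INT8 using a single per-tensor scale."""
        scale = np.abs(weights_fp32).max() / 127.0           # largest magnitude maps to 127
        q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an FP32 approximation to estimate the quantization error."""
        return q.astype(np.float32) * scale

    weights = np.random.randn(64, 64).astype(np.float32)     # stand-in for a trained layer
    q, scale = quantize_symmetric(weights)
    max_error = np.abs(weights - dequantize(q, scale)).max()
    print(f"scale={scale:.6f}, max abs quantization error={max_error:.6f}")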

Intel DL Boost in Action

Several earlier intel.ai blogs provide more details on Intel DL Boost integrations into various popular deep learning frameworks and results customers are seeing from early use of these optimizations.

TensorFlow: Our blog on accelerating TensorFlow inference explains how to use the Intel AI Quantization Tools for TensorFlow to convert a pre-trained FP32 model to a quantized INT8 model. Several pre-trained INT8 quantized models for TensorFlow are included in the Intel Model Zoo in categories such as image recognition, object detection, and recommendation systems. Using our pre-trained INT8 ResNet-50 model on 2nd Gen Intel Xeon Scalable processors with Intel DL Boost, Dell EMC has reported a greater than 3x performance improvement over the initial Intel Xeon Scalable processors. [6]
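
As a hedged sketch of how one of those pre-trained INT8 models might be run, the snippet below loads a frozen TensorFlow graph and feeds it a dummy batch. The file name and tensor names are placeholders, not the Model Zoo's actual ones; consult the Model Zoo documentation for the exact launch instructions.

    import numpy as np
    import tensorflow as tf

    # Placeholder path to an INT8 frozen graph downloaded from the Intel Model Zoo.
    GRAPH_PATH = "resnet50_int8_pretrained_model.pb"

    # Read the serialized GraphDef and import it into a fresh graph.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(GRAPH_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name="")

    # Input/output tensor names are assumptions; inspect the graph to confirm them.
    input_tensor = graph.get_tensor_by_name("input:0")
    output_tensor = graph.get_tensor_by_name("predict:0")

    with tf.compat.v1.Session(graph=graph) as sess:
        batch = np.random.rand(1, 224, 224, 3).astype(np.float32)   # dummy image batch
        predictions = sess.run(output_tensor, feed_dict={input_tensor: batch})
        print(predictions.shape)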

PyTorch: Intel and Facebook have partnered to increase PyTorch performance with Intel DL Boost and other optimizations. With Intel DL Boost and 2nd Gen Intel Xeon Scalable processors, we have measured up to 7.7x the performance for an FP32 model and up to 19.5x the performance for an INT8 model when running ResNet-50 inference. [7] As a result of this collaboration, Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimizations are integrated directly into the PyTorch framework, enabling optimization of PyTorch models with minimal code changes. We’ve also published a blog that is a good way to get started with PyTorch on 2nd Gen Intel Xeon Scalable processors.
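
Because the Intel MKL-DNN kernels are built into PyTorch itself, CPU inference needs no special API calls; a plain evaluation loop like the illustrative one below already dispatches to them on Intel hardware. The random batch stands in for real images.

    import torch
    import torchvision.models as models

    # MKL-DNN (now oneDNN) support is compiled into standard PyTorch CPU builds.
    print("MKL-DNN available:", torch.backends.mkldnn.is_available())

    model = models.resnet50(pretrained=True).eval()    # pre-trained FP32 ResNet-50
    images = torch.randn(8, 3, 224, 224)               # dummy batch of 8 images

    with torch.no_grad():                              # inference only, no autograd overhead
        logits = model(images)

    print(logits.argmax(dim=1))                        # predicted class index per image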

Apache MXNet: The Apache MXNet community has delivered quantization approaches that enable INT8 inference and use of VNNI. We have measured 3.0x the ResNet-50 performance for an INT8 model with Intel DL Boost on 2nd Gen Intel Xeon Scalable processors in comparison to an FP32 model on initial Intel Xeon Scalable processors. [8] iFLYTEK, which is leveraging 2nd Gen Intel Xeon Scalable processors and Intel® Optane™ SSDs for its AI applications, has reported that Intel DL Boost delivers similar or better performance than inference on alternative architectures.
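
The sketch below outlines how a saved FP32 checkpoint can be converted with the MXNet contrib quantization API; it assumes the mxnet.contrib.quantization.quantize_model interface and placeholder checkpoint files, so treat it as an outline rather than a verified recipe.

    import mxnet as mx
    from mxnet.contrib.quantization import quantize_model

    # Load a saved FP32 model; the "resnet50_v1" checkpoint prefix is a placeholder.
    sym, arg_params, aux_params = mx.model.load_checkpoint("resnet50_v1", 0)

    # Small calibration iterator of random images standing in for real data.
    calib_data = mx.io.NDArrayIter(
        data=mx.nd.random.uniform(shape=(64, 3, 224, 224)),
        label=mx.nd.zeros(64),
        batch_size=32)

    # Convert the FP32 graph to INT8, calibrating activation ranges from the
    # calibration batches ("naive" = simple min/max calibration).
    qsym, qarg_params, aux_params = quantize_model(
        sym=sym, arg_params=arg_params, aux_params=aux_params,
        ctx=mx.cpu(), calib_mode="naive",
        calib_data=calib_data, num_calib_examples=64)

    # Save the quantized model for INT8 inference with Intel DL Boost.
    mx.model.save_checkpoint("resnet50_v1_int8", 0, qsym, qarg_params, aux_params)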

PaddlePaddle: Intel and Baidu have collaborated since 2016 to optimize PaddlePaddle performance for Intel architecture. We’ve published an in-depth overview of INT8 support in PaddlePaddle. In Intel’s testing, INT8 inference delivered 2.8x the throughput for ResNet-50 v1.5 with just 0.4% accuracy loss in comparison to an earlier FP32 model. [9]

Intel Caffe: Our Intel Caffe GitHub Wiki explains how to use our Calibrator tool to run low-precision inference that speeds up performance without losing accuracy. JD.com collaborated with Intel engineers to use Intel DL Boost to increase the performance of a text detection application by 2.4x with no accuracy degradation in comparison to an earlier FP32 model. [10]

Enabling Cutting-Edge AI on Intel® Architecture

With all-new software libraries and optimizations, coupled with hardware innovation, CPUs have never been more performant for AI than they are today. My team and I look forward to continuing to deliver these AI breakthroughs on Intel architecture. To follow our work, please stay tuned to intel.ai and follow Intel AI on Twitter at @IntelAI and @IntelAIResearch.

[1] My thanks to my Intel colleagues Jayaram Bobba, Wei Wang, Ramesh AG, Eric Lin, Jason Ye, Jiong Gong, and Wei Li for their assistance with this post.

[2] Intel Xeon Scalable processors with Intel AVX-512 optimizations (December 2018) provide up to 5.7x the performance of Intel Xeon Scalable processors at launch (July 2017); for details, see https://bit.ly/2WLijLn, slides 14 and 32.

[3] https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training

[4] https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training

[5] 2nd Gen Intel Xeon Scalable processors with Intel DL Boost provide up to 2x inference throughput in comparison to initial Intel Xeon Scalable processors; for details, see https://bit.ly/2WLijLn, slides 14 and 32.

[6] Source: Dell EMC, https://blog.dellemc.com/en-us/accelerating-insight-2nd-generation-intel-xeon-scalable-processors/

[7] 2nd Gen Intel Xeon Scalable processors with Intel DL Boost provide up to 7.7x the performance for an FP32 model and 19.5x the performance for an INT8 model in comparison to an FP32 model on 2nd Gen Intel Xeon Scalable processors without Intel MKL-DNN optimization; for details, see https://software.intel.com/en-us/articles/intel-and-facebook-collaborate-to-boost-pytorch-cpu-performance.

[8] MXNet ResNet-50 throughput performance on Intel® Xeon® Platinum 8280 Processor. New configuration, tested by Intel as of 3/1/2019: 2-socket Intel® Xeon® Platinum 8280 Processor, 28 cores, HT on, Turbo on, total memory 384 GB (12 slots / 32 GB / 2933 MHz), BIOS SE5C620.86B.0D.01.0271.120720180605 (ucode 0x4000013), CentOS 7.6, kernel 4.19.5-1.el7.elrepo.x86_64; deep learning framework: MXNet (https://github.com/apache/incubator-mxnet/, -b master, commit da5242b732de39ad47d8ecee582f261ba5935fa9); compiler gcc 4.8.5; MKL-DNN v0.17; ResNet-50: https://github.com/apache/incubator-MXNet/blob/master/python/MXNet/gluon/model_zoo/vision/resnet.py; BS=64; synthetic data; 2 instances / 2 sockets; datatype INT8. Baseline configuration, tested by Intel as of 3/1/2019: 2-socket Intel® Xeon® Platinum 8180 Processor, 28 cores, HT on, Turbo on, total memory 384 GB (12 slots / 32 GB / 2633 MHz), BIOS SE5C620.86B.0D.01.0286.121520181757, CentOS 7.6, kernel 4.19.5-1.el7.elrepo.x86_64; deep learning framework: MXNet (same repository and commit); compiler gcc 4.8.5; MKL-DNN v0.17; ResNet-50: same model source; BS=64; synthetic data; 2 instances / 2 sockets; datatypes INT8 and FP32.

[9] 2nd Gen Intel Xeon Scalable processors with Intel DL Boost provide up to 2.8x the throughput for ResNet-50 v1.5 with just 0.4% accuracy loss; for details, see https://www.intel.ai/int8-inference-support-in-paddlepaddle-on-2nd-generation-intel-xeon-scalable-processors/.

[10] 2nd Gen Intel Xeon Scalable processors with Intel DL Boost provide up to 2.4x text detection performance; for details, see https://bit.ly/2WLijLn, slides 15 and 33.

Intel, the Intel logo, Xeon, and Optane are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.