Overview
For data scientists and AI developers, one common task is optimizing deep learning models for inference. Intel® Neural Compressor is a tool that helps you easily perform model compression to reduce model size and increase the speed of deep learning inference for deployment on Intel hardware.
This article describes a code sample that shows how to accelerate inference for a TensorFlow* model with Intel Neural Compressor without sacrificing accuracy.
Optimizing TensorFlow Model Inference
TensorFlow is one of the most popular deep learning frameworks, and improving the inference performance of your TensorFlow model is an important part of optimizing your AI workflow. Intel Neural Compressor is an open source library that automates model compression technologies, such as quantization, pruning, and knowledge distillation, across multiple deep learning frameworks. This Python* library can quantize activations and weights to int8, bfloat16, or a mixture of FP32, bfloat16, and int8 to reduce model size and accelerate inference while minimizing precision loss. Intel Neural Compressor requires four elements to run model quantization and tuning (a minimal sketch of how they fit together follows the list):
- Calibration dataloader – a class that loads the dataset, for example images and their corresponding labels
- Model – an FP32 model to be quantized
- Configuration file – a YAML file that specifies all necessary parameters
- Evaluation function – a function that returns the accuracy achieved by the model on a given dataset
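The snippet below is a minimal sketch of how these four elements typically come together with the Intel Neural Compressor 1.x experimental API. The file names, the my_dataloader object, and the eval_func helper are placeholder assumptions rather than code from the sample, and depending on the library version the quantizer is invoked as quantizer.fit() or quantizer().

```python
from neural_compressor.experimental import Quantization, common

# Configuration file: YAML with quantization and tuning parameters.
quantizer = Quantization("./quant_conf.yaml")

# Model: the FP32 model to be quantized (a frozen TensorFlow *.pb graph here).
quantizer.model = common.Model("./fp32_frozen.pb")

# Calibration dataloader: an object that yields (images, labels) batches.
quantizer.calib_dataloader = my_dataloader

# Evaluation function: takes a model and returns its accuracy on a dataset.
quantizer.eval_func = eval_func

# Run quantization and accuracy-driven tuning; returns the int8 model.
q_model = quantizer.fit()  # on older versions: q_model = quantizer()
q_model.save("./int8_model.pb")
```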
Code Sample
This code sample shows the process of building a convolutional neural network (CNN) model to recognize handwritten numbers and demonstrates how to increase the inference performance by using Intel Neural Compressor. Intel Neural Compressor simplifies the process of converting the FP32 model to int8 or bfloat16 (BF16) and can achieve higher inference performance. In addition, Intel Neural Compressor tunes the quantization method to reduce the accuracy loss.
The following steps are implemented in the code sample:
- Setup
- Model training
- Quantization of the model using Intel Neural Compressor
- Performance comparison between models
Setup
- Import Python packages and verify that the correct versions are installed. The required packages are:
• TensorFlow 2.2 or later
• Intel Neural Compressor 1.2.1 or later
• Matplotlib
- Enable the Intel optimizations for TensorFlow by setting the TF_ENABLE_MKL_NATIVE_FORMAT=0 environment variable for TensorFlow 2.5 and later. The variable must be set before running Intel Neural Compressor to quantize the FP32 model and before deploying the quantized model (see the sketch after this list).
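As a rough illustration of this setup step, the environment variable can be set and the versions checked at the top of the script, before any model is loaded; the exact ordering in the sample's scripts may differ:

```python
import os

# Must be exported before Intel Neural Compressor quantizes the FP32 model
# or before the quantized model is deployed (TensorFlow 2.5 and later).
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

import tensorflow as tf
import neural_compressor as inc

print("TensorFlow version:", tf.__version__)          # expect 2.2 or later
print("Neural Compressor version:", inc.__version__)  # expect 1.2.1 or later
```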
Train a CNN Model Based on Keras
The code sample provides a Python script that runs all of the training steps; a rough sketch of this flow appears after the list.
- Load the dataset. This sample uses an MNIST dataset of handwritten digits.
- Train the model with the dataset. The number of training epochs is 3.
- Freeze and save the model to a single protobuf (*.pb) file. Set the input node name to x.
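The sketch below illustrates this kind of training and freezing flow, assuming a small Keras CNN and TensorFlow's convert_variables_to_constants_v2 helper for freezing; the layer sizes and file names are illustrative, not taken from the sample.

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

# Load the MNIST handwritten-digit dataset and scale pixels to [0, 1].
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0

# A small CNN classifier for the 10 digit classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1), name="x"),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)

# Freeze the trained model into a single protobuf graph with input node "x".
concrete = tf.function(lambda x: model(x)).get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype, name="x"))
frozen = convert_variables_to_constants_v2(concrete)
tf.io.write_graph(frozen.graph, ".", "fp32_frozen.pb", as_text=False)
```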
Model Quantization using Intel Neural Compressor
Similar to the training process, a Python script is prepared for quantization. It contains all the steps needed to quantize and tune the model, as explained in the previous section.
- Define the dataloader – a prepared class provides an iteration function that returns images and their labels in batches of the configured batch size. This sample uses the validation data of the MNIST dataset.
- Load the FP32 model that was saved previously.
- Define the configuration file. This sample uses the YAML file from the oneAPI code samples on GitHub*. This file holds all the parameters that Intel Neural Compressor needs to perform quantization and tuning. For more information about the YAML configuration file, see the template in the Intel Neural Compressor repository.
- Define the tuning function. Intel Neural Compressor quantizes the model using a validation dataset for tuning and returns a frozen, quantized int8 model. The defined function does this based on the given configuration and the path to the FP32 model.
- Define how the model is written to a file. For this purpose, the function save_int8_frezon_pb has been created.
- Call the auto_tune function to quantize the model, and remember to save the created model. A sketch of the dataloader and save helpers follows this list.
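As a hedged illustration of the dataloader and save steps, the following sketch mirrors the descriptions above; the class body and the save function are written from scratch here rather than copied from the sample, and it assumes the quantized model object exposes a graph_def attribute, as TensorFlow models returned by Intel Neural Compressor typically do. The quantization call itself follows the pattern sketched earlier in this article.

```python
import tensorflow as tf

class Dataloader:
    """Yields (images, labels) batches from the MNIST validation split."""
    def __init__(self, batch_size=32):
        _, (x_test, y_test) = tf.keras.datasets.mnist.load_data()
        self.images = x_test.astype("float32")[..., None] / 255.0
        self.labels = y_test
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.images), self.batch_size):
            end = start + self.batch_size
            yield self.images[start:end], self.labels[start:end]

def save_int8_frezon_pb(q_model, path):
    """Write the quantized frozen graph to a single protobuf file."""
    with tf.io.gfile.GFile(path, "wb") as f:
        f.write(q_model.graph_def.SerializeToString())
```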
Compare Models
The Python script profiling_inc.py is created to compare the performance of the FP32 and int8 models. The performance results are saved to files specific to the model being profiled. On supported hardware, Intel® Deep Learning Boost additionally helps speed up int8 inference. A rough sketch of this kind of profiling follows the list.
- Run the profiling_inc.py script with the original FP32 model. The results are saved in the 32.json file.
- Do the same with the int8 model. The results are saved in the 8.json file.
- A summary of the results for both models is shown using the draw_bar function.
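The sketch below shows one way such a comparison could be written; the output node name, JSON keys, and plotting details are assumptions for illustration and may differ from profiling_inc.py and draw_bar in the actual sample.

```python
import json
import time

import matplotlib.pyplot as plt
import tensorflow as tf

def benchmark(pb_path, images, runs=100):
    """Load a frozen graph and measure average inference latency per batch."""
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name="")
    with tf.compat.v1.Session(graph=graph) as sess:
        x = graph.get_tensor_by_name("x:0")
        y = graph.get_tensor_by_name("Identity:0")  # assumed output node name
        start = time.time()
        for _ in range(runs):
            sess.run(y, feed_dict={x: images})
        return (time.time() - start) / runs

# One batch of MNIST test images for timing.
_, (x_test, _) = tf.keras.datasets.mnist.load_data()
images = x_test.astype("float32")[..., None] / 255.0
images = images[:32]

fp32_latency = benchmark("fp32_frozen.pb", images)
int8_latency = benchmark("int8_model.pb", images)
with open("32.json", "w") as f:
    json.dump({"latency": fp32_latency}, f)
with open("8.json", "w") as f:
    json.dump({"latency": int8_latency}, f)

# Bar chart comparing the two models, similar in spirit to draw_bar.
plt.bar(["FP32", "int8"], [fp32_latency, int8_latency])
plt.ylabel("Average latency (s/batch)")
plt.savefig("comparison.png")
```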
Get the Software
AI Tools
Accelerate data science and AI pipelines—from preprocessing through machine learning—and provide interoperability for efficient model development.
Intel® Neural Compressor
Speed up AI inference without sacrificing accuracy with this open source Python library that automates popular model compression technologies.