Quantization is a popular deep learning model optimization technique for improving inference speed. It reduces the number of bits required to represent the weights or activations of a neural network by converting real-valued numbers into lower-bit data representations such as INT8 and INT4, mainly for the inference phase, with minimal to no loss in accuracy.
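Conceptually, each real value is mapped to an integer through a scale factor and a zero point. The following is a minimal, framework-independent sketch of that affine mapping, intended only to illustrate the idea rather than to show how any particular tool implements it:

import numpy as np

def quantize_to_int8(x):
    # Map float values onto 0..255 using a per-tensor scale and zero point.
    scale = max(float(x.max() - x.min()) / 255.0, 1e-8)
    zero_point = round(-float(x.min()) / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original float values.
    return (q.astype(np.float32) - zero_point) * scale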
This article explains how to perform quantization with Intel® Neural Compressor and walks through a code sample that applies INT8 quantization to a PyTorch model.
Quantization using Intel® Neural Compressor
Intel® Neural Compressor is a model-compression tool that helps speed up AI inference without sacrificing accuracy. It provides three types of quantization APIs:
- Post-training dynamic quantization
- Post-training static quantization
- Quantization-aware training (QAT)
The steps to perform static quantization on an FP32 model using Intel Neural Compressor are:
- Prepare the quantization configuration - Use the PostTrainingQuantConfig class and set the approach to static.
conf = PostTrainingQuantConfig(approach="static")
- Prepare the calibration dataset - The calibration dataset should represent the data distribution of unseen data. In general, around 100 samples are enough for calibration.
- Fit the model - This is where quantization is performed. Provide all the necessary parameters: the model to quantize, the prepared configuration, the calibration dataloader, and the evaluation function (a consolidated sketch of these steps follows the list).
q_model = fit(model, conf=conf, eval_func=eval_func, calib_dataloader=data_loader)
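Putting the three steps together, a minimal end-to-end sketch could look like this; the import paths below assume the Intel Neural Compressor 2.x API, and model, eval_func, and data_loader are placeholders the user supplies:

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Static post-training quantization: a calibration dataloader is required.
conf = PostTrainingQuantConfig(approach="static")
q_model = fit(model, conf=conf, eval_func=eval_func, calib_dataloader=data_loader)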
The steps to perform dynamic quantization on an FP32 model using Intel Neural Compressor are:
- Prepare the quantization criteria - Specify how the accuracy-aware tuning process should run (for example, the maximum number of tuning trials) and how much accuracy loss is tolerable. This is done with the TuningCriterion and AccuracyCriterion classes.
tuning_criterion = TuningCriterion(max_trials=5)
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.1)
- Prepare the quantization configuration - Use the PostTrainingQuantConfig class, set the approach to dynamic, and pass in the criteria prepared above.
conf = PostTrainingQuantConfig(approach="dynamic", tuning_criterion=tuning_criterion, accuracy_criterion=accuracy_criterion)
- Fit the model - As with static quantization, the user needs to provide the model, the configuration, and the evaluation function. A calibration dataset is not needed.
q_model = fit(model, conf=conf, eval_func=eval_func)
The steps to perform quantization-aware training on an FP32 model using Intel Neural Compressor are:
- Prepare the quantization configuration - Use the QuantizationAwareTrainingConfig class:
conf = QuantizationAwareTrainingConfig()
- Prepare the compression manager - Provide the model to optimize and the prepared configuration:
compression_manager = prepare_compression(model, conf)
- Fit the model - As with static and dynamic quantization, the user needs to provide the model, the configuration (in the form of the compression manager), and the evaluation function. Additionally, a training function is needed, as quantization happens during the training process (a sketch of a possible training function follows this list).
q_model = fit(compression_manager=compression_manager, train_func=train_func, eval_func=eval_func)
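The training function passed to fit is written by the user. A hypothetical minimal version for a PyTorch classifier is sketched below; the optimizer, loss, train_loader, and single training epoch are illustrative assumptions rather than part of the Intel Neural Compressor API:

import torch

def train_func(model):
    # Hypothetical fine-tuning loop executed during quantization-aware training.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for inputs, labels in train_loader:  # train_loader is assumed to exist
        optimizer.zero_grad()
        outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
        loss = torch.nn.functional.cross_entropy(outputs.logits, labels)
        loss.backward()
        optimizer.step()
    return model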
Code sample
Dataset
Static quantization requires a calibration dataset; in this code sample, the IMDB dataset from Hugging Face is used. IMDB is a large movie-review dataset for binary sentiment classification collected from the IMDB website. It contains 50,000 highly polar movie reviews split into two sets, training and testing, each with 25,000 texts. Every text has a label belonging to one of two classes: 0 for a negative review and 1 for a positive one.
Implementation
The code sample demonstrates a real-world text-classification use case with a Hugging Face model. We first use the stock FP32 PyTorch model to generate predictions, then perform INT8 quantization with the easy-to-use APIs provided by Intel Neural Compressor to see the speedup gained over stock PyTorch on Intel® hardware.
1. Packages and helper functions
Import all the necessary packages. Make sure that the installed Intel Neural Compressor version is 2.0 or above.
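The imports used throughout the sample could look like the following (the neural_compressor import paths assume the 2.x API):

import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion
from neural_compressor.quantization import fit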
The next step is to create some helper functions, which will later help us compare the results of the model before and after Intel Neural Compressor quantization (a rough sketch of both is shown after this list):
- get_average_inference_time - to warm up the model and measure its average runtime.
- plot_speedup - to plot a bar chart comparing the time taken by the stock PyTorch model and by the quantized model.
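Both helpers are defined in the sample itself; the sketch below shows roughly what they could look like, assuming data is a batch of tokenized tensors ready to feed the model (the warm-up count, number of timed runs, and plotting details are assumptions):

import time
import matplotlib.pyplot as plt
import torch

def get_average_inference_time(model, data, warmup=3, runs=10):
    # Warm the model up, then average the latency over several timed runs.
    with torch.no_grad():
        for _ in range(warmup):
            model(**data)
        start = time.time()
        for _ in range(runs):
            model(**data)
    return (time.time() - start) / runs

def plot_speedup(inference_time_stock, inference_time_optimized):
    # Bar chart comparing stock FP32 latency with the quantized model's latency.
    plt.bar(["Stock PyTorch", "Quantized (INT8)"], [inference_time_stock, inference_time_optimized])
    plt.ylabel("Average inference time [s]")
    plt.title(f"Speedup: {inference_time_stock / inference_time_optimized:.2f}x")
    plt.show()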
2. Model
An FP32 model is used in this code sample, and quantization will be performed on it. We are using a BERT model fine-tuned on the IMDB dataset, available on Hugging Face under the checkpoint 'JiaqiLee/imdb-finetuned-bert-base-uncased'. Using this name, the user can load the configuration, tokenizer, and model.
model_name = "JiaqiLee/imdb-finetuned-bert-base-uncased"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
3. Dataset
- Prepare the dataset - Use the load_dataset function from the datasets library.
from datasets import load_dataset
data = load_dataset("imdb")
- Tokenize the dataset - The torch.utils.data.Dataset class IMDBDataset allows us to prepare a tokenized dataset easily (a sketch of such a class follows this list).
from dataset import IMDBDataset
text = data['test']['text']
labels = data['test']['label']
test_dataset = IMDBDataset(text, labels, tokenizer=tokenizer, data_size=1200)
- Create the data loader - The last dataset-related step is creating the data loader:
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64)
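IMDBDataset is a small helper module shipped with the code sample. A hypothetical implementation along the same lines is sketched below; the field names, padding, and truncation settings are assumptions:

import torch
from torch.utils.data import Dataset

class IMDBDataset(Dataset):
    # Tokenizes raw review texts and pairs them with their sentiment labels.
    def __init__(self, texts, labels, tokenizer, data_size=None, max_length=256):
        if data_size is not None:
            texts, labels = texts[:data_size], labels[:data_size]
        self.encodings = tokenizer(texts, truncation=True, padding="max_length",
                                   max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        inputs = {key: value[idx] for key, value in self.encodings.items()}
        return inputs, self.labels[idx]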
4. Evaluation function
Creating an evaluation function is an important part of performing quantization with Intel Neural Compressor. The evaluation function computes the metric, such as accuracy or F1 score, that the user wants to preserve after quantization. In this case, the metric is accuracy.
def eval_func(model_q):
    # Run the model over the test set and return its accuracy.
    test_preds = []
    test_labels = []
    for _, batch in enumerate(test_loader):
        inputs, labels = batch
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        pred = model_q(
            input_ids=ids,
            attention_mask=mask,
        )
        test_preds.extend(pred.logits.argmax(-1))
        test_labels.extend(labels)
    return accuracy_score(test_preds, test_labels)
5. Benchmark Stock PyTorch Model
Create a benchmark of the FP32 model so the effect of quantization can be measured. We can benchmark the model using the helper function prepared earlier:
inference_time_stock = get_average_inference_time(model.eval(), data)
6. Dynamic quantization
Perform dynamic quantization as explained in the above section.
tuning_criterion = TuningCriterion(max_trials=5)
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.1)
conf = PostTrainingQuantConfig(approach="dynamic", tuning_criterion=tuning_criterion, accuracy_criterion=accuracy_criterion)
q_model = fit(model, conf=conf, eval_func=eval_func)
Then, measure the inference time of the dynamically quantized model and compare it with the stock PyTorch version:
inference_time_optimized = get_average_inference_time(q_model.eval(), data)
plot_speedup(inference_time_stock, inference_time_optimized)
7. Static quantization
Perform post-training static quantization.
conf = PostTrainingQuantConfig(approach="static")
q_model = fit(model, conf=conf, eval_func=eval_func, calib_dataloader=test_loader)
As with dynamic quantization, we can measure the inference time and compare it with the stock model.
inference_time_optimized = get_average_inference_time(q_model.eval(), data)
plot_speedup(inference_time_stock, inference_time_optimized)
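If the quantized model will be reused later, it can be written to disk; recent Intel Neural Compressor releases expose a save method on the returned model object (the output path below is arbitrary):

# Persist the INT8 model so it can be reloaded without re-running quantization.
q_model.save("./saved_quantized_model")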
What’s Next?
Perform quantization on a PyTorch model using Intel Neural Compressor and check out the performance improvements over stock PyTorch on Intel hardware. Download and try the AI Tools and Intel® Neural Compressor for yourself to build various end-to-end AI applications.
We also encourage you to check out and incorporate Intel's other AI/ML framework optimizations and end-to-end portfolio of tools into your AI workflow, and to learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.
Useful resources
- Intel AI Developer Tools and resources
- oneAPI unified programming model
- Official documentation - Intel® Neural Compressor
- An Easy Introduction to Intel® Neural Compressor Article
- YouTube - AI Model Optimization with Intel® Neural Compressor
- Official documentation - PyTorch* Optimizations from Intel
- Intel® Extension for PyTorch* - Documentation
- AI Concepts: Inference