The Issue:
PyTorch provides an FX toolkit for developers to transform a torch.nn.Module into a torch.fx.GraphModule. With the generated GraphModule, FX can execute static quantization by automatically inserting quantize and dequantize operations.
Converting an imperative model into a graph model is useful because the graph form enables optimizations, such as post-training static quantization, that deliver better performance.
However, FX cannot trace dynamic control flow automatically, and many models contain data-dependent branches that block the transformation.
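A minimal sketch of the failure mode, assuming a toy module (ControlFlowModel below is hypothetical): symbolic tracing raises an error as soon as the forward pass branches on a tensor value.

import torch
import torch.fx as fx

class ControlFlowModel(torch.nn.Module):
    """Hypothetical module with data-dependent control flow."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        # The branch depends on the runtime value of x, which symbolic
        # tracing cannot resolve at trace time.
        if x.sum() > 0:
            return self.linear(x)
        return -self.linear(x)

try:
    fx.symbolic_trace(ControlFlowModel())
except Exception as err:
    # Tracing raises a TraceError, which blocks FX-based quantization of
    # the whole model even though most of it is traceable.
    print(f"{type(err).__name__}: {err}")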
The Solution: Fine-Grained FX
Fine-grained FX makes quantization easy for models that contain dynamic control flow. It is integrated into the pytorch_fx backend of Intel Neural Compressor and supports three popular quantization methods:
- post-training dynamic quantization
- post-training static quantization
- quantization-aware training
PyTorch recommends post-training dynamic quantization for NLP models because computing scales and zero-points at run time yields stable accuracy after quantization.
Post-training static quantization instead uses fixed scales and zero-points. It keeps consecutive quantized modules together, avoiding redundant quantize and dequantize operations.
In theory, static quantization therefore performs better than dynamic quantization. Quantization-aware training adds a training step that adjusts the model weights to reduce quantization loss, so it can deliver high accuracy while retaining the performance of static quantization.
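As a rough sketch of how one of these approaches is selected with Intel Neural Compressor, assuming the 2.x quantization.fit API (imports and parameter names may differ across versions; the toy model and calibration data below are placeholders for your own):

import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# Stand-in FP32 model and calibration data; replace with your own.
fp32_model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
)
calib_dataloader = DataLoader(
    TensorDataset(torch.randn(32, 16), torch.zeros(32, dtype=torch.long)),
    batch_size=8,
)

# approach="static" collects fixed scales/zero-points from the calibration
# data; approach="dynamic" computes them at run time and needs no calibration.
conf = PostTrainingQuantConfig(approach="static")
int8_model = quantization.fit(
    model=fp32_model,
    conf=conf,
    calib_dataloader=calib_dataloader,
)
int8_model.save("./int8_model")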
Because an imperative model is composed of many blocks, fine-grained FX aggressively and recursively detects the blocks that are suitable for module transformation.
Two examples are shown below: natural language processing (Figure 1) and object detection (Figure 2).
The darker green blocks are detected as suitable for module transformation because they are the largest blocks without any control flow. We leverage the FX toolkit on these blocks and do quantization automatically. By reassembling these processed blocks using the original control flows, the resulting model maintains the same behavior and provides higher performance by leveraging INT8.
Figure 1. Fine-grained FX for BERT natural language processing
Figure 2. Fine-grained FX for YOLO-V2 object detection
Adopting Our Solution
Intel® Neural Compressor for NLP
We provide two examples for natural language processing models based on Hugging Face Transformers. You can easily replace the input model with your own and quantize it based on fine-grained FX:
- Post-training static quantization for text classification
- Quantization-aware training for text classification
Hugging Face Optimum-Intel
Optimum-Intel is an extension of Transformers that enables the use of popular compression techniques such as quantization and pruning via Intel Neural Compressor. All tasks in Optimum-Intel support fine-grained FX: language modeling, multiple choice, question answering, summarization, text classification, token classification, and translation.
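A hedged sketch of quantizing a text-classification model with Optimum-Intel, assuming the INCQuantizer API (import paths and argument names vary between releases):

from transformers import AutoModelForSequenceClassification
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
# Dynamic quantization needs no calibration data; for static quantization
# you would pass approach="static" plus a calibration_dataset.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=quantization_config,
    save_directory="distilbert-sst2-int8",
)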
We also uploaded several INT8 models to the Hugging Face model hub; they can be easily initialized and used with Intel Neural Compressor, e.g.:
from neural_compressor.utils.load_huggingface import OptimizedModel

# Download and load an INT8 model published by Intel on the Hugging Face hub
int8_model = OptimizedModel.from_pretrained(
    'Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static',
)
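If the returned object behaves like the original Transformers model, inference could look roughly like the sketch below; the presence of tokenizer files in the INT8 repository and the output layout are assumptions.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static'
)
inputs = tokenizer("A genuinely enjoyable film.", return_tensors="pt")
with torch.no_grad():
    outputs = int8_model(**inputs)
# Depending on the version, outputs may be a ModelOutput or a plain tuple;
# either way the first element holds the classification logits.
logits = outputs[0]
print(logits.argmax(dim=-1))  # 1 => positive for this SST-2 head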
| Model Name | Approach | INT8 Accuracy (acc/f1) | FP32 Accuracy (acc/f1) | Relative Loss | INT8 Size (MB) | FP32 Size (MB) | Compression Ratio |
|---|---|---|---|---|---|---|---|
|  | Post-training static quantization | 0.9255 | 0.9232 | -0.249% | 25 | 44.6 | 1.784 |
|  | Post-training dynamic quantization | 0.9051 | 0.912 | 0.757% | 547 | 1556.48 | 2.845 |
|  | Post-training static quantization | 0.7838 | 0.7915 | 0.973% | 133 | 418 | 3.143 |
|  | Post-training dynamic quantization | 0.8997 | 0.9042 | 0.498% | 174 | 418 | 2.402 |
|  | Quantization-aware training | 0.9142 | 0.9042 | -1.106% | 107 | 418 | 3.907 |
|  | Post-training static quantization | 0.8997 | 0.9042 | 0.498% | 120 | 418 | 3.483 |
|  | Post-training dynamic quantization | 0.8843 | 0.8928 | 0.952% | 180 | 422 | 2.344 |
|  | Post-training static quantization | 0.9859 | 0.9882 | 0.233% | 64.5 | 253 | 3.922 |
|  | Post-training static quantization | 0.9037 | 0.9106 | 0.758% | 65 | 255 | 3.923 |
|  | Post-training static quantization | 0.9007 | 0.8983 | -0.267% | 14 | 51.8 | 3.700 |
|  | Post-training static quantization | 0.9247 | 0.9138 | -1.193% | 121 | 476 | 3.934 |
|  | Post-training static quantization | 0.8893 | 0.8897 | 0.045% | 215 | 448 | 2.084 |
Future Work
Fine-grained FX aims to improve the productivity of PyTorch quantization, especially static quantization. We are continuously uploading INT8 models to the Hugging Face model hub for quick deployment.
We invite users to:
- Try Intel Neural Compressor and Hugging Face Optimum-Intel and share your models on the model hub.
- Check out Intel’s other AI Tools and Framework optimizations.
- Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.
See Related Content
On-Demand Webinars
- Accelerate AI Workloads with Intel® Optimization for PyTorch
- Improve IoT Inference with Quantization Techniques
- Accelerate AI Inference without Sacrificing Accuracy
Tech Articles & Blogs
- PyTorch Inference Acceleration with Intel® Neural Compressor
- Deep Learning Model Optimizations Made Easy (or at Least Easier)
- Accelerate PyTorch with Intel® Extension for PyTorch
- Increase PyTorch Inference Throughput by 4x
Get the Software
Intel® AI Analytics Toolkit
Accelerate end-to-end machine learning and data science pipelines with optimized deep learning frameworks and high-performing Python* libraries.