Intel® Neural Compressor

Speed Up AI Inference without Sacrificing Accuracy

  • Features
  • Documentation & Code Samples
  • Training
  • Specifications
  • Help

Deploy More Efficient Deep Learning Models

Intel® Neural Compressor performs model optimization to reduce model size and speed up deep learning inference for deployment on CPUs, GPUs, or Intel® Gaudi® AI accelerators. This open source Python* library automates popular model optimization techniques, such as quantization, pruning, and knowledge distillation, across multiple deep learning frameworks.

Using this library, you can:

  • Converge quickly on quantized models through automatic accuracy-driven tuning strategies (see the sketch after this list).
  • Prune the least important parameters for large models.
  • Distill knowledge from a larger model to improve the accuracy of a smaller model for deployment.
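
As a minimal sketch of the first point, the 2.x API wraps accuracy-driven post-training quantization in a single fit() call. The names model, calib_loader, and eval_fn below are placeholders for your own trained model, calibration dataloader, and accuracy function:

    # Accuracy-driven post-training quantization (Intel Neural Compressor 2.x API).
    # The tuner searches for an int8 model that stays within the configured
    # tolerable accuracy loss relative to the FP32 baseline.
    from neural_compressor import PostTrainingQuantConfig, quantization

    q_model = quantization.fit(
        model=model,                     # your trained PyTorch or TensorFlow model
        conf=PostTrainingQuantConfig(),  # default accuracy-driven tuning
        calib_dataloader=calib_loader,   # small calibration dataset
        eval_func=eval_fn,               # returns a scalar accuracy metric
    )
    q_model.save("./quantized_model")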

Intel Neural Compressor is part of the end-to-end suite of Intel® AI and machine learning development tools and resources.

 

Download the AI Tools

Intel Neural Compressor is available in the AI Tools Selector, which provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python* libraries.

Get the Tools Now
Download the Stand-Alone Version

A stand-alone download of Intel Neural Compressor is available. You can download binaries from Intel or choose your preferred repository.
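
For the repository route, the library is published on PyPI; a typical install into an existing Python environment is:

    pip install neural-compressor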

Download

      

Help Intel Neural Compressor Evolve

This open source component has an active developer community. We welcome you to participate.

Open Source Version (GitHub*)

Features

[Figure: architecture of Intel Neural Compressor]

Model Optimization Techniques

  • Quantize activations and weights to int8, FP8, or a mixture of FP32, FP16, FP8, bfloat16, and int8 to reduce model size and speed up inference while minimizing precision loss. Quantize during training, post-training, or dynamically, based on the runtime data range.
  • Prune parameters that have minimal effect on accuracy to reduce the size of a model. Configure pruning patterns, criteria, and schedule (see the sketch after this list).
  • Automatically tune quantization and pruning to meet accuracy goals.
  • Distill knowledge from a larger model (“teacher”) to a smaller model (“student”) to improve the accuracy of the compressed model.
  • Customize quantization with advanced techniques such as SmoothQuant, layer-wise quantization, and weight-only quantization (WOQ) for low-bit inference.
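
A sketch of how the pruning pattern, criterion, and schedule mentioned above map onto the 2.x training API; model and the surrounding training loop are placeholders:

    # Structured 4x1 pruning to 90% sparsity, applied gradually during training.
    from neural_compressor.config import WeightPruningConfig
    from neural_compressor.training import prepare_compression

    config = WeightPruningConfig(
        pruning_type="snip_momentum",  # importance criterion
        pattern="4x1",                 # structured block pattern
        target_sparsity=0.9,
        start_step=1000,               # schedule: when pruning starts and ends
        end_step=10000,
    )
    compression_manager = prepare_compression(model, config)
    compression_manager.callbacks.on_train_begin()
    model = compression_manager.model
    # ... run the usual training loop here, invoking the
    #     compression_manager.callbacks step hooks ...
    compression_manager.callbacks.on_train_end()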

Automation

  • Meet accuracy goals automatically: built-in tuning strategies decide which quantization technique to apply to each operation (see the sketch after this list).
  • Combine multiple model optimization techniques with one-shot optimization orchestration.
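
The accuracy criterion and tuning budget from the first bullet are explicit objects in the 2.x API; a sketch with model, calib_loader, and eval_fn again as placeholders:

    # Tune until the quantized model is within 1% relative accuracy loss,
    # trying at most 100 configurations.
    from neural_compressor import PostTrainingQuantConfig, quantization
    from neural_compressor.config import AccuracyCriterion, TuningCriterion

    conf = PostTrainingQuantConfig(
        accuracy_criterion=AccuracyCriterion(criterion="relative", tolerable_loss=0.01),
        tuning_criterion=TuningCriterion(max_trials=100, timeout=0),  # 0 = no time limit
    )
    q_model = quantization.fit(model, conf, calib_dataloader=calib_loader, eval_func=eval_fn)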

 

Interoperability

  • Optimize and export PyTorch* or TensorFlow* models.
  • Optimize and export Open Neural Network Exchange (ONNX*) Runtime models with Intel Neural Compressor 2.x. As of version 3.x, Intel Neural Compressor is upstreamed into open source ONNX for built-in cross-platform deployment.
  • Use familiar PyTorch, TensorFlow, or Hugging Face* Transformers-style APIs to configure and autotune model compression (see the sketch after this list).
  • TensorFlow int8 Quantization
  • PyTorch int8 Post-Training Quantization
  • PyTorch int8 Quantization-Aware Training
  • ONNX Runtime int8 Post-Training Quantization
  • PyTorch Pruning
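
The sketch below illustrates the framework-agnostic entry point in the 2.x API: fit() accepts a torch.nn.Module, a TensorFlow SavedModel directory, or an ONNX model, and dispatches on the type. The paths here are placeholders:

    from neural_compressor import PostTrainingQuantConfig, quantization

    q_model = quantization.fit(
        model="./my_saved_model",        # placeholder TensorFlow SavedModel path
        conf=PostTrainingQuantConfig(),
        calib_dataloader=calib_loader,
    )
    q_model.save("./my_quantized_model")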

Case Studies

Palo Alto Networks Reduces Inference Latency by 6x

To deliver the required response speed for multiple cybersecurity models, Palo Alto Networks quantized their models to int8, taking advantage of advanced instruction sets and accelerators.

Learn More

Sustainable AI with Intel®-Optimized Software and Hardware

HPE Services applied Intel AI software together with int8 post-training static quantization to reduce energy consumption by at least 68% across multiple experiments.

Learn More

delphai* Accelerates Natural Language Processing Models for Search Engines

By quantizing its models to int8, delphai* accelerated inference speed without sacrificing accuracy, enabling the use of less costly CPU-based cloud instances.

Learn More

Demonstrations

Microscaling (MX) Quantization

Quantize Microsoft* Floating Point (MSFP) data types to 8-, 6-, or 4-bit MX data types while balancing accuracy and memory consumption.

Learn More
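
A hypothetical sketch of MX quantization with the 3.x PyTorch API; the config class and dtype strings below (MXQuantConfig, "mxfp4", "mxfp8") are assumptions to verify against the current API reference:

    from neural_compressor.torch.quantization import MXQuantConfig, prepare, convert

    config = MXQuantConfig(w_dtype="mxfp4", act_dtype="mxfp8")  # assumed dtype names
    model = prepare(model, quant_config=config)  # model: placeholder nn.Module
    model = convert(model)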

The AutoRound Quantization Algorithm

Achieve near-lossless weight-only quantization (WOQ) compression for popular large language models (LLMs).

Learn More
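
A sketch of AutoRound weight-only quantization through the 3.x PyTorch API, assuming model is an already-loaded Hugging Face-style causal LM; all AutoRoundConfig arguments other than bits are left at their defaults:

    from neural_compressor.torch.quantization import AutoRoundConfig, prepare, convert

    config = AutoRoundConfig(bits=4)             # 4-bit weight-only quantization
    model = prepare(model, quant_config=config)
    # ... feed a small set of calibration batches through the model here
    #     so AutoRound can tune its rounding values ...
    model = convert(model)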

Quantize LLMs with SmoothQuant

LLMs tend to have large-magnitude outliers in certain activation channels. Learn how the SmoothQuant technique addresses this and how to use it to quantize a Hugging Face* Transformer model to 8-bit.

Learn More
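
In the 2.x API, SmoothQuant is enabled as a recipe on the post-training config; alpha controls how much quantization difficulty is migrated from activations to weights (model and calib_loader are placeholders):

    from neural_compressor import PostTrainingQuantConfig, quantization

    conf = PostTrainingQuantConfig(
        recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
    )
    q_model = quantization.fit(model, conf, calib_dataloader=calib_loader)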

Quantize Large Language Models with Just a Few Lines of Code

Quantizing LLMs to int4 reduces model size up to 8x, speeding inference. Learn how to get started applying weight-only quantization (WOQ) and see the accuracy impact on popular LLMs.

Learn More
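
With the 3.x PyTorch API, those few lines look roughly like the sketch below, here using basic round-to-nearest (RTN) weight-only quantization; model is a placeholder LLM and group_size=128 is just a common choice:

    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    config = RTNConfig(bits=4, group_size=128)  # int4 weights with per-group scales
    model = prepare(model, quant_config=config)
    model = convert(model)                      # int4 weights ~8x smaller than FP32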

Distill and Quantize BERT Text Classification

Perform knowledge distillation of the BERT base model and quantize to int8 using the Stanford Sentiment Treebank 2 (SST-2) dataset. The resulting BERT-Mini model performs inference up to 16x faster.

Learn More
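
A sketch of the 2.x knowledge distillation setup, assuming teacher and student are already-loaded BERT models and the loop runs over SST-2 batches:

    from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
    from neural_compressor.training import prepare_compression

    criterion = KnowledgeDistillationLossConfig(
        temperature=2.0,          # softens the teacher's logits
        loss_weights=[0.5, 0.5],  # balance of task loss vs. distillation loss
    )
    conf = DistillationConfig(teacher_model=teacher, criterion=criterion)
    compression_manager = prepare_compression(student, conf)
    model = compression_manager.model
    # ... standard fine-tuning loop, invoking compression_manager.callbacks
    #     hooks around each training step ...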

Quantization in PyTorch Using Fine-Grained FX

Convert an imperative model into a graph model, and perform dynamic quantization, quantization-aware training, or post-training static quantization.

Learn More
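
The tutorial builds on PyTorch's own FX graph-mode quantization; a minimal post-training static quantization sketch, with model and calib_loader as placeholders:

    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

    model.eval()
    qconfig_mapping = get_default_qconfig_mapping("x86")  # server CPU backend
    example_inputs = (torch.randn(1, 3, 224, 224),)
    prepared = prepare_fx(model, qconfig_mapping, example_inputs)
    for images, _ in calib_loader:                        # calibrate the observers
        prepared(images)
    quantized = convert_fx(prepared)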

Documentation & Code Samples

Documentation

  • Installation Guide (All Operating Systems)
  • Documentation & Tutorials
  • Tuning Strategies
  • API Documentation
  • Release Notes
  • System Requirements

 

View All Documentation

Code Samples

  • Get Started
  • Model Optimization: TensorFlow | PyTorch | ONNX Runtime
  • Meta* Llama 2 7B Weight-Only Quantization
  • ResNet*-18 Mixed Precision
  • Optimize VGG19 Model Inference on 4th Gen Intel Xeon Scalable Processors
     

More Samples

Training & Tutorials

Get Started with AI Model Optimization

Perform Dynamic Quantization on a Pretrained PyTorch Model

Quantize During Fine-Tuning with Hugging Face Optimum

Perform Structured Pruning on Transformer-Based Models
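
As a taste of the dynamic quantization tutorial above: no calibration data is needed, since weights are converted to int8 ahead of time and activations are quantized on the fly at runtime (model is a placeholder trained module):

    import torch

    quantized_model = torch.ao.quantization.quantize_dynamic(
        model,              # any trained torch.nn.Module
        {torch.nn.Linear},  # layer types to quantize
        dtype=torch.qint8,
    )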

Specifications

Processors:

  • Intel Xeon processor
  • Intel Xeon CPU Max Series
  • Intel® Core™ Ultra processor
  • Intel Gaudi AI accelerator
  • Intel® Data Center GPU Max Series

Operating systems:

  • Linux*
  • Windows*

Language:

  • Python

Get Help

Your success is our success. Access these support resources when you need assistance.

  • AI Tools Support Forum
  • Deep Learning Frameworks Support Forum

For additional help, see the general Support page.

Related Products

All AI Development Tools and Resources
