In recent years, the complexity, capabilities, and applications of large language models (LLMs) have all grown dramatically, and so has their size: the number of parameters, weights, and activations continues to rise. However, broadening the range of deployment targets and reducing inference costs typically requires compressing LLMs without materially compromising their performance. Several techniques exist for shrinking large neural networks, including LLMs; one of the most important is quantization.
In this article, we present a code sample showing how to perform INT8 and INT4 quantization on an LLM (the Intel/neural-chat-7b-v3-3 model) with the Weight Only Quantization (WOQ) technique, using the Intel® Extension for Transformers tool.
What is Quantization?
Quantization is the process of moving from a high-precision representation (such as float32) for weights and/or activations to lower-precision data types such as float16, INT8, or INT4. Using lower precision can significantly reduce memory requirements. In theory this looks straightforward, but there are many nuances to be aware of, the most important being the compute data type. Not all operations support or have low-precision implementations, so the representation must sometimes be scaled back to high precision at runtime to perform certain operations. This adds overhead, but tools such as Intel® Neural Compressor, the OpenVINO™ toolkit, and Neural Speed reduce its impact. These runtimes provide optimized implementations of many operators for low-precision data types, so values do not have to be upscaled to high precision, which yields better performance while using less memory. The gains are especially significant when the hardware natively supports lower-precision data types; for example, 4th Gen Intel® Xeon® Scalable processors have built-in support for float16 and bfloat16.
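To make the idea concrete, here is a minimal, illustrative sketch of symmetric per-tensor INT8 quantization and the dequantization step that restores values to float32 for operators without low-precision kernels. This is a toy example for intuition only, not the scheme any particular runtime uses:

import numpy as np

# Symmetric, per-tensor INT8 quantization (illustrative sketch only).
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                  # scale back up for high-precision ops

w = np.random.randn(4, 4).astype(np.float32)             # pretend these are layer weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())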
On its own, then, quantization only reduces the memory footprint of the model and can even add some overhead during inference. To obtain performance benefits alongside the memory savings, one must run on recent hardware and use optimized runtimes.
Weight Only Quantization (WOQ)
There are several techniques for quantizing models. Generally, both the model weights and the activations (the output values produced by each neuron in a layer) are quantized. Weight Only Quantization, by contrast, quantizes only the model weights and leaves the activations in their original precision. The obvious benefits are a smaller memory footprint and faster inference. In practice, WOQ delivers these benefits without significantly impacting accuracy.
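The toy layer below sketches the idea, reusing the symmetric INT8 scheme from the earlier example: the weights are stored in INT8 while the activations stay in float32. Real runtimes use fused low-precision kernels rather than this explicit dequantize-then-matmul, so treat this as a conceptual illustration only:

import numpy as np

class WOQLinear:
    """Toy weight-only-quantized linear layer: weights stored as INT8,
    activations kept in float32 (conceptual sketch, not a library API)."""
    def __init__(self, weight_fp32: np.ndarray):
        self.scale = np.abs(weight_fp32).max() / 127.0
        self.q_weight = np.clip(np.round(weight_fp32 / self.scale), -128, 127).astype(np.int8)

    def __call__(self, x_fp32: np.ndarray) -> np.ndarray:
        # Dequantize the weights on the fly; the activation x stays float32 throughout.
        w = self.q_weight.astype(np.float32) * self.scale
        return x_fp32 @ w.T

layer = WOQLinear(np.random.randn(8, 16).astype(np.float32))
y = layer(np.random.randn(2, 16).astype(np.float32))
print(y.shape)  # (2, 8)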
Code Implementation
The code sample illustrates the process of quantizing the Intel/neural-chat-7b-v3-3 language model. This model, a fine-tuned iteration of Mistral-7B, is quantized using the Weight Only Quantization (WOQ) support provided by Intel® Extension for Transformers.
- Developers can unleash the power of Intel hardware for their Generative AI workloads with just a one-line code change: instead of importing AutoModelForCausalLM from the Hugging Face transformers library, import it from Intel Extension for Transformers, and everything else remains as is.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
- For INT8 quantization, just set load_in_8bit to True.
# INT8 quantization
q8_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
- Similarly, for INT4 quantization set load_in_4bit to True.
# INT4 quantization
q4_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
The rest of the implementation is the same as with the Hugging Face transformers library, as the end-to-end sketch below shows.
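The following hedged sketch shows how the quantized model might be used for generation, mirroring the usage pattern in the Intel Extension for Transformers documentation; the prompt text and generation settings are illustrative assumptions:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-3"
prompt = "Once upon a time, there existed a little girl,"  # illustrative prompt

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# INT4 Weight Only Quantization happens at load time via the one-line change.
q4_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = q4_model.generate(inputs, streamer=streamer, max_new_tokens=300)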
The above code snippets use bitsandbytes for quantization if the device is set to GPU, so the same code runs efficiently on either a CPU or a GPU without any changes.
Running GGUF model
GGUF is a binary file format designed specifically for storing deep learning models, such as LLMs, particularly for inference on CPUs. It offers several key advantages, including efficiency, single-file deployment, and quantization. To make the most of our Intel hardware, we will be using the model in GGUF format.
Generally, to run models in GGUF format one needs an additional library such as llama.cpp. However, because Neural Speed is built on top of llama.cpp, you can continue using our Intel Extension for Transformers library to run GGUF models as well.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_0.gguf"
)
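As with the earlier example, generation then follows the usual transformers pattern. The sketch below assumes the tokenizer is loaded from the original (non-GGUF) Llama 2 chat repository, which is gated on Hugging Face; any compatible Llama 2 chat tokenizer would work, and the prompt is illustrative:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Tokenizer comes from the original model repository (an assumption; this repo is gated).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids  # illustrative prompt
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_0.gguf"
)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)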
Try out the code sample for yourself. It shows how to quantize an LLM using Intel's AI Tools and Intel® Extension for Transformers, and how to make the most of your Intel hardware when developing Generative AI-based applications.
What’s Next?
We also encourage you to check out and incorporate Intel's other AI/ML framework optimizations into your AI workflow, and to learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.
Useful resources
- Intel AI Developer Tools and resources
- oneAPI unified programming model
- GitHub: Intel® Extension for Transformers
- Official Documentation: Intel® Extension for Transformers
- Article: Intel neural-chat-7b Model Achieves Top Ranking on LLM Leaderboard
- AI Concepts: Generative AI
- AI Concepts: Inference
See AI Related Content
Articles
- Accelerate Text Generation with LSTM Using Intel® Extension for TensorFlow*
- Accelerate Deep Learning Framework Performance on Intel® Processors
- Optimize Sentence-Transformer Embedding Models with PyTorch*