Introduction to Natural Language Embeddings
Natural language embeddings are fundamental to natural language processing (NLP). They represent text as vectors that capture semantic information, providing the numerical input that downstream NLP tasks need for computation.
Among different embedding models, BGE (BAAI General Embedding) stands out for its efficiency. With its small and base versions, it strikes a balance between speed and effectiveness, making it an ideal choice for text embedding. Beyond text embedding, BGE integrates seamlessly with vector databases, further expanding its potential.
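Before diving into optimization, here is a minimal sketch of how a BGE model turns sentences into embeddings and how two embeddings can be compared; the sentences, pooling choice, and similarity computation below are illustrative rather than part of the optimized pipeline.
import torch
from transformers import AutoTokenizer, AutoModel
# Illustrative only: embed two sentences with the FP32 bge-small-en-v1.5 model
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
sentences = ["The weather is lovely today.", "It is sunny outside."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# BGE uses the [CLS] token representation as the sentence embedding
embeddings = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
# Cosine similarity of normalized vectors reduces to a dot product
print("Similarity:", (embeddings[0] @ embeddings[1]).item())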
We present an approach that leverages Intel® Extension for Transformers, an open-source toolkit, to significantly improve the inference speed of BGE small models while maintaining accuracy.
INT8 Static Post-Training Quantization
Static post-training quantization (PTQ) is an effective approach to quantize a model without additional training steps. It requires calibration with a representative dataset to determine the quantization parameters (e.g., scale, zero point) of the model. We apply PTQ with automatic accuracy-aware tuning to bge-small-en-v1.5 to produce an optimal quantized model. The code snippet below shows how to leverage post-training quantization to optimize the BGE-small model. The CQADupStack dataset is used for calibration, and the MTEB STS task is used as the evaluation benchmark. The complete code is available here. (See the readme for documentation.)
from intel_extension_for_transformers.transformers import metrics, objectives, QuantizationConfig
from intel_extension_for_transformers.transformers.trainer import NLPTrainer
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(......)
trainer = NLPTrainer(......)
metric = metrics.Metric(
    name="eval_accuracy", is_relative=True, criterion=0.01
)
objective = objectives.performance
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metric],
    objectives=[objective]
)
model = trainer.quantize(quant_config=q_config, eval_func=mteb_sts_eval)
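For intuition, the sketch below shows in plain NumPy how static calibration can derive a scale and zero point from representative data and then apply them to quantize a tensor to INT8; it is a conceptual illustration, not the actual code path used by Intel Extension for Transformers.
import numpy as np

def calibrate(calibration_data, qmin=-128, qmax=127):
    # Derive asymmetric INT8 quantization parameters from calibration statistics
    scale = (calibration_data.max() - calibration_data.min()) / (qmax - qmin)
    zero_point = int(round(qmin - calibration_data.min() / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

calibration_data = np.random.randn(4, 384).astype(np.float32)  # stand-in for real activations
scale, zero_point = calibrate(calibration_data)
x = np.random.randn(4, 384).astype(np.float32)
x_int8 = quantize(x, scale, zero_point)
print("max quantization error:", np.abs(dequantize(x_int8, scale, zero_point) - x).max())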
Unlock Faster NLP Inference with BGE
We now introduce a high-performance NLP backend designed to accelerate inference of these BGE models without compromising accuracy. We use Neural Engine, a lightweight, bare-metal inference backend, to unlock the performance of compressed NLP models; it combines hardware and software optimizations to maximize performance.
To streamline the process for developers, Intel Extension for Transformers extends Hugging Face’s familiar transformer APIs with easy-to-use model compression tools. This seamless integration allows users to leverage Neural Engine’s capabilities and optimize NLP models for faster inference, boosting productivity. The following example shows how to get started:
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel
sentences_batch = ['sentence-1', 'sentence-2', 'sentence-3', 'sentence-4']
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5')
encoded_input = tokenizer(sentences_batch,
                          padding=True,
                          truncation=True,
                          max_length=512,
                          return_tensors="np")
engine_input = [encoded_input['input_ids'], encoded_input['token_type_ids'], encoded_input['attention_mask']]
model = AutoModel.from_pretrained('./model_and_tokenizer/int8-model.onnx', use_embedding_runtime=True)
sentence_embeddings = model.generate(engine_input)['last_hidden_state:0']
print("Sentence embeddings:", sentence_embeddings)
Measuring Performance
We measured the optimal quantized BGE models on MTEB STS. The relative accuracy loss for all models is within 1% (Table 1):
Table 1. FP32 and INT8 accuracy of embedding models
We also measured embedding latency, defined as the average time in milliseconds to encode one sentence using one socket, 24 cores per instance, and a single instance, with sequence length = 512 and batch size = 1 (Figure 1):
Figure 1. Performance improvement of INT8 embedding models
Hardware configuration: Intel® Xeon® Platinum 8480+ processor, two sockets with 56 cores per socket, 2048 GB RAM (16 slots/128 GB/4800 MHz), HT: on. OS: Ubuntu* 22.04.2 LTS. Software configuration: Python* 3.9, NumPy 1.26.3, ONNX Runtime 1.13.1, ONNX 1.13.1, Torch 2.1.0+cpu, Transformers 4.36.2. Testing date: 01/26/2024.
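The latency measurement can be reproduced in spirit with a simple timing loop. The sketch below (single instance, batch size 1, sequence length 512, reusing the model and tokenizer from the inference snippet above) is a simplified illustration rather than the exact benchmarking harness we used.
import time
sentence = "An example sentence used to measure embedding latency."
encoded = tokenizer([sentence], padding="max_length", truncation=True,
                    max_length=512, return_tensors="np")
engine_input = [encoded["input_ids"], encoded["token_type_ids"], encoded["attention_mask"]]
for _ in range(10):  # warm up before timing
    model.generate(engine_input)
runs = 100
start = time.perf_counter()
for _ in range(runs):
    model.generate(engine_input)
print(f"Average latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms/sentence")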
Building a Chatbot Using the Optimized BGE Model
We have extended the LangChain embedding API in Intel Extension for Transformers, enabling users to load quantized BGE models as shown below:
from intel_extension_for_transformers.langchain.embeddings import HuggingFaceBgeEmbeddings
embed_model = HuggingFaceBgeEmbeddings(model_name="/path/to/quantized/bge/model")
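Assuming the extended class follows LangChain's standard embedding interface (embed_documents and embed_query), the loaded model can then be used directly; the texts below are illustrative:
doc_vectors = embed_model.embed_documents(["Intel Extension for Transformers accelerates BGE.",
                                           "Quantization reduces latency with minimal accuracy loss."])
query_vector = embed_model.embed_query("How can I speed up BGE inference?")
print(len(doc_vectors), len(query_vector))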
Furthermore, we have introduced a customizable chatbot framework called NeuralChat, which is part of Intel Extension for Transformers. This framework allows us to quickly build a chatbot on multiple architectures (e.g., Intel® Xeon® Scalable processors and Intel® Gaudi® AI accelerators). Here is an example showing how to use the quantized BGE small model to develop a chatbot application for knowledge retrieval:
from intel_extension_for_transformers.neural_chat import plugins
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import PipelineConfig
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "/path/to/docs"
plugins.retrieval.args["embedding_model"] = "/path/to/quantized/bge/model"
pipeline_config = PipelineConfig(model_name_or_path="facebook/opt-125m", plugins=plugins)
chatbot = build_chatbot(pipeline_config)
response = chatbot.predict(query="What is Intel extension for transformers?")
Concluding Remarks
We have demonstrated how Intel Extension for Transformers can quantize and optimize embedding models to achieve better performance. We plan to explore other model compression techniques (e.g., pruning) to further improve inference efficiency without sacrificing quality. We encourage you to give it a try and explore other Intel® AI tools. Star the Intel Extension for Transformers repository to receive notifications about our latest optimizations. You are also welcome to create pull requests or submit issues to the repository. Feel free to contact us if you have any questions.