Interest in modern natural language processing (NLP) frameworks has greatly increased in recent years. One such library is Sentence Transformers. Sentence-transformer models are built on transformer architectures to create embeddings that encode the semantic meaning of complete sentences. These models convert sentences into high-dimensional vectors, which can be used for a variety of NLP tasks. It is important to optimize sentence-transformer models to improve performance and efficiency.
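As a quick point of reference, the minimal sketch below shows what an unoptimized sentence-transformer workflow looks like; it uses the same BAAI/bge-large-en-v1.5 model that appears in the code sample later in this article:

from sentence_transformers import SentenceTransformer

# Load a pretrained sentence-transformer model (same model used later in this article).
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Encode a sentence into a single embedding vector.
embedding = model.encode('This framework generates embeddings for each input sentence')

# The result is a 1-D NumPy array whose length is the model's embedding dimension.
print(embedding.shape)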
The capabilities of sentence-transformer models can be enhanced by leveraging the PyTorch framework and bfloat16 (BF16) optimizations. This article provides a comprehensive guide on how to optimize your sentence-transformer embedding models for Intel architectures using Intel® Extension for PyTorch*, focusing on the utilization of BF16 optimizations in graph mode.
How to utilize Intel® Extension for PyTorch*
Intel works closely with the open source PyTorch project to upstream optimizations into the framework by default. The Intel extension enables users to apply the newest performance optimizations that are not yet in PyTorch with minimal code changes. Learn how to install it standalone or get it as part of the AI Tools. The extension can be loaded as a Python* module or linked as a C++ library. Python users can enable it dynamically by importing intel_extension_for_pytorch, as sketched in the example after the tutorial links below.
- The CPU tutorial gives detailed information about Intel Extension for PyTorch for Intel CPUs. Source code is available at the main branch.
- The GPU tutorial gives detailed information about Intel Extension for PyTorch for Intel GPUs. Source code is available at the xpu-main branch.
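For example, a minimal sketch of enabling the extension for inference looks like the following; ipex.optimize is the same call used in the full code sample later in this article, and the BF16 data type is optional here:

import torch
import intel_extension_for_pytorch as ipex
from sentence_transformers import SentenceTransformer

# Load any PyTorch model; a sentence-transformer model is used here as a stand-in.
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
model.eval()

# Apply Intel Extension for PyTorch optimizations for inference.
# dtype=torch.bfloat16 is optional and requires hardware BF16 support.
model = ipex.optimize(model, dtype=torch.bfloat16)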
BF16 is a floating-point format that occupies 16 bits of computer memory but represents the approximate dynamic range of 32-bit floating-point numbers. 4th Gen Intel® Xeon® Scalable processors and later support acceleration for the BF16 data format, which offers a balance between the precision of FP32 and the computational efficiency of lower precision formats. This balance makes BF16 particularly well-suited for deep learning tasks, where it can accelerate computation while maintaining model accuracy.
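The trade-off is easy to inspect directly in PyTorch. The short sketch below compares per-element storage and representable range for FP32 and BF16 tensors:

import torch

# The same values stored in 32-bit and 16-bit (bfloat16) floating point.
fp32_tensor = torch.randn(1024, 1024, dtype=torch.float32)
bf16_tensor = fp32_tensor.to(torch.bfloat16)

# element_size() reports bytes per element: 4 for FP32, 2 for BF16.
print(fp32_tensor.element_size(), bf16_tensor.element_size())

# BF16 keeps the 8-bit exponent of FP32, so its dynamic range is nearly
# identical even though the mantissa is shorter.
print(torch.finfo(torch.float32).max, torch.finfo(torch.bfloat16).max)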
Intel’s optimizations for PyTorch provide enhanced performance through kernel optimizations, graph optimization, and support for BF16. By using Intel Extension for PyTorch in conjunction with BF16 optimizations, sentence-transformer models can achieve faster inference times and reduced memory usage, making them ideal for deployment on Intel's latest processors.
Intel® Advanced Matrix Extensions (Intel® AMX) is a dedicated hardware block found on the Intel® Xeon® Scalable processor core that helps optimize and accelerate deep learning training and inferencing workloads that rely on matrix math. Intel AMX introduces a new level of computational capability designed to accelerate deep learning inference and training tasks. When combined with Intel Extension for PyTorch and BF16 data format, it unlocks significant performance improvements for AI workloads.
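Whether a machine exposes Intel AMX can be checked from its CPU flags. The sketch below assumes a Linux system, where /proc/cpuinfo reports the amx_bf16, amx_tile, and amx_int8 flags on supported processors and sufficiently recent kernels:

# Assumes Linux: /proc/cpuinfo lists AMX-related CPU flags on supported
# processors and sufficiently recent kernels.
with open('/proc/cpuinfo') as f:
    cpuinfo = f.read()

amx_flags = [flag for flag in ('amx_bf16', 'amx_tile', 'amx_int8') if flag in cpuinfo]
print('AMX flags detected:', amx_flags or 'none')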
To leverage the above optimizations more easily, we can wrap the Sentence Transformer model in a new class that enables Intel Extension for PyTorch optimizations in graph mode.
Benefits of Using Graph Mode with BF16
Graph mode execution with BF16 optimizations offers several advantages:
- Reduced Memory Footprint: By optimizing the model's operations and data types, the memory usage is significantly reduced, enabling the deployment of larger models or batch sizes.
- Increased Performance: Graph mode execution streamlines the model's computation graph, leading to faster inference times by eliminating unnecessary operations and leveraging Intel® AMX.
- Maintained Accuracy: Despite the reduced precision, the use of BF16 maintains the model's accuracy, making it a suitable optimization for most NLP tasks.
Code Sample
In the code snippet below, the SentenceTransformerGraphMode class initializes a SentenceTransformer model for processing sentences and optimizes it for performance by converting its data type to bfloat16 and applying Intel Extension for PyTorch. The model is then compiled using PyTorch's JIT tracing, which allows for graph optimization and faster execution, and is set to use automatic mixed precision to handle different data types efficiently during computation.
import torch
import intel_extension_for_pytorch as ipex
from sentence_transformers import SentenceTransformer


class SentenceTransformerGraphMode:
    """
    A class to optimize and use Sentence Transformers in graph mode using JIT compilation and bfloat16 precision.

    Attributes:
        model (SentenceTransformer): An instance of the SentenceTransformer class optimized for inference.
        compiled_model (torch.jit.ScriptModule): A JIT compiled and optimized model for faster inference.

    Methods:
        __init__(model_name: str, example_sentences: list): Initializes the transformer model with specified settings.
        encode(sentences: list, batch_size: int = 32): Encodes given sentences into embeddings using the compiled model.
    """

    def __init__(self, model_name: str, example_sentences: list):
        """
        Initializes the SentenceTransformer model, optimizes it, and creates a JIT compiled version for fast inference.

        Parameters:
            model_name (str): The name of the Sentence Transformer model to load.
            example_sentences (list): A list of example sentences used to trace the model for JIT compilation.
        """
        # Initialize the Sentence Transformer model with the provided model name.
        self.model = SentenceTransformer(model_name)
        self.model.eval()  # Set the model to evaluation mode.

        # Optimize the model using Intel Extension for PyTorch* in bfloat16.
        self.model = ipex.optimize(self.model, dtype=torch.bfloat16)

        # Tokenize the example sentences.
        features = self.model.tokenizer(
            example_sentences, return_tensors="pt", padding=True, truncation=True
        )

        # Enable automatic mixed precision for more efficient model inference.
        with torch.cpu.amp.autocast():
            # Trace the model
            self.compiled_model = torch.jit.trace(
                self.model,
                (
                    {
                        "input_ids": features["input_ids"],
                        "attention_mask": features["attention_mask"],
                    }
                ),
                strict=False,
            )
            self.compiled_model = torch.jit.freeze(self.compiled_model)

    def encode(self, sentences: list, batch_size: int = 32, **kwargs):
        """
        Encodes a list of sentences into embeddings using the optimized and compiled model.

        Parameters:
            sentences (list): A list of sentences to encode.
            batch_size (int, optional): The number of sentences to process in each batch. Defaults to 32.

        Returns:
            torch.Tensor: A tensor containing the sentence embeddings produced by the model.
        """
        inputs = self.model.tokenizer(
            sentences, return_tensors="pt", padding=True, truncation=True
        )
        # Use the compiled model to compute sentence embeddings. Pass a plain dict
        # with the same keys that were used when tracing the model.
        outs = self.compiled_model(
            {
                "input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
            }
        )
        return outs["sentence_embedding"]
The encode method allows sentences to be encoded efficiently into embeddings, as the following usage example shows.
# Model to load
model_name = 'BAAI/bge-large-en-v1.5'
# Example sentences for model tracing
example_sentences = [
    'This framework generates embeddings for each input sentence',
    'Another example sentence to help with the tracing process'
]
# Create an instance of SentenceTransformerGraphMode
graph_mode_model = SentenceTransformerGraphMode(model_name, example_sentences)
# Sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence']
# Sentences are encoded by calling the encode method of the graph_mode_model
embeddings = graph_mode_model.encode(sentences)
# Print the embeddings
print(embeddings)
The example sentences serve as a proxy for the inputs the model would see in production. They help TorchScript trace a path through the model and record the operations performed, producing a graph. The code snippet shows how the SentenceTransformerGraphMode class can be used to optimize an embedding model with Intel Extension for PyTorch, using Intel AMX to accelerate the operations that run in BF16. Additionally, converting the model to a static graph provides a further boost, because it enables optimizations that are hard or impossible to apply to the dynamic computation graphs PyTorch uses natively. Intel Extension for PyTorch enables fusion of the most commonly used operator patterns in graph mode, so users get the performance benefit without additional code changes.
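To get a rough feel for the benefit, the graph-mode wrapper can be timed against the unoptimized model on the same input. The sketch below continues from the previous snippet (it reuses model_name and graph_mode_model); it is a rough wall-clock comparison rather than a rigorous benchmark, and actual results depend on the hardware:

import time
from sentence_transformers import SentenceTransformer

sentences = ['This framework generates embeddings for each input sentence'] * 32

# Baseline: the unoptimized model in eager mode.
baseline_model = SentenceTransformer(model_name)
start = time.perf_counter()
baseline_model.encode(sentences)
print(f'Eager mode: {time.perf_counter() - start:.3f} s')

# Optimized: the graph-mode wrapper created above.
graph_mode_model.encode(sentences)  # warm-up call so JIT optimizations are applied
start = time.perf_counter()
graph_mode_model.encode(sentences)
print(f'Graph mode + BF16: {time.perf_counter() - start:.3f} s')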
Next Steps
You can harness the optimizations from Intel Extension for PyTorch and leverage the full capabilities of Intel's hardware innovations to enhance your NLP applications. Download and try the AI Tools and Intel Extension for PyTorch for yourself to build various end-to-end AI applications.
We also encourage you to check out and incorporate Intel’s other AI/ML framework optimizations and end-to-end portfolio of tools into your AI workflow. Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio, which helps you prepare, build, deploy, and scale your AI solutions.
For more details about 4th Gen Intel Xeon Scalable processors, visit Intel's AI Solution Platform portal where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.
Useful resources
- Intel AI Developer Tools and resources
- oneAPI unified programming model
- Official documentation - PyTorch* Optimizations from Intel
- Intel® Extension for PyTorch* - Documentation
- AI Concepts: Machine Learning
- AI Concepts: Inference
- AI Concepts: Computer Vision
See PyTorch Related Content
Articles
- Optimize Text and Image Generation Using PyTorch*
- How to Build an Interactive Chat-Generation Model using DialoGPT and PyTorch*
- Language Identification: Building an End-to-End AI Solution using PyTorch*
Code Samples
- Optimize PyTorch* Inference Performance on GPUs Using Auto-Mixed Precision
- Optimize PyTorch Models using Intel® Extension for PyTorch (IPEX) Quantization
- PyTorch Training Optimizations with Advanced Matrix Extensions Bfloat16
Get the Software
Download Intel Extension for PyTorch as a part of the AI Tools Selector, or you can get its standalone version.