What’s new
- More GenAI coverage and framework integrations to minimize code changes.
- New models supported: Llama 3.2 (1B & 3B), Gemma 2 (2B & 9B), and YOLO11.
- LLM support on NPU: Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct and Phi-3 Mini-Instruct.
- Noteworthy notebooks added: Sam2, Llama3.2, Llama3.2 - Vision, Wav2Lip, Whisper, and Llava.
- Preview: support for Flax, a high-performance Python neural network library based on JAX. Its modular design allows for easy customization and accelerated inference on GPUs.
- Broader Large Language Model (LLM) support and more model compression techniques.
- Optimizations for built-in GPUs on Intel® Core™ Ultra Processors (Series 1) and Intel® Arc™ Graphics include KV cache compression for memory reduction and improved usability, as well as model load time optimizations that improve first token latency for LLMs.
- Dynamic quantization was enabled to improve first token latency for LLMs on built-in Intel® GPUs without impacting accuracy on Intel® Core™ Ultra Processors (Series 1). Second token latency will also improve for large batch inference.
- A new method to generate synthetic text data has been implemented in the Neural Network Compression Framework (NNCF), allowing LLMs to be compressed more accurately using data-aware methods without datasets. Coming soon: this feature will also be accessible via Optimum Intel on Hugging Face.
- More portability and performance to run AI at the edge, in the cloud, or locally.
- Support for Intel® Xeon® 6 Processors with P-cores (formerly codenamed Granite Rapids) and Intel® Core™ Ultra 200V series processors (formerly codenamed Arrow Lake-S).
- Preview: GenAI API enables multimodal AI deployment with support for multimodal pipelines for improved contextual awareness, transcription pipelines for easy audio-to-text conversions, and image generation pipelines for streamlined text-to-visual conversions.
- Speculative decoding feature added to the GenAI API for improved performance and efficient text generation using a small draft model that is periodically corrected by the full-size model.
- Preview: LoRA adapters are now supported in the GenAI API for developers to quickly and efficiently customize image and text generation models for specialized tasks.
- The GenAI API now also supports LLMs on NPU, allowing developers to specify NPU as the target device, specifically for WhisperPipeline (for whisper-base, whisper-medium, and whisper-small) and LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct, and Phi-3 Mini-Instruct). Use driver version 32.0.100.3104 or later for best performance.
Now deprecated
- Python 3.8 is no longer supported.
OpenVINO™ Runtime
Common
- NumPy 2.x has been adopted for all currently supported components, including NNCF.
- A new Constant constructor has been added, enabling constants to be created from a data pointer as shared memory. Additionally, it can take ownership of a shared or other object, avoiding the two-step process of first wrapping memory into ov::Tensor.
- Files are now read via the async ReadFile API, reducing the bottleneck for LLM model load times on GPU.
- A CPU implementation of the SliceScatter operator is now available. It is used in models such as Gemma and supports increased LLM performance.
CPU Device Plugin
- Gold support of the Intel® Xeon® 6 platform with P-cores (formerly codenamed Granite Rapids) has been reached.
- Support for Intel® Core™ Ultra 200V series processors (formerly codenamed Arrow Lake-S) has been implemented.
- LLM performance has been further improved with Rotary Position Embedding optimization; Query, Key, and Value fusion; and multi-layer perceptron fusion optimizations.
- FP16 support has been extended with SDPA and PagedAttention, improving LLM performance via both native APIs and the vLLM integration.
- Models with LoRA adapters are now supported.
GPU Device Plugin
- The KV cache INT8 compression mechanism is now available for all supported GPUs. It enables a significant reduction in memory consumption, increasing performance with minimal impact on accuracy (it affects systolic devices slightly more than non-systolic ones). The feature is activated by default for non-systolic devices.
- LoRA adapters are now functionally supported on GPU.
- A new GPU weightless blob caching feature enables caching only the model structure and reusing the weights from the original model file. Use the new OPTIMIZE_SIZE property to activate it (see the sketch after this list).
- Dynamic quantization with INT4 and INT8 precisions has been implemented and is enabled by default on Intel® Core™ Ultra platforms, improving first token latency for LLMs.
NPU Device Plugin
- Models retrieved from the OpenVINO cache now have a smaller memory footprint: the plugin releases the cached model (blob) after the weights are loaded into NPU regions, reducing memory consumption during inference execution by one blob size. Model export is not available in this scenario. This optimization requires the latest NPU driver: 32.0.100.3104.
- A driver bug for ov::intel_npu::device_total_mem_size has been fixed. The plugin now reports 2 GB as the maximum allocatable memory for any driver that does not support graph extension 1.8; even if older drivers report a larger amount of available memory, memory allocation fails when 2 GB are exceeded. For any driver that supports graph extension 1.8 (or newer), the plugin reports the value exposed by the driver.
- A new API is used to initialize the model (available in graph extension 1.8).
- Inference request set_tensors is now supported.
- ov::device::LUID is now exposed on Windows.
- LLM-related improvements have been implemented in terms of both memory usage and performance.
- AvgPool and MaxPool operator support has been extended, adding support for more PyTorch models.
- NOTE: For systems based on Intel® Core™ Ultra Processors Series 2, more than 16 GB of RAM may be required to use larger models, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B (exceeding 4B parameters), with prompt sizes over 1024 tokens.
OpenVINO Python API
- Constants can now be created from openvino.Tensor (see the sketch after this list).
- The release_memory method has been added for compiled models, improving control over memory consumption.
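A minimal sketch of both Python API additions; the model path is a placeholder, and the exact Constant constructor arguments (for example, shared-memory flags) should be checked against the API reference:

```python
import numpy as np
import openvino as ov
from openvino.runtime import op

# Create a Constant node directly from an openvino.Tensor (new in this release).
tensor = ov.Tensor(np.ones((2, 2), dtype=np.float32))
constant = op.Constant(tensor)

# Compile a model, then release internal memory once inference is idle.
core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")  # placeholder model path
compiled.release_memory()
```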
OpenVINO Node.js API
- Querying the best device to perform inference of a model with specific operations is now available in the JavaScript API.
- Contribution guidelines have been improved to make it easier for developers to contribute.
- The testing scope has been extended with inference in end-to-end tests.
- JavaScript API samples have been improved for readability and ease of running.
TensorFlow Framework Support
- TensorFlow 2.18.0, Keras 3.6.0, NumPy 2.0.2 in Python 3.12, and NumPy 1.26.4 in other Python versions have been added to validation.
- Out-of-the-box conversion with static ranks has been improved by devising a new shape for Switch-Merge condition sub-graphs.
- Complex type for the following operations is now supported: ExpandDims, Pack, Prod, Rsqrt, ScatterNd, Sub.
- The following issues have been fixed:
- a corner case with one element in LinSpace, to avoid division by zero,
- missing support for FP16 and FP64 input types in LeakyRelu,
- missing support for non-i32/i64 output index types in ArgMin/Max operations.
PyTorch Framework Support
- PyTorch version 2.5 is now supported.
- OpenVINO Model Converter (OVC) now supports TorchScript and ExportedProgram saved to disk.
- The issue with aten.index.Tensor conversion for indices with None values has been fixed, helping to support the HF Stable Diffusion model in ExportedProgram format.
ONNX Framework Support
- ONNX version 1.17.0 is now used.
- Customers’ models with DequantizeLinear-21, com.microsoft.MatMulNBits, and com.microsoft.QuickGelu operations are now supported.
JAX/Flax Framework Support
- JAX 0.4.35 and Flax 0.10.0 have been added to validation.
- jax._src.core.ClosedJaxpr object conversion is now supported.
- Vision Transformer from google-research/vision_transformer is now supported (with support for 37 new operations).
OpenVINO Model Server
- The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG (a client sketch follows this list).
- The rerank endpoint has been added based on the Cohere API, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy.
- The following improvements have been done to LLM text generation:
- The echo sampling parameter, together with logprobs, is now supported in the completions endpoint.
- Performance has been increased on both CPU and GPU.
- Throughput in high-concurrency scenarios has been increased with dynamic_split_fuse for GPU.
- Testing coverage and stability have been improved.
- The procedure for service deployment and model repository preparation has been simplified.
- An experimental version of a Windows binary package - a native model server for Windows OS - is available. This release comes with a set of limitations and limited test coverage. It is intended for testing, while the production-ready release is expected with 2025.0. All feedback is welcome.
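A minimal client-side sketch of calling the new embeddings endpoint with the official OpenAI Python client; the base URL, port, API prefix, and model name are deployment-specific assumptions for illustration only:

```python
from openai import OpenAI

# Point the OpenAI client at a locally running OVMS instance
# (the /v3 prefix, port, and model name are assumptions; match your deployment).
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.embeddings.create(
    model="gte-large-en-v1.5",  # placeholder: name of the served embedding model
    input=["OpenVINO Model Server now exposes an embeddings endpoint."],
)
print(len(response.data[0].embedding))
```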
Neural Network Compression Framework
- A new nncf.data.generate_text_data() method has been added for generating a synthetic dataset for LLM compression. This approach helps to compress LLMs more accurately in situations when the dataset is not available or not sufficient. See our example for more information about the usage, and the sketch after this list for how it fits together with nncf.compress_weights().
- Support for data-free and data-aware weight compression methods - nncf.compress_weights() - has been extended with NF4 per-channel quantization, making compressed LLMs more accurate and faster on NPU.
- Caching of computed statistics in nncf.compress_weights() is now available, significantly reducing compression time when compressing the same LLM multiple times with different compression parameters. To enable it, set the advanced statistics_path parameter of nncf.compress_weights() to the desired file path location.
- The backup_mode optional parameter has been added to nncf.compress_weights(), for specifying the data type of embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (default), INT8_SYM, and NONE (retains the original floating-point precision of the model weights). In certain situations, a non-default value might give better accuracy for compressed LLMs.
- Preview support is now available for optimizing models in Torch FX format with the nncf.quantize() and nncf.compress_weights() methods. After optimization, such models can be executed directly via torch.compile(compressed_model, backend="openvino"). For more details, see the INT8 quantization example.
- Memory consumption of the data-aware weight compression method nncf.compress_weights() has been reduced significantly, with some variation depending on the model and method.
- Support for the following has changed:
- NumPy 2 added
- PyTorch upgraded to 2.5.1
- ONNX upgraded to 1.17
- Python 3.8 discontinued
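A hedged sketch of the new nncf.compress_weights() options named above, applied to an LLM already exported to OpenVINO IR. The IR paths and the INT4 mode choice are placeholders, and the exact homes of the new options (an nncf.BackupMode enum and a statistics_path field on AdvancedCompressionParameters) are assumptions to verify against the NNCF documentation. For data-aware compression, a dataset - for example, one built from the output of nncf.data.generate_text_data() - would additionally be passed via the dataset argument.

```python
import nncf
import openvino as ov
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

core = ov.Core()
model = core.read_model("llm_fp16/openvino_model.xml")  # placeholder path to an LLM IR

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    # New optional parameter: precision kept for embeddings, convolutions,
    # and the last linear layers during 4-bit compression (assumed enum name).
    backup_mode=nncf.BackupMode.INT8_ASYM,
    # New advanced option: cache computed statistics on disk so that repeated
    # compression runs of the same model are much faster (assumed parameter location).
    advanced_parameters=AdvancedCompressionParameters(
        statistics_path="compression_stats"
    ),
)

ov.save_model(compressed, "llm_int4/openvino_model.xml")
```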
OpenVINO Tokenizers
- Several operations have been introduced and optimized.
- Conversion parameters and environment info have been added to rt_info, improving reproducibility and debugging.
OpenVINO.GenAI
- The following has been added:
- LoRA adapter for the LLMPipeline.
- Text2ImagePipeline with LoRA adapter and text2image samples.
- VLMPipeline and visual_language_chat sample for text generation models with text and image inputs.
- WhisperPipeline and whisper_speech_recognition sample.
- speculative_decoding_lm has been moved to an LLMPipeline-based implementation and is now installed as part of the package.
- On NPU, a set of pipelines has been enabled: WhisperPipeline (for whisper-base, whisper-medium, and whisper-small) and LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct, and Phi-3 Mini-instruct). Use driver version 32.0.100.3104 or later for best performance. A minimal usage sketch follows this list.
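A minimal usage sketch of one of the newly enabled NPU pipelines with the Python GenAI API; the model directory is a placeholder for a model exported with optimum-cli, and the generation parameters are illustrative:

```python
import openvino_genai as ov_genai

# Load an exported LLM from a local directory and target the NPU device
# ("llama-3-8b-instruct-ov" is a placeholder for an exported model folder).
pipe = ov_genai.LLMPipeline("llama-3-8b-instruct-ov", "NPU")

result = pipe.generate("What is OpenVINO?", max_new_tokens=100)
print(result)
```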
Other Changes and Known Issues
Jupyter Notebooks
Known Issues
Component: CPU Plugin
ID: 155898
Description:
When using a new version of the Transformers library to convert some LLMs (GPT-J/GPT-NeoX or falcon-7b), inference accuracy may be impacted on 4th or 5th generation Intel® Xeon® processors, because a model structure update triggers an inference precision difference in part of the model. The workaround is to use Transformers version 4.44.2 or lower.
Component: GPU Plugin
ID: 154583
Description:
LLM accuracy can be low, especially on non-systolic platforms like Intel® Core™ Ultra. When facing this low-accuracy issue, users need to manually set the ACTIVATION_SCALING_FACTOR config option to a value of 8.0 in the compile_model() function (see the sketch below). From the next release, the scaling factor value will be applied automatically through an updated IR.
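A minimal sketch of the workaround from Python, assuming the config key is passed as the plain string named above (verify the exact property name against the GPU plugin documentation); the model path is a placeholder:

```python
import openvino as ov

core = ov.Core()
model = core.read_model("llm/openvino_model.xml")  # placeholder path

# Workaround for low LLM accuracy on non-systolic GPU platforms:
# pass the scaling factor named in this issue as a compile-time config option.
compiled = core.compile_model(model, "GPU", {"ACTIVATION_SCALING_FACTOR": 8.0})
```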
Component: GenAI
ID: 156437, 148933
Description:
When using the Python GenAI APIs, if ONNX 1.17.0 or later is installed, you may encounter the error “DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.” This is caused by the ONNX dependency issue onnx/onnx#6267. Install the latest supported Microsoft Visual C++ Redistributable to fix the issue.
Component: GenAI
ID: 156944
Description:
There were backward-incompatible changes resulting in different text being generated by LLMs like Mistralai/Mistral-7B-Instruct-v0.2 and TinyLlama/TinyLlama-1.1B-Chat-v1.0 when using a tokenizer converted with an older version of openvino_tokenizers. The way to resolve the issue is to convert the tokenizer and detokenizer models using the latest openvino_tokenizers.