Intel® Distribution of OpenVINO™ Toolkit Release Notes

ID 780177
Updated 7/31/2024
Version
Public

What’s new

More Gen AI coverage and framework integrations to minimize code changes.

  • OpenVINO™ pre-optimized models are now available on Hugging Face, making it easier for developers to get started with these models. 

Broader Large Language Model (LLM) support and more model compression techniques. 

  • Significant improvement in LLM performance on Intel built-in and discrete GPUs with the addition of dynamic quantization, Multi-Head Attention (MHA), and oneDNN enhancements. 

More portability and performance to run AI at the edge, in the cloud, or locally.  

  • Improved CPU performance when serving LLMs with the inclusion of vLLM and continuous batching in the OpenVINO™ Model Server (OVMS). vLLM is an easy-to-use open-source library that supports efficient LLM inferencing and model serving. 

OpenVINO™ Runtime

Common

  • OpenVINO may now be used as a backend for vLLM, offering better CPU performance due to fully connected layer optimization, fusing of multiple fully connected layers (MLP), a U8 KV cache, and dynamic split fuse (see the usage sketch after this list). 
  • The following have been improved: 
    • Support for models like YOLOv10 or PixArt-XL-2, thanks to enabling the Squeeze and Concat layers. 
    • Performance of precision conversion from fp16/bf16 to fp32. 
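As a hedged illustration of the vLLM integration mentioned above, the sketch below runs offline text generation through vLLM's standard Python API; the model name is a placeholder, and backend and device selection depend on how vLLM was installed and configured for OpenVINO.

    from vllm import LLM, SamplingParams

    # Placeholder model id; any Hugging Face causal LM supported by vLLM works.
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    params = SamplingParams(temperature=0.8, max_tokens=128)

    # Offline generation; with the OpenVINO backend this path benefits from the
    # fully connected layer fusion, U8 KV cache, and dynamic split fuse listed above.
    for output in llm.generate(["What is OpenVINO?"], params):
        print(output.outputs[0].text)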

AUTO Inference Mode

  • The model cache is now disabled for CPU acceleration even when cache_dir is set, because the CPU acceleration stage is skipped once a cached model is ready for the target device on the second run. A minimal sketch of the related configuration follows. 
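The sketch below only illustrates the affected configuration; the model path and cache directory are placeholders.

    import openvino as ov

    core = ov.Core()
    core.set_property({"CACHE_DIR": "model_cache"})  # placeholder cache directory

    # AUTO may start inference on CPU while GPU compilation finishes; once a cached
    # blob exists for the target device, that CPU stage is skipped on the next run,
    # so the CPU variant itself is no longer written to the cache.
    compiled = core.compile_model("model.xml", "AUTO:GPU,CPU")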

Heterogeneous Inference Mode 

  • The PIPELINE_PARALLEL policy is now available (a preview feature), enabling inference of large models across multiple devices according to their available memory. It is especially useful for large language models that do not fit into a single discrete GPU. A minimal sketch follows. 
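A minimal sketch of the preview feature is shown below; the model path is a placeholder, and the string property key/value pair is an assumption based on the model distribution policy hint described here.

    import openvino as ov

    core = ov.Core()
    model = core.read_model("llm.xml")  # placeholder IR path

    # Split one large model across two GPUs according to their available memory
    # (preview). Property naming is assumed; adjust to your OpenVINO version.
    compiled = core.compile_model(
        model,
        "HETERO:GPU.0,GPU.1",
        {"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"},
    )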

CPU Device Plugin

  • Fully Connected layers have been optimized, together with a JIT-kernel RoPE optimization, to improve performance for LLM serving workloads on Intel AMX platforms. 
  • Dynamic quantization of Fully Connected layers is now enabled by default on Intel AVX2 and AVX512 platforms, improving out-of-the-box performance for 8-bit/4-bit weight-compressed LLMs (see the tuning sketch after this list). 
  • Performance has been improved for: 
    • ARM server configuration, due to migration to Intel® oneAPI Threading Building Blocks 2021.13. 
    • ARM for FP32 and FP16. 
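Since dynamic quantization is now on by default, no code change is needed to benefit from it; the sketch below only illustrates how the behavior could be tuned through the dynamic quantization group size hint. The string property key and the model path are assumptions for illustration.

    import openvino as ov

    core = ov.Core()

    # Placeholder: a weight-compressed (8-bit/4-bit) LLM IR. Setting the group size
    # is assumed to tune the granularity of dynamic quantization; 0 disables it.
    compiled = core.compile_model(
        "llm_int4.xml",
        "CPU",
        {"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},
    )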

GPU Device Plugin

  • Performance has been improved for: 
    • LLMs and Stable Diffusion on discrete GPUs, thanks to reduced latency achieved through optimizations such as Multi-Head Attention (MHA) and oneDNN improvements. 
    • First-token latency of LLMs for large-input cases on Core Ultra integrated GPUs; it can be improved further by enabling dynamic quantization through the application interface. 
    • Whisper models on discrete GPU. 

NPU Device Plugin

  • The GenAI API now supports LLMs on NPU (a preview feature; requires the most recent version of the NPU driver). The additional features needed to support LLMs are now part of the NPU plugin. 
  • Memory optimizations: copying of weights from the NPU compiler adapter has been removed, improving both memory use and first-inference latency for model inference on NPU. 
  • Support has been added for models larger than 2 GB on both the NPU driver and the NPU plugin side (Linux and Windows). 

OpenVINO Python API

  • visit_attributes is now available for custom operations implemented in Python, so you may pass a dictionary of attributes, e.g. {"name1": value1, "name2": value2, ...}, instead of multiple on_attribute methods (as in C++). See the sketch after this list. 
  • ReadValue and NodeFactory can now be used in a wider range of use cases, reducing code complexity. 
  • Kwargs overloading is now supported. 
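A hedged sketch of a Python custom operation using visit_attributes is shown below. The Op subclass boilerplate follows the usual custom-operation pattern and may need adjusting to your OpenVINO version; the on_attributes call reflects the dictionary form described above.

    from openvino.runtime import Op

    class ScaledIdentity(Op):
        def __init__(self, inputs=None, alpha=1.0):
            super().__init__(self)
            self.alpha = alpha
            if inputs is not None:
                self.set_arguments(inputs)
                self.constructor_validate_and_infer_types()

        def validate_and_infer_types(self):
            # Output keeps the input element type and shape.
            self.set_output_type(0, self.get_input_element_type(0),
                                 self.get_input_partial_shape(0))

        def clone_with_new_inputs(self, new_inputs):
            return ScaledIdentity(new_inputs, self.alpha)

        def visit_attributes(self, visitor):
            # One dictionary of attributes instead of multiple on_attribute calls.
            visitor.on_attributes({"alpha": self.alpha})
            return True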

OpenVINO C API (N/A for 24.3) 

OpenVINO Node.js API 

  • Tokenizers and StringTensor are now supported for LLM inference. 
  • Compatibility with electron.js is now restored for desktop application developers. 
  • An async version of Core.import_model and enhancements to the Core.read_model method are now available, for more efficient model reading, especially for LLMs. 

ONNX Framework Support 

  • More models are now supported: 
    • Models using the new version of the ReduceMean operation (introduced in ONNX opset 18). 
    • Models using the Multinomial operation (introduced in ONNX opset 7). 

TensorFlow Framework Support

  • Performance of models with keras.LSTM operations on CPU has been improved. 
  • Tensor lists initialized with an undefined element shape value are now supported. 
  • Support has been added for three new operations: HSVToRGB, AdjustHue, and AdjustSaturation. 

TensorFlow Lite Framework Support

  • Constants containing sparse tensors are now supported. 

PyTorch Framework Support

  • Setting types/shapes for nested structures (e.g., dictionaries and tuples) is now supported. 
  • The aten::layer_norm operation has been updated to support dynamic shape normalization. 
  • Dynamic shapes support in the FX graph has been improved, benefiting torch.compile and torch.export based applications and improving performance for the Gemma and ChatGLM model families (a minimal torch.compile sketch follows this list). 
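As a hedged illustration of the torch.compile path mentioned above, the sketch below compiles a toy module with the OpenVINO backend; importing openvino.torch registers the backend.

    import torch
    import openvino.torch  # registers the "openvino" backend for torch.compile

    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

    # FX-graph based compilation; the dynamic-shape improvements benefit this path.
    compiled = torch.compile(model, backend="openvino")
    print(compiled(torch.randn(1, 16)).shape)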

OpenVINO Model Server

  • The following has been improved in OpenAI API text generation (a client request sketch follows these lists): 
    • Performance, thanks to OpenVINO Runtime and sampling algorithm improvements. 
    • Reporting of generation engine metrics in the logs. 
    • Extra sampling parameters have been added. 
    • Request parameters affecting memory consumption now have value restrictions, within a configurable range. 
  • The following has been fixed in OpenAI API text generation: 
    • Streamer responses impacted by incomplete UTF-8 sequences. 
    • A sporadic generation hang. 
    • Incompatibility of the last response from the completions endpoint stream with the vLLM benchmarking script. 
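A hedged client-side sketch of the OpenAI-compatible text generation endpoint is shown below, using the standard openai Python client. The base URL, port, and model name are placeholders for a local OVMS deployment and depend on its configuration.

    from openai import OpenAI

    # Placeholders: adjust to your Model Server address and served model name.
    client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

    response = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
        max_tokens=128,
        temperature=0.7,
    )
    print(response.choices[0].message.content)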

Neural Network Compression Framework

  • The MXFP4 data format is now supported in the Weight Compression method, compressing weights to 4-bit with the e2m1 data type, without a zero point, and with 8-bit e8m0 scales. This feature is enabled by setting mode=CompressWeightsMode.E2M1 in nncf.compress_weights() (see the sketch after this list). 
  • The AWQ algorithm in the Weight Compression method has been extended to the patterns Act->MatMul and Act->Multiply->MatMul, to cover the Phi family of models. 
  • The representation of symmetrically quantized weights has been updated to a signed data type with no zero point. This allows NPU to support compressed LLMs with the symmetric mode. 
  • bf16 models are now supported in Post-Training Quantization (nncf.quantize()). 
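A minimal sketch of the MXFP4 weight compression call is shown below; the IR paths are placeholders.

    import nncf
    import openvino as ov

    model = ov.Core().read_model("llm_fp16.xml")  # placeholder IR path

    # Compress weights to 4-bit e2m1 (no zero point) with 8-bit e8m0 scales.
    compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.E2M1)
    ov.save_model(compressed, "llm_mxfp4.xml")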

OpenVINO Tokenizers 

  • The following is now supported: 
    • Full Regex syntax with the PCRE2 library for text normalization. 
    • Left padding side for all tokenizer types. 
  • GLM-4 tokenizer support, as well as detokenization for Phi-3 and Gemma, has been improved (a conversion sketch follows this list). 
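A hedged conversion sketch is shown below; GLM-4 is used only as an example of the newly supported tokenizers, and the model id and output paths are illustrative.

    from transformers import AutoTokenizer
    from openvino_tokenizers import convert_tokenizer
    import openvino as ov

    # Convert a Hugging Face tokenizer into OpenVINO tokenizer/detokenizer models.
    hf_tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
    ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
    ov.save_model(ov_tokenizer, "openvino_tokenizer.xml")
    ov.save_model(ov_detokenizer, "openvino_detokenizer.xml")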

Other Changes and Known Issues

Jupyter Notebooks

  • Stable Audio 
  • Phi-3-vision 

OpenVINO.GenAI 

  • Performance counters have been added. 
  • Preview support for NPU is now available (a minimal sketch follows). 
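A minimal sketch of the NPU preview path through the GenAI API is shown below; the model directory is a placeholder for an exported OpenVINO LLM, and a recent NPU driver is required.

    import openvino_genai

    # Placeholder path to a directory with an OpenVINO-exported LLM and its tokenizer.
    pipe = openvino_genai.LLMPipeline("TinyLlama-1.1B-Chat-ov", "NPU")
    print(pipe.generate("What is OpenVINO?", max_new_tokens=64))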

Hugging Face  

  • OpenVINO pre-optimized models are now available on Hugging Face: 

Deprecation And Support

Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to the OpenVINO Legacy Features and Components page. 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. 

You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. 

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. 

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. 

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.