Intel® Distribution of OpenVINO™ Toolkit Release Notes

ID 780177
Updated 10/17/2024
Version: Public


What’s new

 

More Gen AI coverage and framework integrations to minimize code changes.

 

  • Llama 3 optimizations for CPUs, built-in GPUs, and discrete GPUs for improved performance and efficient memory usage.

  • Support for Phi-3-mini, a family of AI models that leverages the power of small language models for faster, more accurate and cost-effective text processing.

  • Python Custom Operation is now enabled in OpenVINO, making it easier for Python developers to code their custom operations instead of using C++ custom operations (which are also supported). Python Custom Operation empowers users to implement their own specialized operations in any model.

  • Notebook coverage has been expanded for new models. Noteworthy notebooks added: DynamiCrafter, YOLOv10, a chatbot notebook with Phi-3, and QWEN2.
     

Broader Large Language Model (LLM) support and more model compression techniques.

 

  • GPTQ method for 4-bit weight compression added to NNCF for more efficient inference and improved performance of compressed LLMs.

  • Significant LLM performance improvements and reduced latency for both built-in GPUs and discrete GPUs.

  • Significant improvement in 2nd token latency and memory footprint of FP16 weight LLMs on AVX2 (13th Gen Intel® Core™ processors) and AVX512 (3rd Gen Intel® Xeon® Scalable Processors) based CPU platforms, particularly for small batch sizes.
     

More portability and performance to run AI at the edge, in the cloud, or locally.

 

  • Model Serving Enhancements:

    • Preview: OpenVINO Model Server (OVMS) now supports an OpenAI-compatible API along with Continuous Batching and PagedAttention, enabling significantly higher throughput for parallel inferencing, especially on Intel® Xeon® processors, when serving LLMs to many concurrent users.

    • OpenVINO backend for Triton Server now supports built-in GPUs and discrete GPUs, in addition to dynamic shapes support.

    • Integration of TorchServe through torch.compile OpenVINO backend for easy model deployment, provisioning to multiple instances, model versioning, and maintenance.

  • Preview: addition of the Generate API, a simplified API for text generation using large language models with only a few lines of code. The API is available through the newly launched OpenVINO GenAI package (see the sketch after this list).

  • Support for Intel Atom® Processor X Series. For more details, see System Requirements.

  • Preview: Support for Intel® Xeon® 6 processor.
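
For reference, a minimal sketch of the Generate API mentioned above, assuming a model already converted to OpenVINO IR in a local directory (the directory name and prompt below are placeholders):

    # Requires the OpenVINO GenAI package: pip install openvino-genai
    import openvino_genai

    model_dir = "TinyLlama-1.1B-Chat-ov"  # placeholder path to an exported LLM

    # Build a text-generation pipeline on the chosen device ("CPU", "GPU", ...).
    pipe = openvino_genai.LLMPipeline(model_dir, "CPU")

    # Generate a completion with only a few lines of code.
    print(pipe.generate("What is OpenVINO?", max_new_tokens=100))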

OpenVINO™ Runtime

Common

  • Operations and data types using UINT2, UINT3, and UINT6 are now supported, to allow for more efficient LLM weight compression.

  • Common OV headers have been optimized, improving binary compilation time and reducing binary size.

AUTO Inference Mode

  • AUTO takes model caching into account when choosing the device for fast first-inference latency. If a model cache is already in place, AUTO will use the selected device directly instead of temporarily leveraging the CPU as the first-inference device (see the sketch after this list).

  • Dynamic models are now loaded to the selected device, instead of loading to CPU without considering device priority.

  • Fixed exceptions that occurred when using AUTO with stateful models that have dynamic input or output.
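
As mentioned above, a short sketch of how model caching interacts with AUTO device selection; the cache directory and IR file name are placeholders:

    import openvino as ov

    core = ov.Core()
    # Enable the model cache; once a cached blob exists for the selected device,
    # AUTO can use that device directly instead of starting inference on the CPU.
    core.set_property({"CACHE_DIR": "model_cache"})

    model = core.read_model("model.xml")  # placeholder IR file
    compiled = core.compile_model(model, "AUTO")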

CPU Device Plugin

  • Performance when using latency mode in FP32 precision has been improved on Intel client platforms, including Core Ultra (codename Meteor Lake) and 13th Gen Core processors (codename Raptor Lake).

  • 2nd token latency and memory footprint for FP16 LLMs have been improved significantly on AVX2 and AVX512 based CPU platforms, particularly for small batch sizes.

  • PagedAttention has been optimized on AVX2, AVX512 and AMX platforms together with INT8 KV cache support to improve the performance when serving LLM workloads on Intel CPUs.

  • LLMs with shared embeddings have been optimized to improve performance and reduce memory consumption on several models, including Gemma.

  • Performance on ARM-based servers is significantly improved with the upgrade to TBB 2021.2.5.

  • Improved FP32 and FP16 performance on ARM CPU.

GPU Device Plugin

  • Both first-token and average-token latency of LLMs have been improved on all GPU platforms, most significantly on discrete GPUs. Memory usage of LLMs has been reduced as well.

  • Stable Diffusion FP16 performance improved on Core Ultra platforms, with significant pipeline improvement for models with dynamic-shaped input. Memory usage of the pipeline has been reduced, as well.

  • Performance of the permute_f_y kernel has been improved.

NPU Device Plugin

  • A new set of configuration options is now available.

  • A performance increase has been unlocked with the new 2408 NPU driver.

OpenVINO Python API

  • Writing custom Python operators is now supported for basic scenarios (aligning with the OpenVINO C++ API). This empowers users to implement their own specialized operations in any model. Full support with more advanced features is within the scope of upcoming releases.
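
The sketch below illustrates the general shape of a custom Python operator, mirroring the C++ Op extension pattern; the base-class import and method signatures are assumptions for illustration and may differ from the exact API:

    from openvino import Op  # assumed location of the operation base class

    class Identity(Op):
        # Pass-through operation, used only to illustrate the structure.
        def __init__(self, inputs=None):
            super().__init__(self)
            if inputs is not None:
                self.set_arguments(inputs)
                self.constructor_validate_and_infer_types()

        def validate_and_infer_types(self):
            # The output keeps the input's element type and shape.
            self.set_output_type(0, self.get_input_element_type(0),
                                 self.get_input_partial_shape(0))

        def clone_with_new_inputs(self, new_inputs):
            return Identity(new_inputs)

        def evaluate(self, outputs, inputs):
            # Copy the input tensor to the output tensor.
            inputs[0].copy_to(outputs[0])
            return True

        def has_evaluate(self):
            return True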

OpenVINO C API

  • More element types are now supported to align with the OpenVINO C++ API.

OpenVINO Node.js API

  • OpenVINO Node.js packages now support the Electron.js framework.

  • Extended and improved JS API documentation for more complete usage guidelines.

  • Better JS API alignment with OpenVINO C++ API, delivering more advanced features to JS users.

TensorFlow Framework Support

  • 3 new operations are now supported. See operations marked as NEW here.

  • LookupTableImport has received better support, required for 2 models from TF Hub:

    • mil-nce

    • openimages-v4-ssd-mobilenet-v2

TensorFlow Lite Framework Support

  • The GELU operation, required for a customer model, is now supported.

PyTorch Framework Support

  • 9 new operations are now supported.

  • aten::set_item now supports negative indices.

  • Issue with adaptive pool when shape is list has been fixed (PR #24586).

ONNX Support

  • The InputModel interface should be used from now on, instead of a number of deprecated APIs and class symbols.

  • Translation for ReduceMin-18 and ReduceSumSquare-18 operators has been added, to address customer model requests.

  • Behavior of the Gelu-20 operator has been fixed for the case when “none” is set as the default value.

OpenVINO Model Server

  • OpenVINO Model Server can now be used for text generation use cases through an OpenAI-compatible API (a client sketch follows this list).

  • Added support for the continuous batching and PagedAttention algorithms, enabling fast and efficient text generation under high-concurrency loads, especially on Intel® Xeon® processors.
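
For illustration, a minimal client-side sketch of a chat completion request against the OpenAI-compatible endpoint; the host, port, endpoint path, and model name are assumptions that depend on how the server was deployed:

    import requests

    # Assumed server address and endpoint path; adjust to your OVMS deployment.
    url = "http://localhost:8000/v3/chat/completions"

    payload = {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "What is OpenVINO?"}],
        "max_tokens": 100,
    }

    response = requests.post(url, json=payload)
    print(response.json()["choices"][0]["message"]["content"])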

Neural Network Compression Framework

  • The GPTQ method is now supported in nncf.compress_weights() for data-aware 4-bit weight compression of LLMs. Enabled by gptq=True in nncf.compress_weights() (see the sketch after this list).

  • Scale Estimation algorithm for more accurate 4-bit compressed LLMs. Enabled by scale_estimation=True in nncf.compress_weights().

  • Added support for models with bf16 weights in nncf.compress_weights().

  • nncf.quantize() method is now the recommended path for quantization initialization of PyTorch models in Quantization-Aware Training. See example for more details.

  • The compressed_model.nncf.get_config() and nncf.torch.load_from_config() APIs have been added to save and restore quantized PyTorch models. See example for more details.

  • Automatic support for INT8 quantization of PyTorch models with custom modules has been added. It is no longer necessary to register such modules before quantization.
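
A sketch of data-aware 4-bit weight compression with the new GPTQ option; the model object and calibration data below are placeholders for your own objects:

    import nncf

    # `ov_model` is an openvino.Model and `calibration_items` is an iterable of
    # model inputs; both are placeholders.
    dataset = nncf.Dataset(calibration_items)

    compressed_model = nncf.compress_weights(
        ov_model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        dataset=dataset,   # calibration data is required for data-aware methods
        gptq=True,         # enable the GPTQ method
    )

    # Alternatively, passing scale_estimation=True enables the Scale Estimation
    # algorithm for more accurate 4-bit compressed LLMs.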

Other Changes and Known Issues

Jupyter Notebooks

Known Issues

Component: TBB

ID: TBB-1400 / TBB-1401

Description:

In 2024.2, oneTBB 2021.2.x is used for Intel Distribution of OpenVINO Ubuntu and Red Hat archives, instead of the system TBB/oneTBB. This improves performance on the new generation of Intel® Xeon® platforms but may increase latency of some models on the previous generation. You can build OpenVINO with -DSYSTEM_TBB=ON to get better latency performance for these models.

Component: python API

ID: CVS-141744

Description:

A problem related to custom operations was found during post-commit tests. A fix is ready and will be delivered with the 2024.3 release.

- Initial problem: test_custom_op hung on destruction because it was waiting for a thread that was trying to acquire the GIL.

- The second problem is that pybind11 does not allow working with the GIL outside of the current scope, and it is impossible to release the GIL in destructors (see Blocking destructors and the GIL, pybind/pybind11#1446).

- The current solution allows releasing the GIL for InferRequest and all destructors called along the chain.

Component: CPU runtime

ID: MFDNN-11428

Description:

Due to the adoption of a new oneDNN library, which improves performance for most use cases (particularly for AVX2 BRGEMM kernels with the latency hint), the following regressions may be noticed:

a. Latency regression on certain models, such as unet-camvid-onnx-0001 and mask_rcnn_resnet50_atrous_coco, in latency mode on MTL Windows.

b. Performance regression on Intel client platforms when the throughput hint is used.

The issue is being investigated and is planned to be resolved in upcoming releases.

Component: Hardware Configuration

ID: N/A

Description:

Reduced performance for LLMs may be observed on newer CPUs. To mitigate, modify the default settings in BIOS to change the system into a 2-NUMA-node system:

1. Enter the BIOS configuration menu.

2. Select EDKII Menu -> Socket Configuration -> Uncore Configuration -> Uncore General Configuration -> SNC.

3. The SNC setting is set to AUTO by default. Change the SNC setting to disabled to configure one NUMA node per processor socket upon boot.

4. After system reboot, confirm the NUMA node setting using: numactl -H. Expect to see only nodes 0 and 1 on a 2-socket system, with the following node distance mapping:

node   0   1
  0:  10  21
  1:  21  10

Deprecation And Support

Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to the OpenVINO Legacy Features and Components page.

Discontinued in 2024

  • Runtime components:

    • Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.

    • OpenVINO C++/C/Python 1.0 APIs (see 2023.3 API transition guide for reference).

    • All ONNX Frontend legacy API (known as ONNX_IMPORTER_API).

    • PerformanceMode.UNDEFINED property as part of the OpenVINO Python API.

  • Tools:

Deprecated and to be removed in the future

  • The OpenVINO™ Development Tools package (pip install openvino-dev) will be removed from installation options and distribution channels beginning with OpenVINO 2025.

  • Model Optimizer will be discontinued with OpenVINO 2025.0. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.

  • OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. It will be replaced with CPU binding configurations (ov::hint::enable_cpu_pinning); see the sketch after this list.

  • OpenVINO Model Server components:

    • “auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.

  • A number of notebooks have been deprecated. For an up-to-date listing of available notebooks, refer to the OpenVINO™ Notebook index (openvinotoolkit.github.io).
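
As a replacement for the deprecated Affinity API, a minimal sketch of enabling CPU pinning through the hint property; the IR file name is a placeholder:

    import openvino as ov
    import openvino.properties.hint as hints

    core = ov.Core()
    model = core.read_model("model.xml")  # placeholder IR file

    # CPU binding configuration that replaces the deprecated Affinity property.
    compiled = core.compile_model(model, "CPU", {hints.enable_cpu_pinning: True})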

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein.

You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Atom, Arria, Core, Movidius, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Other names and brands may be claimed as the property of others.

Copyright © 2024, Intel Corporation. All rights reserved.

For more complete information about compiler optimizations, see our Optimization Notice.