Runtime Instructions
The following are the instructions needed to set up the node, the model infrastructure, and the full runtime for the model.
Accessing the Intel Gaudi Node
To access an Intel Gaudi node in the Intel Tiber AI Cloud, go to the Intel Tiber AI Cloud Console, select the Intel Gaudi 2 platform for deep learning from the hardware instances, and follow the steps to start and connect to the node.
The website will provide an ssh command to log in to the node. It is advisable to add local port forwarding to that command so you can access a local Jupyter Notebook, for example: ssh -L 8888:localhost:8888 ..
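As a concrete sketch, if the console gives you a command of the form ssh -i <key_file> <user>@<node_address>, the port-forwarded version would be as follows (the key file, user, and address are placeholders; copy the actual values from your console):
ssh -L 8888:localhost:8888 -i <key_file> <user>@<node_address>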
Docker* Setup
Now that you have access to the node, start the latest Docker image with the docker run command, which automatically downloads the image and launches the container:
docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
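Since the container was started detached (the -itd flags), you can optionally confirm it is up before entering it:
docker ps --filter name=Gaudi_Docker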
The container is already running in the background, so we enter the Docker environment by issuing the following command:
docker exec -it Gaudi_Docker bash
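Once inside the container, you can optionally check that the Gaudi accelerator cards are visible; the hl-smi utility (Habana's analogue of nvidia-smi) should list the node's HPUs, eight on a typical Gaudi 2 deep learning instance:
hl-smi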
Model Setup
Now that we're running inside the Docker environment, we can install the remaining libraries and model repositories.
Start in the home directory and install the Habana DeepSpeed* library; DeepSpeed is used to reduce memory consumption on Intel Gaudi while running large language models.
cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
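A quick way to confirm the Habana DeepSpeed fork installed correctly is to import it and print its version:
python -c "import deepspeed; print(deepspeed.__version__)"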
Now install the Hugging Face Optimum for Intel Gaudi library and clone the GitHub examples. We're selecting the latest validated release of optimum-habana:
pip install optimum-habana==1.14.1
git clone -b v1.14.1 https://github.com/huggingface/optimum-habana
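Optionally confirm that the installed package version matches the cloned tag:
pip show optimum-habana | grep Version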
Then change to the text-generation example directory and install the final set of requirements needed to run the model:
cd ~/optimum-habana/examples/text-generation
pip install -r requirements.txt
pip install -r requirements_lm_eval.txt
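Before requesting the gated Llama 2 weights, you can optionally smoke-test the setup with a small, openly available model. This invocation follows the patterns documented in the optimum-habana text-generation README, so the exact flags may vary slightly between releases:
python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --bf16 --max_new_tokens 32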
How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement. For guidance on the intended use of the Llama 2 model, what is considered misuse or out of scope, who the intended users are, and any additional terms, please review the license. Users bear sole liability and responsibility to follow and comply with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users' use of, or compliance with, third-party licenses.
To run gated models such as Llama-2-70b-hf, you need the following:
- Have a Hugging Face account and agree to the model's terms of use on its model card on the Hugging Face Hub
- Create a read token and request access to the Llama 2 model from meta-llama
- Login to your account using the Hugging Face CLI:
huggingface-cli login --token <your_hugging_face_token_here>
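You can verify the login with huggingface-cli whoami. Once your access request is approved, a representative multi-card launch of Llama-2-70b-hf, adapted from the optimum-habana text-generation README (adjust --world_size to the number of cards on your node; exact flags may vary between releases), looks like this:
cd ~/optimum-habana/examples/text-generation
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --bf16 --max_new_tokens 100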
If you prefer to run these steps with the associated Jupyter Notebook for inference, see the running and fine-tuning addendum section for Jupyter Notebook setup; you can then run the steps directly in the Jupyter interface.