Model Performance Data for Intel® Gaudi® 3 AI Accelerators
These performance numbers were measured using Intel Gaudi software (SynapseAI*) release v1.18.0, unless otherwise noted.
Note: All models, for both training and inference, use the PyTorch* 2.4.0 framework. Other applicable frameworks used for training or inference are noted for each model.
Large Language Models (LLM) for Intel Gaudi 3 Accelerator
| Model | # HPU | Sequence Length | Precision | Batch Size | Throughput (tokens/sec) |
|---|---|---|---|---|---|
| LLaMA V2 7B | 8 | 4,096 | FP8 | 1,024 | 99,225 |
| LLaMA V2 13B | 16 | 4,096 | FP8 | 256 | 89,946 |
| LLaMA V2 70B | 64 | 4,096 | FP8 | 1,024 | 78,082 |
| LLaMA V3.1 8B | 8 | 8,192 | FP8 | 128 | 59,359 |
| LLaMA V3.1 70B | 64 | 8,192 | FP8 | 128 | 61,153 |
System Configuration
Intel Gaudi 3 Platform
System: HLS-Gaudi3 with eight Intel Gaudi 3 platform HL-325L mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.18.0
- PyTorch v2.4.0
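A minimal sketch of checking whether the software stack listed above is present in the current environment. `habana_frameworks` is the Intel Gaudi PyTorch bridge package; the check only probes importability, so it runs on machines without Gaudi hardware:

```python
import importlib.util

def gaudi_stack_status():
    """Return which of the packages from the software list above are
    importable in this environment (no accelerator hardware required)."""
    packages = ("torch", "habana_frameworks")
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# On a machine without the Gaudi software stack installed, the
# 'habana_frameworks' entry will simply be False.
status = gaudi_stack_status()
print(status)
```

On a properly provisioned Gaudi system both entries should be `True`; version pinning (e.g. confirming PyTorch v2.4.0) would additionally compare `torch.__version__` against the listed release.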
Stay Informed
Register for the latest Intel Gaudi AI accelerator developer news, events, training, and updates.