Model Performance Data for Intel® Gaudi® AI Accelerators
Unless noted, performance numbers are measured using the latest Intel Gaudi software (SynapseAI) release, version 1.16.0-526.
Note: All models for training and inference use the PyTorch* v2.2.2 framework. Other applicable frameworks used for training or inference are noted for each model.
Intel® Gaudi® 2 with MLPerf* v4.0
Model | #HPU | Precision | Performance | Framework Version |
---|---|---|---|---|
MLPerf4.0 Llama 2 70B Server | 8 | fp8 | 6222.9 token/sec | PyTorch 2.2.2 |
MLPerf4.0 Llama 2 70B Offline | 8 | fp8 | 7808 token/sec | PyTorch 2.2.2 |
MLPerf4.0 Stable Diffusion XL Server | 8 | fp8 | 6.25 queries/sec | |
MLPerf4.0 Stable Diffusion XL Offline | 8 | fp8 | 6.45 samples/sec (620.15 ms) | |
Intel Gaudi 2 Large Language Models for Throughput
Model | #HPU | Precision | Input Length | Output Length | Throughput | Batch | Framework Version |
---|---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 128 | 13163 tokens/sec | 1230 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4777 tokens/sec | 163 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1291 tokens/sec | 94 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1943 tokens/sec | 81 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2727 tokens/sec | 1750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 4 | fp8 | 128 | 2048 | 7422 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 276 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 958 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.11.1 |
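Throughput rows like the ones above are typically produced with the text-generation example script from the optimum-habana repository. The sketch below is illustrative only: the script name and flag names are assumed from that repository, and fp8 runs additionally require a quantization configuration that is not shown here.

```shell
# Illustrative sketch only: flags assumed from the optimum-habana
# text-generation example. Model, batch size, and token counts are set to
# match one table row above (LLaMA 2 7B, 128 input / 128 output tokens).
python run_generation.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --bf16 \
  --use_hpu_graphs \
  --use_kv_cache \
  --max_input_tokens 128 \
  --max_new_tokens 128 \
  --batch_size 1230
```

Reproducing a different row means swapping in that row's model, batch size, and input/output lengths.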
Intel Gaudi 2 Large Language Models for Low Latency
Model | #HPU | Precision | Input Length | Latency | Batch | Framework Version |
---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 8.19 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 56.97 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 8 | fp8 | 128 | 24.33 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 8 | fp8 | 2048 | 122 ms | 1 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 10.8 ms | 1 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 92 ms | 1 | Optimum Habana 1.11.1 |
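At batch size 1, the latency figures above are per generated token, so single-stream throughput is simply the reciprocal. A quick sanity check (value taken from the LLaMA 2 7B row above):

```python
# At batch size 1, single-stream decode throughput (tokens/sec) is the
# reciprocal of the average next-token latency.
def tokens_per_sec(next_token_latency_ms: float) -> float:
    return 1000.0 / next_token_latency_ms

# LLaMA 2 7B at 8.19 ms/token corresponds to roughly 122 tokens/sec
# for a single request stream.
assert round(tokens_per_sec(8.19), 1) == 122.1
```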
Intel Gaudi 2 Reference Models
Model | #HPU | Precision | Throughput | Latency† | Batch | Framework Version |
---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512)** | 1 | bf16 | 1.23 img/sec | 813 ms | 1 | Lightning 2.2.0 |
Stable Diffusion v2.1 (768X768)** | 1 | bf16 | 0.4 img/sec | 2500 ms | 1 | Lightning 2.2.0 |
Bert FT (torch.compile) | 1 | bf16 | 806 token/sec | 29.77 ms | 24 | |
Resnet50 (torch.compile) | 1 | bf16 | 17172.69 img/sec | 14.9 ms | 256 | |
Resnext101 | 1 | bf16 | 10670 img/sec | 23.99 ms | 256 | |
Unet2D | 1 | bf16 | 7483 img/sec | 8.55 ms | 64 | Lightning 2.2.4 |
Unet3D | 1 | bf16 | 128 img/sec | 15.62 ms | 2 | Lightning 2.2.4 |
Hugging Face* Optimum with Intel Gaudi 2
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub*).
Model | #HPU | Precision | Input Length | Output Length | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|---|---|
Llama 2-7B (torch.compile) | 1 | bf16 | 128 | 128 | 5820 token/sec | 51.54 ms | 300 | text-generation | Optimum Habana 1.11.1 |
Falcon 180B | 8 | bf16 | 128 | 2048 | 700 token/sec | 57.14 ms | 40 | text-generation | Optimum Habana 1.11.1 |
Falcon-40B 2048 Tokens | 8 | bf16 | 128 | 2048 | 92.34 token/sec | 10.82 ms | 1 | text-generation | Optimum Habana 1.11.1 |
Falcon-7B 8192 Tokens | 1 | bf16 | 128 | 8192 | 118.19 token/sec | 8.46 ms | 1 | text-generation | Optimum Habana 1.11.1 |
GPT-J | 8 | bf16 | 128 | 100 | 628.74 token/sec | 6.36 ms | 4 | text-generation | Optimum Habana 1.11.1 |
StableLM-3B | 1 | bf16 | 128 | 2048 | 250 token/sec | 4 ms | 1 | text-generation | Optimum Habana 1.11.1 |
StableLM-7B | 1 | bf16 | 128 | 2048 | 128 token/sec | 7.81 ms | 1 | text-generation | Optimum Habana 1.11.1 |
MPT-7B | 1 | bf16 | 128 | 1932 | 121 token/sec | 8.26 ms | 1 | text-generation | Optimum Habana 1.11.1 |
Bloomz | 8 | bf16 | 128 | 100 | 36.78 token/sec | 27.18 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
StarCoder | 1 | bf16 | 100 | 100 | 65 token/sec | 15.38 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
OPT | 1 | bf16 | 100 | 100 | 1120 token/sec | 0.89 ms | 1 | text-generation | Optimum Habana 1.11.1 |
T5-3B Summarization 1024-128 Beam4 | 1 | bf16 | 1024 | 128 | 0.94 token/sec | 1063.82 ms | 1 | summarization | Optimum Habana 1.11.1 |
Bert (Text Classification) | 1 | bf16 | 128 | | 2125 token/sec | 3.76 ms | 8 | text-classification | Optimum Habana 1.11.1 |
Bert (Language Modeling) | 1 | bf16 | | | 66.64 token/sec | 60.02 ms | 4 | language-modeling | Optimum Habana 1.11.1 |
Bert (Question Answering) | 1 | bf16 | 384 | | 613 token/sec | 13.05 ms | 8 | question-answering | Optimum Habana 1.11.1 |
StableDiffusion v2.1 (512x512) | 1 | bf16 | | | 1.33 images/sec | 3007.51 ms | 4 | stable-diffusion | PyTorch Lightning 2.2.4 |
Bart | 1 | bf16 | | | 6.79 token/sec | 294.55 ms | 2 | summarization | Optimum Habana 1.11.1 |
BridgeTower | 1 | bf16 | | | 321 token/sec | 49.84 ms | 16 | contrastive-image-text | Optimum Habana 1.11.1 |
ESMFold | 1 | bf16 | | | 2.97 token/sec | 336.7 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
T5-3B Summarization Greedy | 1 | bf16 | | | 2.46 token/sec | 406.5 ms | 1 | summarization | Optimum Habana 1.11.1 |
HF-T5-Small-Translation-Greedy | 1 | bf16 | | | 30.85 token/sec | 129.65 ms | 4 | translation | Optimum Habana 1.11.1 |
Wav2vec (Audio Classification) | 1 | bf16 | | | 1002 token/sec | 3.99 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
Wav2vec (Speech Recognition) | 1 | bf16 | | | 16.62 token/sec | 240.67 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
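The Latency column in the text-generation rows above is consistent with latency ≈ 1000 × batch / throughput, i.e. the average next-token latency implied by the reported tokens/sec. A small consistency check against three rows:

```python
# Reported latency (ms) matches 1000 * batch / throughput for the
# Optimum Habana text-generation rows: it is the average next-token
# latency implied by the aggregate tokens/sec figure.
def latency_ms(throughput_tok_per_s: float, batch: int) -> float:
    return 1000.0 * batch / throughput_tok_per_s

rows = [
    # (throughput tokens/sec, batch, reported latency ms)
    (5820.0, 300, 51.54),   # Llama 2-7B (torch.compile)
    (92.34, 1, 10.82),      # Falcon-40B 2048 Tokens
    (0.94, 1, 1063.82),     # T5-3B Summarization 1024-128 Beam4
]
for tput, batch, reported in rows:
    assert abs(latency_ms(tput, batch) - reported) < 0.1
```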
Intel Gaudi Reference Models
Model | #HPU | Precision | Throughput | Latency | Batch Size | Framework Version |
---|---|---|---|---|---|---|
Bert | 1 | bf16 | 154.1 token/sec | 155.74 ms | 24 | |
Unet2D | 1 | bf16 | 3730 img/sec | 17.15 ms | 64 | Lightning 2.2.4 |
Unet3D | 1 | bf16 | 64.1 img/sec | 31.2 ms | 2 | Lightning 2.2.4 |
Hugging Face Optimum with Intel Gaudi
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub).
Model | #HPU | Precision | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|
HF Bert (Language Modeling) | 1 | bf16 | | | 4 | language-modeling | Optimum Habana 1.11.1 |
HF Bert (Question Answering) | 1 | bf16 | 127.7 token/sec | 62.64 ms | 8 | question-answering | Optimum Habana 1.11.1 |
HF Bert (Text Classification) | 1 | bf16 | 434.4 token/sec | 18.41 ms | 8 | text-classification | Optimum Habana 1.11.1 |
HF Bart-Greedy | 1 | bf16 | 3.1 token/sec | 645.16 ms | 2 | summarization | Optimum Habana 1.11.1 |
HF ESMFold | 1 | bf16 | 13.9 token/sec | 71.94 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
HF StableDiffusion V2-1 (512x512) | 1 | bf16 | 0.4 img/sec | 10000 ms | 4 | text-to-image generation | Optimum Habana 1.11.1 |
HF-T5-Small-Translation-Greedy | 1 | bf16 | 16.8 token/sec | 238.09 ms | 4 | translation | Optimum Habana 1.11.1 |
HF Wav2vec (Audio Classification) | 1 | bf16 | 494.6 token/sec | 8.08 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
HF Wav2vec (Speech Recognition) | 1 | bf16 | 9.5 token/sec | 421.05 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
** These models used the previous 1.15.0 software release.
† For large language inference models, this is the average next-token latency.
System Configuration
Intel® Gaudi® Platform
System: HLS-1 with eight Intel® Gaudi® platform HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs at 2.70 GHz, and 756 GB of system memory
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.16.0-526
- PyTorch*: Models run with PyTorch v2.2.2 use this Docker* image.
- Environment: These workloads run in Docker images directly on the host operating system.
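A typical container launch looks like the sketch below. This is illustrative only: the image tag is assumed from Habana's published naming scheme for the 1.16.0 release with PyTorch 2.2.2, and the runtime flags follow the Gaudi Docker setup documentation.

```shell
# Illustrative only: launch the Gaudi PyTorch container (image tag assumed
# from Habana's gaudi-docker naming scheme for release 1.16.0 / PyTorch 2.2.2).
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
```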
For each model’s support and validation coverage, see Model-References on GitHub. All information provided there is subject to change without notice. Your costs and results may vary.