Model Performance Data for Intel® Gaudi® AI Accelerators
Unless noted, performance numbers are measured using the latest Intel Gaudi software (SynapseAI) release, version 1.16.0-526.
Note: All models for training and inference use the PyTorch* v2.2.2 framework. Other applicable frameworks used for training or inference are noted for each model.
Intel® Gaudi® 2 with MLPerf* v4.0
Model | #HPU | Precision | Performance | Framework Version |
---|---|---|---|---|
MLPerf4.0 Llama 2 70B Server | 8 | fp8 | 6222.9 token/sec | PyTorch 2.2.2 |
MLPerf4.0 Llama 2 70B Offline | 8 | fp8 | 7808 token/sec | PyTorch 2.2.2 |
MLPerf4.0 Stable Diffusion XL Server | 8 | fp8 | 6.25 queries/sec | |
MLPerf4.0 Stable Diffusion XL Offline | 8 | fp8 | 6.45 samples/sec (620.15 ms) | |
Intel Gaudi 2 Large Language Models for Throughput
Model | #HPU | Precision | Input Length | Output Length | Throughput | Batch | Framework Version |
---|---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 128 | 13163 tokens/sec | 1230 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4777 tokens/sec | 163 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1291 tokens/sec | 94 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1943 tokens/sec | 81 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2727 tokens/sec | 1750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 4 | fp8 | 128 | 2048 | 7422 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 276 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 958 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.11.1 |
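Throughput rows like the ones above are typically produced with the text-generation example script from the optimum-habana repository. The sketch below is illustrative only: the script name and flag names are assumed from that repository, and fp8 runs additionally require a quantization configuration that is not shown here.

```shell
# Illustrative sketch only: flags assumed from the optimum-habana
# text-generation example. Model, batch size, and token counts are set to
# match one table row above (LLaMA 2 7B, 128 input / 128 output tokens).
python run_generation.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --bf16 \
  --use_hpu_graphs \
  --use_kv_cache \
  --max_input_tokens 128 \
  --max_new_tokens 128 \
  --batch_size 1230
```

Reproducing a different row means swapping in that row's model, batch size, and input/output lengths.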
Intel Gaudi 2 Large Language Models for Low Latency
Model | #HPU | Precision | Input Length | Latency | Batch | Framework Version |
---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 8.19 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 56.97 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 8 | fp8 | 128 | 24.33 ms | 1 | Optimum Habana 1.11.1 |
LLaMA 2 70B | 8 | fp8 | 2048 | 122 ms | 1 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 128 | 10.8 ms | 1 | Optimum Habana 1.11.1 |
Mistral 7B Instruct | 1 | fp8 | 2048 | 92 ms | 1 | Optimum Habana 1.11.1 |
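At batch size 1, the latency figures above are per generated token, so single-stream throughput is simply the reciprocal. A quick sanity check (value taken from the LLaMA 2 7B row above):

```python
# At batch size 1, single-stream decode throughput (tokens/sec) is the
# reciprocal of the average next-token latency.
def tokens_per_sec(next_token_latency_ms: float) -> float:
    return 1000.0 / next_token_latency_ms

# LLaMA 2 7B at 8.19 ms/token corresponds to roughly 122 tokens/sec
# for a single request stream.
assert round(tokens_per_sec(8.19), 1) == 122.1
```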
Intel Gaudi 2 Reference Models
Model | #HPU | Precision | Throughput | Latency† | Batch | Framework Version |
---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512)** | 1 | bf16 | 1.23 img/sec | 813 ms | 1 | Lightning 2.2.0 |
Stable Diffusion v2.1 (768X768)** | 1 | bf16 | 0.4 img/sec | 2500 ms | 1 | Lightning 2.2.0 |
Bert FT (torch.compile) | 1 | bf16 | 806 token/sec | 29.77 ms | 24 | |
Resnet50 (torch.compile) | 1 | bf16 | 17172.69 img/sec | 14.9 ms | 256 | |
Resnext101 | 1 | bf16 | 10670 img/sec | 23.99 ms | 256 | |
Unet2D | 1 | bf16 | 7483 img/sec | 8.55 ms | 64 | Lightning 2.2.4 |
Unet3D | 1 | bf16 | 128 img/sec | 15.62 ms | 2 | Lightning 2.2.4 |
Hugging Face* Optimum with Intel Gaudi 2
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub*).
Model | #HPU | Precision | Input Length | Output Length | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|---|---|
Llama 2-7B (torch.compile) | 1 | bf16 | 128 | 128 | 5820 token/sec | 51.54 ms | 300 | text-generation | Optimum Habana 1.11.1 |
Falcon 180B | 8 | bf16 | 128 | 2048 | 700 token/sec | 57.14 ms | 40 | text-generation | Optimum Habana 1.11.1 |
Falcon-40B 2048 Tokens | 8 | bf16 | 128 | 2048 | 92.34 token/sec | 10.82 ms | 1 | text-generation | Optimum Habana 1.11.1 |
Falcon-7B 8192 Tokens | 1 | bf16 | 128 | 8192 | 118.19 token/sec | 8.46 ms | 1 | text-generation | Optimum Habana 1.11.1 |
GPT-J | 8 | bf16 | 128 | 100 | 628.74 token/sec | 6.36 ms | 4 | text-generation | Optimum Habana 1.11.1 |
StableLM-3B | 1 | bf16 | 128 | 2048 | 250 token/sec | 4 ms | 1 | text-generation | Optimum Habana 1.11.1 |
StableLM-7B | 1 | bf16 | 128 | 2048 | 128 token/sec | 7.81 ms | 1 | text-generation | Optimum Habana 1.11.1 |
MPT-7B | 1 | bf16 | 128 | 1932 | 121 token/sec | 8.26 ms | 1 | text-generation | Optimum Habana 1.11.1 |
Bloomz | 8 | bf16 | 128 | 100 | 36.78 token/sec | 27.18 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
StarCoder | 1 | bf16 | 100 | 100 | 65 token/sec | 15.38 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
OPT | 1 | bf16 | 100 | 100 | 1120 token/sec | 0.89 ms | 1 | text-generation | Optimum Habana 1.11.1 |
T5-3B Summarization 1024-128 Beam4 | 1 | bf16 | 1024 | 128 | 0.94 token/sec | 1063.82 ms | 1 | summarization | Optimum Habana 1.11.1 |
Bert (Text Classification) | 1 | bf16 | 128 | | 2125 token/sec | 3.76 ms | 8 | text-classification | Optimum Habana 1.11.1 |
Bert (Language Modeling) | 1 | bf16 | | | 66.64 token/sec | 60.02 ms | 4 | language-modeling | Optimum Habana 1.11.1 |
Bert (Question Answering) | 1 | bf16 | 384 | | 613 token/sec | 13.05 ms | 8 | question-answering | Optimum Habana 1.11.1 |
StableDiffusion v2.1 (512x512) | 1 | bf16 | | | 1.33 images/sec | 3007.51 ms | 4 | stable-diffusion | PyTorch Lightning 2.2.4 |
Bart | 1 | bf16 | | | 6.79 token/sec | 294.55 ms | 2 | summarization | Optimum Habana 1.11.1 |
BridgeTower | 1 | bf16 | | | 321 token/sec | 49.84 ms | 16 | contrastive-image-text | Optimum Habana 1.11.1 |
ESMFold | 1 | bf16 | | | 2.97 token/sec | 336.7 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
T5-3B Summarization Greedy | 1 | bf16 | | | 2.46 token/sec | 406.5 ms | 1 | summarization | Optimum Habana 1.11.1 |
HF-T5-Small-Translation-Greedy | 1 | bf16 | | | 30.85 token/sec | 129.65 ms | 4 | translation | Optimum Habana 1.11.1 |
Wav2vec (Audio Classification) | 1 | bf16 | | | 1002 token/sec | 3.99 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
Wav2vec (Speech Recognition) | 1 | bf16 | | | 16.62 token/sec | 240.67 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
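The Latency column in the text-generation rows above is consistent with latency ≈ 1000 × batch / throughput, i.e. the average next-token latency implied by the reported tokens/sec. A small consistency check against three rows:

```python
# Reported latency (ms) matches 1000 * batch / throughput for the
# Optimum Habana text-generation rows: it is the average next-token
# latency implied by the aggregate tokens/sec figure.
def latency_ms(throughput_tok_per_s: float, batch: int) -> float:
    return 1000.0 * batch / throughput_tok_per_s

rows = [
    # (throughput tokens/sec, batch, reported latency ms)
    (5820.0, 300, 51.54),   # Llama 2-7B (torch.compile)
    (92.34, 1, 10.82),      # Falcon-40B 2048 Tokens
    (0.94, 1, 1063.82),     # T5-3B Summarization 1024-128 Beam4
]
for tput, batch, reported in rows:
    assert abs(latency_ms(tput, batch) - reported) < 0.1
```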
Intel Gaudi Reference Models
Model | #HPU | Precision | Throughput | Latency | Batch Size | Framework Version |
---|---|---|---|---|---|---|
Bert | 1 | bf16 | 154.1 token/sec | 155.74 ms | 24 | |
Unet2D | 1 | bf16 | 3730 img/sec | 17.15 ms | 64 | Lightning 2.2.4 |
Unet3D | 1 | bf16 | 64.1 img/sec | 31.2 ms | 2 | Lightning 2.2.4 |
Hugging Face Optimum with Intel Gaudi
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub).
Model | #HPU | Precision | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|
HF Bert (Language Modeling) | 1 | bf16 | | | 4 | language-modeling | Optimum Habana 1.11.1 |
HF Bert (Question Answering) | 1 | bf16 | 127.7 token/sec | 62.64 ms | 8 | question-answering | Optimum Habana 1.11.1 |
HF Bert (Text Classification) | 1 | bf16 | 434.4 token/sec | 18.41 ms | 8 | text-classification | Optimum Habana 1.11.1 |
HF Bart-Greedy | 1 | bf16 | 3.1 token/sec | 645.16 ms | 2 | summarization | Optimum Habana 1.11.1 |
HF ESMFold | 1 | bf16 | 13.9 token/sec | 71.94 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
HF StableDiffusion V2-1 (512x512) | 1 | bf16 | 0.4 img/sec | 10000 ms | 4 | text-to-image generation | Optimum Habana 1.11.1 |
HF-T5-Small-Translation-Greedy | 1 | bf16 | 16.8 token/sec | 238.09 ms | 4 | translation | Optimum Habana 1.11.1 |
HF Wav2vec (Audio Classification) | 1 | bf16 | 494.6 token/sec | 8.08 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
HF Wav2vec (Speech Recognition) | 1 | bf16 | 9.5 token/sec | 421.05 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
** These models used the previous 1.15.0 software release.
† For large language inference models, this is the average next-token latency.
System Configuration
Intel® Gaudi® Platform
System: HLS-1 with eight Intel® Gaudi® platform HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs at 2.70 GHz, and 756 GB of system memory
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.16.0-526
- PyTorch*: Models run with PyTorch v2.2.2 use this Docker* image.
- Environment: These workloads run in Docker images directly on the host operating system.
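A typical container launch looks like the sketch below. This is illustrative only: the image tag is assumed from Habana's published naming scheme for the 1.16.0 release with PyTorch 2.2.2, and the runtime flags follow the Gaudi Docker setup documentation.

```shell
# Illustrative only: launch the Gaudi PyTorch container (image tag assumed
# from Habana's gaudi-docker naming scheme for release 1.16.0 / PyTorch 2.2.2).
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
```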
For each model’s support and validation coverage, see Model-References on GitHub. All information provided there is subject to change without notice. Your costs and results may vary.