Model Performance Data for Intel® Gaudi® AI Accelerators
Unless otherwise noted, performance numbers were measured using the latest SynapseAI* software release, version 1.16.0-526.
Note: All models for training and inference use the PyTorch* v2.2.2 framework. Other frameworks used for training or inference are noted for each model.
Training Performance Highlights
DeepSpeed for Megatron 0.12.4: Llama2 70B on 1,024 cards (BS=4,096), on 512 cards (BS=2,048), and on 256 cards (BS=1,024)
Intel® Gaudi® 2 with MLPerf* v3.1
These performance numbers were generated with previous versions of the Intel Gaudi software. They will be updated with the upcoming MLPerf* training results, which will be part of the next Intel Gaudi software release.
Model | #HPU | Precision | Time To Train | Frameworks Version |
---|---|---|---|---|
MLPerf 3.1 - GPT3 | 384 | fp8 | 153.58 min** | |
MLPerf 3.1 - GPT3 | 256 | fp8 | 223.75 min† | |
MLPerf 3.1 - Stable Diffusion v2 | 64 | bf16 | 19.4 min† | Lightning 2.1.2 |
MLPerf 3.1 - ResNet | 8 | bf16 | 16.4 min‡ | |
MLPerf 3.1 - BERT | 8 | bf16 | 15.01 min‡ | |
** The GPT-3* measurement with 384 cards was taken using a prelaunch version of the Intel Gaudi v1.13.0 software stack.
† The GPT-3 measurement with 256 cards and Stable Diffusion* measurement were taken using the Intel Gaudi software stack v1.13.0.
‡ The ResNet* and BERT measurements were taken using the Intel Gaudi software v1.15.0.
Intel Gaudi 2 Large Language Models
Model | #HPU | Precision | Throughput | Sequence Length | TP,PP,DP | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
LLaMA 2 7B | 8 | FP8 | 68439 tokens/sec | 4,096 | 1,1,8 | 1,024 | Megatron DeepSpeed PR #372 |
LLaMA 2 13B | 16 | FP8 | 52428 tokens/sec | 4,096 | 2,2,4 | 256 | Megatron DeepSpeed PR #372 |
LLaMA 2 70B | 64 | FP8 | 52838 tokens/sec | 4,096 | 8,2,4 | 1,024 | Megatron DeepSpeed PR #372 |
LLaMA 2 70B** | 256 | bf16 | 137625 tokens/sec | 4,096 | 8,8,4 | 1,024 | Megatron DeepSpeed PR #307 |
LLaMA 2 70B** | 512 | bf16 | 226918 tokens/sec | 4,096 | 8,8,8 | 2,048 | Megatron DeepSpeed PR #307 |
LLaMA 2 70B** | 1024 | bf16 | 427622 tokens/sec | 4,096 | 8,8,16 | 4,096 | Megatron DeepSpeed PR #307 |
TP, PP, DP: the tensor-parallel, pipeline-parallel, and data-parallel degrees used for the DeepSpeed for Megatron training runs.
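As a sanity check, the product of the TP, PP, and DP degrees must equal the total device count. A minimal sketch (plain Python, no Gaudi dependencies) verifying the LLaMA 2 configurations listed in the table above:

```python
# Each entry: (model, #HPU, (TP, PP, DP)) -- values copied from the table above.
configs = [
    ("LLaMA 2 7B",  8,    (1, 1, 8)),
    ("LLaMA 2 13B", 16,   (2, 2, 4)),
    ("LLaMA 2 70B", 64,   (8, 2, 4)),
    ("LLaMA 2 70B", 256,  (8, 8, 4)),
    ("LLaMA 2 70B", 512,  (8, 8, 8)),
    ("LLaMA 2 70B", 1024, (8, 8, 16)),
]

def world_size(tp: int, pp: int, dp: int) -> int:
    """Total number of devices implied by a (TP, PP, DP) layout."""
    return tp * pp * dp

for name, hpus, (tp, pp, dp) in configs:
    # Every row in the table satisfies TP * PP * DP == #HPU.
    assert world_size(tp, pp, dp) == hpus, name
```

The same check is useful when adapting these configurations to a different cluster size: fix TP and PP to fit the model, then scale DP to match the available devices.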
Intel Gaudi 2 Reference Models
Model | #HPU | Precision | Throughput | Acc | TTT | Batch | Framework Version |
---|---|---|---|---|---|---|---|
Llama 2 13B | 16 | bf16 | 10 samples/sec | | | 256 | DeepSpeed 0.14.0 |
Llama 2 70B | 64 | bf16 | 8.88 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
Llama 2 70B | 64 | FP8 | 12.9 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
Stable Diffusion | 64 | bf16 | 11145.8 img/sec | | | 32 | Lightning 2.2.4 |
Stable Diffusion Fine Tuning** | 1 | bf16 | 71 img/sec | | | 7 | Lightning 2.2.4 |
Stable Diffusion Fine Tuning Textual Inversion** | 1 | bf16 | 20.9 img/sec | | | 7 | Lightning 2.2.4 |
ResNet50 LARS | 32 | bf16 | 18399 img/sec | 76.15 | 7.81 min | 256 | |
ResNet50 LARS | 8 | bf16 | 47070 img/sec | 76.14 | 18.98 min | 256 | |
ResNet50 LARS | 1 | bf16 | 6233 img/sec | | | 256 | |
BERT Pre Training Phase 1 | 32 | bf16 | 32450 sent/sec | | 254 min | 64 | |
BERT Pre Training Phase 1 | 8 | bf16 | 9218 sent/sec | | | 64 | |
BERT Pre Training Phase 1 | 1 | bf16 | 1178 sent/sec | | | 64 | |
BERT Pre Training Phase 2 | 32 | bf16 | 10861 sent/sec | | 80.21 min | 16 | |
BERT Pre Training Phase 2 | 8 | bf16 | 2777.5 sent/sec | | | 16 | |
BERT Pre Training Phase 2 | 1 | bf16 | 351 sent/sec | | | 16 | |
BERT SQUAD Fine Tuning | 8 | bf16 | 2075 sent/sec | 90.64 | 4.68 min | 24 | |
BERT SQUAD Fine Tuning | 1 | bf16 | 285 sent/sec | | | 24 | |
ResNext101 | 8 | bf16 | 22184 img/sec | 77.93 | 100 min | 256 | |
ResNext101 | 1 | bf16 | 2853 img/sec | | | 256 | |
SSD | 8 | bf16 | 14651 img/sec | 23.02 | 10.3 min | 128 | |
SSD | 1 | bf16 | 2140 img/sec | | | 128 | |
Transformer | 8 | bf16 | 1110435 tokens/sec | 27.8 | 241.73 min | 8,192 | |
Transformer | 1 | bf16 | 138173.66 tokens/sec | | | 8,192 | |
Unet2D (torch.compile) | 8 | bf16 | 19938.29 img/sec | 72.66 | 12.55 min | 64 | Lightning 2.2.4 |
Unet2D (torch.compile) | 1 | bf16 | 2626 img/sec | | | 64 | Lightning 2.2.4 |
Unet3D | 8 | bf16 | 252 img/sec | 74.26 | | 2 | Lightning 2.2.4 |
Unet3D | 1 | bf16 | 32.42 img/sec | | | 2 | Lightning 2.2.4 |
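Scaling behavior can be read directly off the table: ResNet50 LARS, for example, goes from 6233 img/sec on one card to 47070 img/sec on eight. A small illustrative helper expresses such a pair of measurements as a fraction of ideal linear scaling:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Measured n-card throughput as a fraction of n times the 1-card throughput."""
    return throughput_n / (n * throughput_1)

# ResNet50 LARS numbers from the Intel Gaudi 2 reference models table above.
eff = scaling_efficiency(47070, 6233, 8)
print(f"1 -> 8 cards: {eff:.1%} of linear scaling")  # prints "1 -> 8 cards: 94.4% of linear scaling"
```

The same function applies to any single-card/multi-card row pair above, provided both rows use the same precision and batch size.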
Hugging Face* Optimum with Intel Gaudi 2
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub*).
Model | #HPU | Precision | Throughput | Acc | TTT | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|---|
Llama2-70B Fine Tuning FSDP (LoRA with torch.compile) | 8 | bf16 | 1.3 sentences/sec | 2.13 | 81.75 min | 10 | language-modeling | Optimum Habana 1.11.1 |
Llama2-70B Fine Tuning (LoRA) | 8 | bf16 | 2.6 sentences/sec | 2.13 | 39.43 min | 10 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
Llama1-7B Fine Tuning (LoRA) | 8 | bf16 | 150 sentences/sec | 2.35 | 5.08 min | 64 | language-modeling | Optimum Habana 1.11.1 |
Falcon-180B Fine Tuning (LoRA) | 8 | bf16 | 2.67 sentences/sec | 3.71 | 149.41 min | 1 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
Falcon-40B Fine Tuning (LoRA) | 8 | bf16 | 27.99 sentences/sec | 4.06 | 15.85 min | 1 | language-modeling | Optimum Habana 1.11.1 |
GPTJ-CLM | 8 | bf16 | 22.24 sentences/sec | 0.53 | 17.18 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
GPTNEOX-20B-CLM | 16 | bf16 | 294 sentences/sec | 0.53 | 27.21 min | 2 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
BridgeTower | 8 | bf16 | 726 sentences/sec | | 20.63 min | 40 | contrastive-image-text | Optimum Habana 1.11.1 |
GPT2 | 8 | bf16 | 651 sentences/sec | 0.4 | 1.61 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
GPT2-XL | 8 | bf16 | 94.24 sentences/sec | 0.47 | 6.55 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
ALBERT-Large | 8 | bf16 | 2479 sentences/sec | 91.7 | 1.86 min | 32 | question-answering | Optimum Habana 1.11.1 |
ALBERT-XXL | 8 | bf16 | 456 sentences/sec | 94.8 | 6.73 min | 16 | question-answering | Optimum Habana 1.11.1 |
BERT Base (torch.compile) | 8 | bf16 | 4172 sentences/sec | 85.35 | 1.16 min | 24 | question-answering | Optimum Habana 1.11.1 |
BERT-Large Fine Tuning (torch.compile) | 8 | bf16 | 2117 sentences/sec | 93.4 | 1.98 min | 32 | question-answering | Optimum Habana 1.11.1 |
ClipRoBERTa | 8 | bf16 | 16366 images/sec | | 9.35 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
DistilBERT | 8 | bf16 | 9992 sentences/sec | 82.43 | 0.56 min | 64 | question-answering | Optimum Habana 1.11.1 |
Flan-T5 XXL | 8 | bf16 | 26.99 sentences/sec | 37.06 | 369.91 min | 22 | question-answering | Optimum Habana 1.11.1 |
RoBERTa Base | 8 | bf16 | 6640 sentences/sec | 92.14 | 0.73 min | 64 | question-answering | Optimum Habana 1.11.1 |
RoBERTa Large (torch.compile) | 8 | bf16 | 2122 sentences/sec | 94.43 | 2.06 min | 32 | question-answering | Optimum Habana 1.11.1 |
Swin Transformer | 8 | bf16 | 5841 images/sec | 99.09 | 1.8 min | 160 | image-classification | Optimum Habana 1.11.1 |
T5-LARGE | 8 | bf16 | 87.57 sentences/sec | 44.34 | 246.95 min | 4 | summarization | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
T5-Small | 8 | bf16 | 553 sentences/sec | 26.19 | 106.61 min | 4 | translation | DeepSpeed 0.14.0 Optimum Habana 1.11.1 |
Vision Transformer | 8 | bf16 | 6496 images/sec | 98.91 | 1 min | 128 | image-classification | Optimum Habana 1.11.1 |
Wav2Vec2.0 AC | 8 | bf16 | 1960 sentences/sec | 80.94 | 2.45 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
Wav2Vec2.0 ASR | 8 | bf16 | 76 sentences/sec | 3.96 | 20.65 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
MosaicML for Intel Gaudi 2
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
MosaicML MPT-1B | 8 | bf16 | 24145.17 samples/sec | 7.35 | 13.41 min | 512 | PyTorch 2.2.2 |
MosaicML MPT-70B | 32 | bf16 | 17937.17 samples/sec | 6.95 | 106.43 min | 512 | PyTorch 2.2.2 |
Intel Gaudi Reference Models
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
ResNet50 Keras LARS (torch.compile) | 32 | bf16 | 45063 img/sec | 76.34 | 24.5 min | 256 | |
ResNet50 Keras LARS (torch.compile) | 8 | bf16 | 11633 img/sec | 76.55 | 69.76 min | 256 | |
ResNet50 Keras LARS (torch.compile) | 1 | bf16 | 1621 img/sec | | | 256 | |
BERT Pre Training combine | 32 | bf16 | 4792.62 sent/sec | | 1751 min | 64 | |
BERT Pre Training combine | 8 | bf16 | 1234 sent/sec | | | 64 | |
BERT Pre Training combine | 1 | bf16 | 155 sent/sec | | | 64 | |
BERT Pre Training Phase 1 | 32 | bf16 | 5732.07 sent/sec | | 1315 min | 64 | |
BERT Pre Training Phase 1 | 8 | bf16 | 1481.31 sent/sec | | | 64 | |
BERT Pre Training Phase 1 | 1 | bf16 | 186.2 sent/sec | | | 64 | |
BERT Pre Training Phase 2 | 32 | bf16 | 1917.35 sent/sec | | 436 min | 16 | |
BERT Pre Training Phase 2 | 8 | bf16 | 487.99 sent/sec | | | 16 | |
BERT Pre Training Phase 2 | 1 | bf16 | 61.25 sent/sec | | | 16 | |
BERT SQUAD Fine Tuning | 8 | bf16 | 404.52 sent/sec | 90.68 | 12.96 min | 24 | |
BERT SQUAD Fine Tuning | 1 | bf16 | 53.58 sent/sec | | | 24 | |
BART Fine Tuning | 8 | bf16 | | | | 32 | |
DINO | 8 | bf16 | 947 examples/sec | 77 | 2315 min | 64 | |
MobileNetV2 | 8 | bf16 | 12632 img/sec | 71.49 | 505 min | 256 | |
ResNet152 | 8 | bf16 | 4967 img/sec | 78.63 | 399 min | 128 | |
SSD** | 8 | bf16 | 3439 img/sec | | | 128 | |
Transformer | 8 | bf16 | 187860.33 tokens/sec | 28.1 | 1023 min | 4,096 | |
Unet2D (torch.compile) | 8 | bf16 | 4773 img/sec | 72.86 | 63 min | 64 | Lightning 2.2.4 |
Unet3D | 8 | bf16 | 62 img/sec | 74.33 | 73 min | 2 | Lightning 2.2.4 |
YOLOX | 8 | bf16 | 313.37 img/sec | 39.75 | 2326.8 min | 16 | |
Hugging Face Optimum with Intel Gaudi
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub).
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
---|---|---|---|---|---|---|---|---|
GPT2-XL | 8 | bf16 | 19.37 sentences/sec | 0.47 | 74 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
GPT2 | 8 | bf16 | 167.41 sentences/sec | 0.41 | 4.2 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
T5-LARGE | 8 | bf16 | 50 sentences/sec | 44.34 | 365 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
T5-Small | 8 | bf16 | 192 sentences/sec | 26.12 | 116.8 min | 4 | translation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
ALBERT-L | 8 | bf16 | 490.11 sentences/sec | 92.57 | 7.9 min | 32 | question-answering | Optimum Habana 1.11.1 |
ALBERT-XXL | 8 | bf16 | 75.34 sentences/sec | 94.88 | 41.4 min | 12 | question-answering | Optimum Habana 1.11.1 |
BERT-BASE FT (torch.compile) | 8 | bf16 | 1178 sentences/sec | 85.53 | 3 min | 24 | question-answering | Optimum Habana 1.11.1 |
BERT-Large FT (torch.compile) | 8 | bf16 | 413 sentences/sec | 93.29 | 8.6 min | 24 | question-answering | Optimum Habana 1.11.1 |
Clip-RoBERTa | 8 | bf16 | 895 images/sec | | 45.2 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
DistilBERT | 8 | bf16 | 1524 sentences/sec | 85.72 | 3 min | 8 | question-answering | Optimum Habana 1.11.1 |
RoBERTa Base | 8 | bf16 | 1066 sentences/sec | 91.81 | 3.13 min | 12 | question-answering | Optimum Habana 1.11.1 |
RoBERTa Large (torch.compile) | 8 | bf16 | 410 sentences/sec | 94.76 | 8.6 min | 12 | question-answering | Optimum Habana 1.11.1 |
Swin Transformer | 8 | bf16 | 1573 images/sec | 98.68 | 4.8 min | 64 | image-classification | Optimum Habana 1.11.1 |
Vision Transformer | 8 | bf16 | 2461 images/sec | 97.19 | 2.81 min | 64 | image-classification | Optimum Habana 1.11.1 |
Wav2Vec2-AC | 8 | bf16 | 667 sentences/sec | 81.84 | 6.3 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
Wav2Vec2-ASR | 8 | bf16 | 41.83 sentences/sec | 4.2 | 36.73 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
System Configuration
Intel® Gaudi® Platform
System: HLS-1 with eight Intel® Gaudi® platform HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs at 2.70 GHz, and 756 GB of system memory
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.16.0-526
- PyTorch*: models run with PyTorch v2.2.2 using the supplied Docker* image.
- Environment: workloads run in Docker images directly on the host operating system.
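A sketch of how such a container is typically launched with the Habana container runtime follows; the image tag is illustrative and should be replaced with the one matching your installed Intel Gaudi software release:

```shell
# Illustrative launch of a Gaudi PyTorch container (tag is a placeholder).
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
```

`--net=host` and `--ipc=host` matter for multi-card runs, which rely on host networking and shared memory for collective communication.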
For each model’s support and validation coverage, see Model-References on GitHub. All information provided there is subject to change without notice. Your costs and results may vary.