Model Performance Data for Intel® Gaudi® 3 AI Accelerators
These performance numbers were measured with the SynapseAI* software release v1.18.0 unless otherwise noted.
Note: All models, for both training and inference, use the PyTorch* 2.4.0 framework. Other applicable frameworks used for training or inference are noted for each model.
Inference
Large Language Models (LLM) for Throughput with Intel Gaudi 3 Accelerator
| Model | # HPU | Precision | Input Length | Output Length | Batch Size | Throughput (tokens/sec) |
|---|---|---|---|---|---|---|
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 1,536 | 19,810 |
| LLaMA 2 7B | 1 | fp8 | 128 | 2,048 | 217 | 6,763 |
| LLaMA 2 7B | 1 | fp8 | 2,048 | 128 | 153 | 2,029 |
| LLaMA 2 7B | 1 | fp8 | 2,048 | 2,048 | 75 | 2,734 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 1,750 | 4,433 |
| LLaMA 2 70B | 2 | fp8 | 128 | 2,048 | 512 | 6,026 |
| LLaMA 2 70B | 2 | fp8 | 2,048 | 128 | 231 | 498 |
| LLaMA 2 70B | 2 | fp8 | 2,048 | 2,048 | 240 | 2,641 |
| LLaMA 3.1 8B | 1 | fp8 | 128 | 128 | 1,536 | 24,310 |
| LLaMA 3.1 8B | 1 | fp8 | 128 | 2,048 | 768 | 18,830 |
| LLaMA 3.1 8B | 1 | fp8 | 2,048 | 128 | 256 | 2,652 |
| LLaMA 3.1 8B | 1 | fp8 | 2,048 | 2,048 | 364 | 7,405 |
| LLaMA 3.1 70B | 2 | fp8 | 128 | 128 | 3,516 | 3,711 |
| LLaMA 3.1 70B | 2 | fp8 | 128 | 2,048 | 450 | 5,776 |
| LLaMA 3.1 70B | 2 | fp8 | 2,048 | 128 | 223 | 497 |
| LLaMA 3.1 70B | 2 | fp8 | 2,048 | 2,048 | 175 | 2,588 |
| LLaMA 3.1 70B | 8 | fp8 | 128 | 128 | 4,000 | 15,008 |
| LLaMA 3.1 70B | 8 | fp8 | 128 | 2,048 | 600 | 15,711 |
| LLaMA 3.1 70B | 8 | fp8 | 2,048 | 128 | 400 | 1,600 |
| LLaMA 3.1 70B | 8 | fp8 | 2,048 | 2,048 | 600 | 8,946 |
| Mistral 7B | 1 | fp8 | 128 | 128 | 896 | 24,433 |
| Mistral 7B | 1 | fp8 | 128 | 2,048 | 120 | 13,726 |
| Mistral 7B | 1 | fp8 | 2,048 | 128 | 120 | 2,085 |
| Mistral 7B | 1 | fp8 | 2,048 | 2,048 | 44 | 4,970 |
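As an illustration of how the table's throughput metric relates to batch size and output length, the sketch below computes tokens/sec from a benchmark run. The function name and the sample numbers are hypothetical; they are not taken from the measurements above.

```python
def tokens_per_second(batch_size: int, output_length: int, elapsed_seconds: float) -> float:
    """Total generated tokens across the batch divided by wall-clock generation time."""
    return (batch_size * output_length) / elapsed_seconds

# Hypothetical run: a batch of 100 prompts, each generating 128 output tokens,
# completing in 10 seconds of wall-clock time.
print(tokens_per_second(100, 128, 10.0))  # 1280.0 tokens/sec
```

Note that at a fixed throughput, larger batch sizes trade higher aggregate tokens/sec for longer per-request latency, which is why the table pairs each throughput figure with the batch size used to achieve it.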
System Configuration
Intel Gaudi 3 Platform
System: HLS-Gaudi3 with eight Intel Gaudi 3 HL-325L mezzanine cards, two Intel Xeon Platinum 8480+ CPUs at 2.0 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.18.0 (full software support details)
- PyTorch: Models run with PyTorch v2.4.0