Model Performance Data for Intel® Gaudi® 2 AI Accelerators
These performance numbers are measured using the latest SynapseAI* software release version 1.19.0, unless otherwise noted.
Note: All models for both training and inference use the PyTorch* 2.5.1 framework. Other applicable frameworks used for training or inference are noted for each model.
Large Language Model (LLM) Throughput: Intel Gaudi 2 Accelerator
Max throughput (tokens/sec; higher is better)

Model | # HPU | Sequence Length | Precision | Batch Size | Throughput (tokens/sec)
---|---|---|---|---|---
LLaMA V2 7B | 8 | 4096 | FP8 | 1024 | 70523
LLaMA V2 13B | 16 | 4096 | FP8 | 256 | 59397
LLaMA V2 70B | 64 | 4096 | FP8 | 1024 | 54614
LLaMA V3.1 8B | 8 | 8192 | FP8 | 128 | 37440
LLaMA V3.1 70B | 64 | 8192 | FP8 | 128 | 43332
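As a rough illustration of how a tokens-per-second figure like those above relates to the training configuration: throughput is the global batch size times the sequence length, divided by the time one training step takes. The helper below is a sketch of that relationship; the step time used is a hypothetical value, not a measured Gaudi 2 number.

```python
def training_throughput(batch_size: int, seq_len: int, step_time_s: float) -> float:
    """Tokens processed per second: (batch size x sequence length) / step time."""
    return batch_size * seq_len / step_time_s

# Hypothetical example: a global batch of 1024 sequences of length 4096
# (as in the LLaMA V2 7B row) processed in an assumed 60-second step
# would correspond to roughly 70k tokens/sec.
tps = training_throughput(batch_size=1024, seq_len=4096, step_time_s=60.0)
print(f"{tps:.0f} tokens/sec")
```

Conversely, dividing batch size × sequence length by a reported throughput gives the implied per-step time.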
Intel Gaudi 2 Accelerator with MLPerf* 3.1 Training Performance
Model | # HPU | Precision | Time to Train | Framework Version
---|---|---|---|---
MLPerf 3.1 - GPT3 | 384 | FP8 | 153.58 min† |
MLPerf 3.1 - GPT3 | 256 | FP8 | 223.75 min‡ |
MLPerf 3.1 - Stable Diffusion v2 | 64 | BF16 | 19.4 min‡ | Lightning 2.1.2
MLPerf 3.1 - ResNet | 8 | BF16 | 16.4 min |
MLPerf 3.1 - BERT | 8 | BF16 | 15.01 min |
†The GPT3 measurement with 384 cards was taken using a prelaunch version of the SynapseAI 1.13.0 software stack.
‡The GPT3 measurement with 256 cards and the Stable Diffusion* measurement were taken using the SynapseAI 1.13.0 software stack.
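The two GPT3 rows can be compared to estimate how well time-to-train scales with card count, with the caveat (per the footnotes) that the two runs used different software versions, so this is indicative only. The sketch below computes scaling efficiency as the achieved speedup divided by the ideal linear speedup.

```python
def scaling_efficiency(t_base_min: float, n_base: int,
                       t_scaled_min: float, n_scaled: int) -> float:
    """Fraction of ideal linear speedup achieved when growing the card count."""
    actual_speedup = t_base_min / t_scaled_min   # how much faster the larger run was
    ideal_speedup = n_scaled / n_base            # perfect linear scaling
    return actual_speedup / ideal_speedup

# GPT3 time-to-train from the table above: 223.75 min on 256 cards
# vs. 153.58 min on 384 cards, i.e. roughly 97% scaling efficiency.
eff = scaling_efficiency(223.75, 256, 153.58, 384)
print(f"{eff:.1%}")
```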
System Configuration
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.19.0 (full software support details)
- PyTorch: Models run with PyTorch v2.5.1
Stay Informed
Register for the latest Intel Gaudi AI accelerator developer news, events, training, and updates.