Model Performance Data for Intel® Gaudi® 2 AI Accelerators
These performance numbers were measured using SynapseAI* software release version 1.18.0, unless otherwise noted.
Note: All models for both training and inference use the PyTorch* 2.4.0 framework. Other applicable frameworks used for training or inference are noted for each model.
Large Language Model (LLM) Throughput: Intel Gaudi 2 Accelerator
Model | # HPU | Sequence Length | Precision | Batch Size | Throughput (tokens/sec) |
---|---|---|---|---|---|
LLaMA V2 7B | 8 | 4,096 | FP8 | 1,024 | 68,464 |
LLaMA V2 13B | 16 | 4,096 | FP8 | 256 | 58,282 |
LLaMA V2 70B | 64 | 4,096 | FP8 | 1,024 | 54,274 |
LLaMA V3.1 8B | 8 | 8,192 | FP8 | 128 | 36,309 |
LLaMA V3.1 70B | 64 | 8,192 | FP8 | 128 | 43,677 |
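When comparing rows with different card counts, it can help to normalize the aggregate throughput to a single accelerator. A minimal sketch using figures from the table above (the helper function name is illustrative, not part of any Intel Gaudi API):

```python
# Normalize aggregate throughput (tokens/sec) to a per-HPU figure
# by dividing by the number of cards. Values come from the table above.
def per_hpu_throughput(total_tokens_per_sec: float, num_hpus: int) -> float:
    """Return throughput per accelerator for a multi-card run."""
    return total_tokens_per_sec / num_hpus

# LLaMA V2 7B: 68,464 tokens/sec across 8 HPUs
print(per_hpu_throughput(68_464, 8))   # 8558.0 tokens/sec per HPU

# LLaMA V2 70B: 54,274 tokens/sec across 64 HPUs -> roughly 848 tokens/sec per HPU
print(per_hpu_throughput(54_274, 64))
```

Per-HPU numbers naturally drop as model size and card count grow, since larger models spend more time on cross-card communication.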
Intel Gaudi 2 Accelerator with MLPerf* 3.1 Training Performance
These performance numbers have been generated with the latest version of SynapseAI* and are improvements over the officially submitted numbers posted on the MLCommons website.
Model | # HPU | Precision | Time to Train | Framework Version |
---|---|---|---|---|
MLPerf 3.1 - GPT3 | 384 | FP8 | 153.58 min† | |
MLPerf 3.1 - GPT3 | 256 | FP8 | 223.75 min‡ | |
MLPerf 3.1 - Stable Diffusion v2 | 64 | BF16 | 19.4 min‡ | Lightning 2.1.2 |
MLPerf 3.1 - ResNet | 8 | BF16 | 16.4 min | |
MLPerf 3.1 - BERT | 8 | BF16 | 15.01 min | |
†The GPT3 measurement with 384 cards was taken using a prelaunch version of the SynapseAI 1.13.0 software stack.
‡The GPT3 measurement with 256 cards and the Stable Diffusion* measurement were taken using the SynapseAI 1.13.0 software stack.
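The two GPT3 rows also allow a rough scaling-efficiency estimate: going from 256 to 384 cards is a 1.5x increase in hardware, and the measured time-to-train improvement can be compared against that ideal. A small sketch using the figures above (note the two rows used different software versions per the footnotes, so treat the result as approximate):

```python
# Rough scaling efficiency for GPT3 time-to-train, 256 -> 384 HPUs.
# The two measurements used different SynapseAI versions (see footnotes),
# so this is only an approximation.
time_256, time_384 = 223.75, 153.58   # minutes, from the table above
cards_256, cards_384 = 256, 384

actual_speedup = time_256 / time_384          # ~1.46x
ideal_speedup = cards_384 / cards_256         # 1.5x
efficiency = actual_speedup / ideal_speedup   # ~0.97

print(f"scaling efficiency: {efficiency:.1%}")  # ~97%
```

Near-linear scaling at this card count is the headline result; the same arithmetic applies to any pair of rows with matching model and precision.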
System Configuration
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.18.0 (full software support details)
- PyTorch: Models run with PyTorch v2.4.0