| Llama 2 13B | 16 | bf16 | 10.16 samples/sec | | | 256 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | bf16 | 9.13 samples/sec | | | 1024 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | FP8 | 13.17 samples/sec | | | 1024 | DeepSpeed 0.14.0 |
| MIXTRAL-8x7B-32K | 32 | bf16 | 0.7 samples/sec | 88.46 | 345 min | 128 | DeepSpeed 0.14.0 |
| Stable Diffusion | 64 | bf16 | 11122 img/sec | | | 32 | Lightning 2.3.3 |
| Stable Diffusion Fine Tuning** | 1 | bf16 | 73 img/sec | | | 7 | Lightning 2.3.3 |
| Stable Diffusion Fine Tuning Textual Inversion** | 1 | bf16 | 19.7 img/sec | | | 7 | Lightning 2.3.3 |
| ResNet50 LARS | 32 | bf16 | 18399 img/sec | 76.38 | 7.26 min | 256 | |
| ResNet50 LARS | 8 | bf16 | 48166.02 img/sec | 76.04 | 17.81 min | 256 | |
| ResNet50 LARS | 1 | bf16 | 6201.14 img/sec | | | 256 | |
| BERT Pre Training Phase 1 (torch.compile) | 32 | bf16 | 33179.52 sent/sec | | 238 min | 64 | |
| BERT Pre Training Phase 1 (torch.compile) | 8 | bf16 | 8593.03 sent/sec | 0 | | 64 | |
| BERT Pre Training Phase 1 (torch.compile) | 1 | bf16 | 1074.45 sent/sec | | | 64 | |
| BERT Pre Training Phase 2 (torch.compile) | 32 | bf16 | 9861.81 sent/sec | 0 | 87 min | 16 | |
| BERT Pre Training Phase 2 (torch.compile) | 8 | bf16 | 2568.65 sent/sec | 0 | | 16 | |
| BERT Pre Training Phase 2 (torch.compile) | 1 | bf16 | 320.41 sent/sec | | | 16 | |
| BERT SQUAD Fine Tuning | 8 | bf16 | 2013 sent/sec | 90.52 | 4.68 min | 24 | |
| ResNext101 | 8 | bf16 | 21851 img/sec | 77.81 | 102 min | 256 | |
| Transformer | 8 | bf16 | 1121879 tokens/sec | 27.9 | 236 min | 8192 | |
| Unet2D (torch.compile) | 8 | bf16 | 19888 img/sec | 72.5 | 10.21 min | 64 | Lightning 2.3.3 |
| Unet3D PTL | 8 | bf16 | 252 img/sec | 74.17 | 17.96 min | 2 | Lightning 2.3.3 |