FPGA AI Suite: IP Reference Manual

ID 768974
Date 7/31/2024
Public

2.2. Model Performance

The performance estimator tool (described in the FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Arria® 10: 265 MHz
  • Agilex™ 7: 400 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime host used to determine the performance results is as follows:
  • Agilex™ 7 runtime host: SLES12 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz.
This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies based on the clock speed, the DDR latency and bandwidth, and, depending on the graph, the host CPU speed.
The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
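When reproducing these results outside the provided script, it can help to confirm that a project's .qsf file carries the same two assignments. The following is a minimal sketch, not part of the FPGA AI Suite: the helper name `missing_assignments` is hypothetical, and the parsing is deliberately naive (one assignment per line, `#` starts a comment).

```python
import re

# The two assignments quoted above, as name -> expected value.
REQUIRED = {
    "ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES": "ALWAYS",
    "DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES": "OFF",
}

# Matches: set_global_assignment -name <NAME> <VALUE>
_ASSIGNMENT = re.compile(r"^\s*set_global_assignment\s+-name\s+(\S+)\s+(\S+)")


def missing_assignments(qsf_text):
    """Return required assignment names that are absent or set to another value.

    Hypothetical helper for illustration only; it does not handle quoted
    values, line continuations, or assignments spread over several lines.
    """
    found = {}
    for line in qsf_text.splitlines():
        line = line.split("#", 1)[0]  # naive comment stripping
        match = _ASSIGNMENT.match(line)
        if match:
            found[match.group(1)] = match.group(2)  # last assignment wins
    return sorted(
        name for name, value in REQUIRED.items() if found.get(name) != value
    )


if __name__ == "__main__":
    sample = (
        "set_global_assignment -name "
        "ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS\n"
        "set_global_assignment -name "
        "DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF\n"
    )
    print(missing_assignments(sample))
```

An empty result means both assignments are present with the values shown above.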

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

Details - FPGA AI Suite 2024.2

Architecture fMAX ALMs DSPs M20Ks Registers
AGX7_FP16_Generic 616 MHz 32.5 k 186 501 100 k
AGX7_FP16_Performance 600 MHz 103. k 1162 1533 346 k
AGX7_Small_NoSoftmax 612 MHz 16.7 k 80 296 54 k
AGX7_Small_Softmax 610 MHz 18.3 k 90 304 57 k
AGX7_Generic 600 MHz 38.6 k 202 778 126 k
AGX7_Performance 585 MHz 70.5 k 650 1278 209 k
AGX7_Performance_Giant 537 MHz 127.9 k 1546 2371 372 k
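To make the comparison between the assumed and achieved fMAX concrete, the sketch below computes the margin of each architecture in the table above over the 400 MHz that the performance estimator assumes for Agilex 7. The fMAX values are copied from the table; the function name `fmax_headroom` is a hypothetical name chosen for this illustration.

```python
# Assumed Agilex 7 fMAX used by the performance estimator (from this section).
ASSUMED_FMAX_MHZ = 400

# Achieved fMAX per architecture, copied from the table above.
ACHIEVED_FMAX_MHZ = {
    "AGX7_FP16_Generic": 616,
    "AGX7_FP16_Performance": 600,
    "AGX7_Small_NoSoftmax": 612,
    "AGX7_Small_Softmax": 610,
    "AGX7_Generic": 600,
    "AGX7_Performance": 585,
    "AGX7_Performance_Giant": 537,
}


def fmax_headroom(achieved_mhz, assumed_mhz=ASSUMED_FMAX_MHZ):
    """Fractional margin of the achieved fMAX over the assumed fMAX."""
    return achieved_mhz / assumed_mhz - 1.0


if __name__ == "__main__":
    for name, mhz in ACHIEVED_FMAX_MHZ.items():
        print(f"{name}: {mhz} MHz ({fmax_headroom(mhz):+.1%} vs. assumed)")
```

Even the slowest entry in the table, AGX7_Performance_Giant at 537 MHz, is roughly 34% above the assumed value.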

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 2325 176 71.2 89.5
AGX7_FP16_Performance 103. k 1162 8845 555 71.2 89.5
AGX7_Small_NoSoftmax 16.7 k 80 2774 168 70.9 89.6
AGX7_Small_Softmax 18.3 k 90 2765 167 70.9 89.5
AGX7_Generic 38.6 k 202 3247 250 70.9 89.5
AGX7_Performance 70.5 k 650 6231 397 70.9 89.5
AGX7_Performance_Giant 127.9 k 1546 4755 785 70.9 89.6
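One way to read a table like this is to normalize throughput by resource cost. The sketch below computes frames per second per DSP block for the public/mobilenet-v1-1.0-224 rows above. This figure of merit is an assumption made for illustration and is not a metric defined by the FPGA AI Suite; the name `fps_per_dsp` is hypothetical.

```python
# (architecture, DSPs, throughput in fps) copied from the
# public/mobilenet-v1-1.0-224 table above.
MOBILENET_V1_ROWS = [
    ("AGX7_FP16_Generic", 186, 176),
    ("AGX7_FP16_Performance", 1162, 555),
    ("AGX7_Small_NoSoftmax", 80, 168),
    ("AGX7_Small_Softmax", 90, 167),
    ("AGX7_Generic", 202, 250),
    ("AGX7_Performance", 650, 397),
    ("AGX7_Performance_Giant", 1546, 785),
]


def fps_per_dsp(fps, dsps):
    """Throughput normalized by DSP block count (illustrative metric)."""
    return fps / dsps


def rank_by_efficiency(rows):
    """Return (architecture, fps per DSP) pairs, most efficient first."""
    scored = [(name, fps_per_dsp(fps, dsps)) for name, dsps, fps in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_efficiency(MOBILENET_V1_ROWS):
        print(f"{name}: {score:.2f} fps per DSP")
```

On this metric the small architectures rank highest, while the larger architectures deliver more absolute throughput; which matters more depends on the target device and the application.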

public/mobilenet-v2

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 3720 151 71.8 89.6
AGX7_FP16_Performance 103. k 1162 6979 374 71.8 89.6
AGX7_Small_NoSoftmax 16.7 k 80 4527 139 71.6 89.6
AGX7_Small_Softmax 18.3 k 90 4532 139 71.8 89.4
AGX7_Generic 38.6 k 202 2635 197 71.8 89.4
AGX7_Performance 70.5 k 650 5804 278 71.7 89.4
AGX7_Performance_Giant 127.9 k 1546 4242 720 71.7 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 4194 125 74.8 91.9
AGX7_FP16_Performance 103. k 1162 8833 294 74.8 91.9
AGX7_Generic 38.6 k 202 4074 147 74.7 91.8
AGX7_Performance 70.5 k 650 6881 229 74.7 91.8
AGX7_Performance_Giant 127.9 k 1546 5729 644 74.6 91.8

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 3831 171 75.8 92.1
AGX7_FP16_Performance 103. k 1162 11087 237 75.8 92.1
AGX7_Generic 38.6 k 202 4420 176 72.3 90.7
AGX7_Performance 70.5 k 650 9165 193 72.1 90.5
AGX7_Performance_Giant 127.9 k 1546 7637 319 72.4 90.4

public/resnet-50-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 3081 32 76.8 92.9
AGX7_FP16_Performance 103. k 1162 11626 164 76.8 92.9
AGX7_Small_NoSoftmax 16.7 k 80 5950 28 77.0 92.9
AGX7_Small_Softmax 18.3 k 90 5931 28 77.1 92.9
AGX7_Generic 38.6 k 202 4205 60 77.1 92.9
AGX7_Performance 70.5 k 650 10180 144 76.9 92.9
AGX7_Performance_Giant 127.9 k 1546 7838 230 76.9 92.8

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 2894 39 74.4 91.4
AGX7_FP16_Performance 103. k 1162 11970 193 74.4 91.4
AGX7_Small_NoSoftmax 16.7 k 80 4171 37 74.1 91.4
AGX7_Small_Softmax 18.3 k 90 4156 37 74.2 91.3
AGX7_Generic 38.6 k 202 4490 73 74.2 91.3
AGX7_Performance 70.5 k 650 10214 164 74.0 91.4
AGX7_Performance_Giant 127.9 k 1546 7595 245 74.1 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps]
AGX7_FP16_Generic 32.5 k 186 828 1.09
AGX7_FP16_Performance 103. k 1162 4298 7.14
AGX7_Small_NoSoftmax 16.7 k 80 1121 1.08
AGX7_Small_Softmax 18.3 k 90 1114 1.08
AGX7_Generic 38.6 k 202 1262 2.05
AGX7_Performance 70.5 k 650 3691 6.28
AGX7_Performance_Giant 127.9 k 1546 4198 9.06

public/yolo-v3-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 32.5 k 186 1422 4.2 62.27 31.58
AGX7_FP16_Performance 103. k 1162 6284 27.6 62.25 31.58
AGX7_Generic 38.6 k 202 1795 7.8 62.28 31.49
AGX7_Performance 70.5 k 650 2662 11.6 62.22 31.47
AGX7_Performance_Giant 127.9 k 1546 4918 30.0 62.25 31.46

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 32.5 k 186 1073 37 35.79 14.77
AGX7_FP16_Performance 103. k 1162 4567 113 35.81 14.78
AGX7_Generic 38.6 k 202 1969 66 35.76 14.74
AGX7_Performance 70.5 k 650 1604 40 35.73 14.72
AGX7_Performance_Giant 127.9 k 1546 2980 64 35.81 14.75

public/yolo-v8-nano detection

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Performance 103. k 1162 6154 96 51.15 36.52
AGX7_Generic 38.6 k 202 1942 40 51.14 36.50
AGX7_Performance 70.5 k 650 2135 34 51.10 36.48

public/yolo-v8-nano classification

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Performance 103. k 1162 4761 636 67.92 87.72
AGX7_Generic 38.6 k 202 1628 280 67.96 87.86
AGX7_Performance 70.5 k 650 1232 164 67.72 87.72

public/squeezenet1.1

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 649 225 58.5 81.1
AGX7_FP16_Performance 103. k 1162 4468 898 58.5 81.1
AGX7_Small_NoSoftmax 16.7 k 80 924 220 58.5 81.0
AGX7_Small_Softmax 18.3 k 90 920 219 58.5 81.0
AGX7_Generic 38.6 k 202 1713 532 58.5 81.0
AGX7_Performance 70.5 k 650 2155 432 58.4 81.0
AGX7_Performance_Giant 127.9 k 1546 2767 724 58.3 81.1

public/i3d_rgb_tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 32.5 k 186 449 0.62 65.79 82.89
AGX7_FP16_Performance 103. k 1162 2393 3.87 65.79 82.89
AGX7_Small_NoSoftmax 16.7 k 80 488 0.58 65.35 83.11
AGX7_Small_Softmax 18.3 k 90 487 0.57 65.57 83.11
AGX7_Generic 38.6 k 202 728 1.33 65.57 83.11
AGX7_Performance 70.5 k 650 2341 3.78 65.13 83.11
AGX7_Performance_Giant 127.9 k 1546 2571 4.20 65.79 82.89
* DDR is the estimated minimum average read + write bandwidth (that is, reads and writes together require at least this much bandwidth on average). Peak bandwidth is higher.
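Because the DDR column is an average bandwidth and the throughput column is a frame rate, dividing one by the other gives a rough estimate of DDR traffic per inference. The sketch below does this for the public/resnet-50-tf rows. It assumes that the bandwidth estimate corresponds to the same run as the reported throughput, which this section does not state explicitly; the name `mb_per_frame` is hypothetical.

```python
# (architecture, DDR in MB/s, throughput in fps) copied from the
# public/resnet-50-tf table in this section.
RESNET50_TF_ROWS = [
    ("AGX7_FP16_Generic", 3081, 32),
    ("AGX7_FP16_Performance", 11626, 164),
    ("AGX7_Small_NoSoftmax", 5950, 28),
    ("AGX7_Small_Softmax", 5931, 28),
    ("AGX7_Generic", 4205, 60),
    ("AGX7_Performance", 10180, 144),
    ("AGX7_Performance_Giant", 7838, 230),
]


def mb_per_frame(ddr_mb_per_s, fps):
    """Estimated average DDR traffic per inference, in MB.

    Assumption: the average bandwidth and the throughput describe the same
    run, so (MB/s) / (frames/s) yields MB per frame.
    """
    return ddr_mb_per_s / fps


if __name__ == "__main__":
    for name, ddr, fps in RESNET50_TF_ROWS:
        print(f"{name}: ~{mb_per_frame(ddr, fps):.1f} MB per frame")
```

Under that assumption, an architecture that moves fewer megabytes per frame relies less on external memory for the same graph. Since the table reports a minimum average, these per-frame figures are lower bounds rather than measurements.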