FPGA AI Suite: IP Reference Manual

ID 768974
Date 12/16/2024
Public

2.2. Model Performance

The performance estimator tool (described in the FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Arria® 10: 265 MHz
  • Agilex™ 7: 400 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.
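As a quick check of the claim above, the following minimal Python sketch compares the assumed Agilex™ 7 fMAX (400 MHz) with the achieved fMAX values listed in the Details table later in this section. The function name `fmax_headroom` is illustrative only and is not part of the FPGA AI Suite.

```python
# Assumed fMAX used by the performance estimator for Agilex 7 (from the text above).
ASSUMED_FMAX_MHZ = 400

# Achieved fMAX values copied from the "Details - FPGA AI Suite 2024.3" table.
achieved_fmax_mhz = {
    "AGX7_FP16_Generic": 600,
    "AGX7_FP16_Performance": 605,
    "AGX7_Small_NoSoftmax": 610,
    "AGX7_Small_Softmax": 616,
    "AGX7_Generic": 600,
    "AGX7_Performance": 585,
    "AGX7_Performance_Giant": 535,
}


def fmax_headroom(achieved_mhz, assumed_mhz=ASSUMED_FMAX_MHZ):
    """Return how far the achieved fMAX exceeds the assumed fMAX, as a fraction."""
    return achieved_mhz / assumed_mhz - 1.0


# Headroom per architecture; a positive value means the assumption was conservative.
headroom = {arch: fmax_headroom(f) for arch, f in achieved_fmax_mhz.items()}

for arch, h in headroom.items():
    print(f"{arch:24s} {achieved_fmax_mhz[arch]} MHz ({h:+.1%} vs. {ASSUMED_FMAX_MHZ} MHz)")
```

Every architecture in the table shows positive headroom, which is consistent with the statement that the estimator's assumption is conservative for these example designs.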

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime host used for determining the performance results is as follows:
  • Agilex™ 7 runtime host: SUSE Linux Enterprise Server 15 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz.
This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies with the clock speed and with the DDR latency and bandwidth.
The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
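If you want to apply the same two assignments to a .qsf file of your own, a minimal Python sketch might look like the following. The helper names `add_missing_assignments` and `update_qsf_file` are hypothetical and are not part of the FPGA AI Suite or of dla_build_example_design.py. The sketch assumes the .qsf file is plain text with one assignment per line.

```python
from pathlib import Path

# The two assignments quoted above, exactly as they appear in this section.
REQUIRED_ASSIGNMENTS = [
    "set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS",
    "set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF",
]


def add_missing_assignments(qsf_text: str) -> str:
    """Return qsf_text with any missing required assignments appended."""
    # Normalize whitespace so that differently spaced copies still match.
    existing = {" ".join(line.split()) for line in qsf_text.splitlines()}
    missing = [a for a in REQUIRED_ASSIGNMENTS if a not in existing]
    if not missing:
        return qsf_text
    if qsf_text and not qsf_text.endswith("\n"):
        qsf_text += "\n"
    return qsf_text + "\n".join(missing) + "\n"


def update_qsf_file(path) -> None:
    """Hypothetical convenience wrapper: rewrite a .qsf file in place."""
    qsf = Path(path)
    qsf.write_text(add_missing_assignments(qsf.read_text()))
```

The check is a simple exact-line match. If the file already sets one of these options to a different value, the sketch appends a second, conflicting line rather than editing the existing one, so review the result before compiling.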

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

The IP Throughput column in the tables that follow shows the performance for the portion of the graph that runs on the FPGA device. In many cases, the entire graph runs on the FPGA device. The IP Throughput is representative of performance if the IP is used in a hostless configuration.

The IP+host Throughput column in the tables that follow shows the performance including the host. The IP+host performance may be lower than IP-only performance if the host is unable to stream data to the FPGA device quickly enough, or if the host is limited by some of the processing associated with the graph (for example, the host performs NMS for the YOLOv3 graph). Achievable IP+host performance depends on the speed and loading of the host and the FPGA AI Suite IP.
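One way to see where the host limits performance is to divide the IP+host throughput by the IP throughput for each architecture. The Python sketch below does this for the public/mobilenet-v1-1.0-224 results reported later in this section. The 0.90 cutoff and the names `host_efficiency` and `host_bound` are illustrative assumptions, not values or terms defined by the FPGA AI Suite.

```python
# (IP throughput [fps], IP+host throughput [fps]) copied from the
# public/mobilenet-v1-1.0-224 table in this section.
mobilenet_v1 = {
    "AGX7_FP16_Generic": (171, 171),
    "AGX7_FP16_Performance": (572, 567),
    "AGX7_Small_NoSoftmax": (167, 167),
    "AGX7_Small_Softmax": (169, 168),
    "AGX7_Generic": (255, 251),
    "AGX7_Performance": (566, 399),
    "AGX7_Performance_Giant": (1483, 764),
}

# Arbitrary illustrative cutoff: below this ratio, treat the result as host-limited.
HOST_BOUND_THRESHOLD = 0.90


def host_efficiency(ip_fps, ip_host_fps):
    """Fraction of the IP-only throughput that survives once the host is included."""
    return ip_host_fps / ip_fps


# Architectures whose system throughput falls well below the IP-only throughput.
host_bound = [
    arch
    for arch, (ip_fps, ip_host_fps) in mobilenet_v1.items()
    if host_efficiency(ip_fps, ip_host_fps) < HOST_BOUND_THRESHOLD
]

for arch, (ip_fps, ip_host_fps) in mobilenet_v1.items():
    print(f"{arch:24s} {host_efficiency(ip_fps, ip_host_fps):.2f}")
print("Likely host-limited:", host_bound)
```

A ratio close to 1.0 suggests that the host keeps up with the IP. A low ratio only indicates that something outside the IP is limiting throughput. It does not identify whether the cause is data streaming or host-side processing.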

Details - FPGA AI Suite 2024.3

Architecture fMAX ALMs DSPs M20Ks Registers
AGX7_FP16_Generic 600 MHz 33.6 k 186 511 95 k
AGX7_FP16_Performance 605 MHz 103.9 k 1162 1533 324 k
AGX7_Small_NoSoftmax 610 MHz 17.2 k 80 296 49 k
AGX7_Small_Softmax 616 MHz 18.6 k 90 304 57 k
AGX7_Generic 600 MHz 38.9 k 202 778 113 k
AGX7_Performance 585 MHz 70.5 k 650 1278 207 k
AGX7_Performance_Giant 535 MHz 127.8 k 1546 2371 359 k

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 2261 171 171 71.2 89.5
AGX7_FP16_Performance 103.9 k 1162 9117 572 567 71.2 89.5
AGX7_Small_NoSoftmax 17.2 k 80 2770 167 167 70.9 89.6
AGX7_Small_Softmax 18.6 k 90 2796 169 168 70.9 89.5
AGX7_Generic 38.9 k 202 3306 255 251 70.9 89.5
AGX7_Performance 70.5 k 650 8893 566 399 70.9 89.5
AGX7_Performance_Giant 127.8 k 1546 8987 1483 764 71.0 89.6

public/mobilenet-v2

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3653 148 147 71.8 89.6
AGX7_FP16_Performance 103.9 k 1162 6948 372 367 71.8 89.6
AGX7_Small_NoSoftmax 17.2 k 80 4609 141 138 71.6 89.6
AGX7_Small_Softmax 18.6 k 90 4645 142 139 71.8 89.4
AGX7_Generic 38.9 k 202 2720 203 198 71.8 89.4
AGX7_Performance 70.5 k 650 7166 343 276 71.7 89.4
AGX7_Performance_Giant 127.8 k 1546 6370 1081 726 71.8 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 4085 122 121 74.8 91.9
AGX7_FP16_Performance 103.9 k 1162 8717 290 288 74.8 91.9
AGX7_Generic 38.9 k 202 4184 151 145 74.7 91.8
AGX7_Performance 70.5 k 650 8716 290 226 74.7 91.8
AGX7_Performance_Giant 127.8 k 1546 7539 847 618 74.7 91.7

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3774 169 165 75.8 92.1
AGX7_FP16_Performance 103.9 k 1162 11260 240 234 75.8 92.1
AGX7_Generic 38.9 k 202 4530 181 174 72.3 90.7
AGX7_Performance 70.5 k 650 11293 246 201 72.1 90.5
AGX7_Performance_Giant 127.8 k 1546 8492 355 304 72.6 90.6

public/resnet-50-tf

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3005 32 32 76.8 92.9
AGX7_FP16_Performance 103.9 k 1162 11715 166 164 76.8 92.9
AGX7_Small_NoSoftmax 17.2 k 80 5935 28 28 77.0 92.9
AGX7_Small_Softmax 18.6 k 90 5989 28 28 77.1 92.9
AGX7_Generic 38.9 k 202 4206 60 60 77.1 92.9
AGX7_Performance 70.5 k 650 11540 163 143 76.9 92.9
AGX7_Performance_Giant 127.8 k 1546 8067 237 229 76.9 92.8

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 2822 38 38 74.4 91.4
AGX7_FP16_Performance 103.9 k 1162 12139 195 195 74.4 91.4
AGX7_Small_NoSoftmax 17.2 k 80 4161 37 37 74.1 91.4
AGX7_Small_Softmax 18.6 k 90 4203 37 37 74.2 91.3
AGX7_Generic 38.9 k 202 4489 73 73 74.2 91.3
AGX7_Performance 70.5 k 650 12119 195 162 74.0 91.4
AGX7_Performance_Giant 127.8 k 1546 8379 270 247 74.1 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps]
AGX7_FP16_Generic 33.6 k 186 825 1.09
AGX7_FP16_Performance 103.9 k 1162 4552 7.57
AGX7_Small_NoSoftmax 17.2 k 80 1140 1.10
AGX7_Small_Softmax 18.6 k 90 1153 1.11
AGX7_Generic 38.9 k 202 1319 2.14
AGX7_Performance 70.5 k 650 4331 7.36
AGX7_Performance_Giant 127.8 k 1546 5426 11.71

public/yolo-v3-tf

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.6 k 186 1428 4.2 4 62.27 31.58
AGX7_FP16_Performance 103.9 k 1162 6347 27.9 28 62.25 31.58
AGX7_Generic 38.9 k 202 1901 8.2 8 62.28 31.49
AGX7_Performance 70.5 k 650 6170 27.0 11 62.22 31.47
AGX7_Performance_Giant 127.8 k 1546 6634 40.5 30 62.25 31.46

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.6 k 186 1200 41 36 35.79 14.77
AGX7_FP16_Performance 103.9 k 1162 4680 116 113 35.81 14.78
AGX7_Generic 38.9 k 202 2433 82 66 35.76 14.74
AGX7_Performance 70.5 k 650 4647 115 40 35.73 14.72
AGX7_Performance_Giant 127.8 k 1546 5028 109 64 35.81 14.75

public/yolo-v8-nano detection

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Performance 103.9 k 1162 6728 94 91 51.15 36.52
AGX7_Generic 38.9 k 202 2427 50 39 51.14 36.50
AGX7_Performance 70.5 k 650 6720 95 32 51.10 36.48

public/yolo-v8-nano classification

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Performance 103.9 k 1162 10345 1384 67.92 87.72
AGX7_Generic 38.9 k 202 5489 943 67.96 87.86
AGX7_Performance 70.5 k 650 10178 1358 67.72 87.72

public/squeezenet1.1

Architecture ALMs DSPs DDR* [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 631 218 219 58.5 81.1
AGX7_FP16_Performance 103.9 k 1162 4679 940 886 58.5 81.1
AGX7_Small_NoSoftmax 17.2 k 80 923 220 219 58.5 81.0
AGX7_Small_Softmax 18.6 k 90 933 222 222 58.5 81.0
AGX7_Generic 38.9 k 202 1722 535 536 58.5 81.0
AGX7_Performance 70.5 k 650 4654 932 419 58.4 81.0
AGX7_Performance_Giant 127.8 k 1546 3631 951 735 58.3 81.1

public/i3d_rgb_tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 442 0.61 65.79 82.89
AGX7_FP16_Performance 103.9 k 1162 2562 4.14 65.79 82.89
AGX7_Small_NoSoftmax 17.2 k 80 492 0.58 65.35 82.89
AGX7_Small_Softmax 18.6 k 90 496 0.59 65.57 82.89
AGX7_Generic 38.9 k 202 742 1.36 65.57 83.11
AGX7_Performance 70.5 k 650 2486 4.01 65.13 83.11
AGX7_Performance_Giant 127.8 k 1546 2839 4.64 65.79 82.89

* The DDR value is the estimated minimum average read + write bandwidth (that is, reads and writes together require at least this much bandwidth on average). Peak bandwidth is higher.
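
Because the DDR figure is an average bandwidth, dividing it by the IP throughput gives a rough estimate of the average DDR traffic per inference. The Python sketch below does this for the public/resnet-50-tf results. It assumes that the DDR estimate corresponds to the IP running at the listed IP throughput, which this section does not state explicitly, and the fps values are rounded, so treat the results as approximate. The name `ddr_mb_per_frame` is illustrative only.

```python
# (DDR [MB/s], IP throughput [fps]) copied from the public/resnet-50-tf table.
resnet50_tf = {
    "AGX7_FP16_Generic": (3005, 32),
    "AGX7_FP16_Performance": (11715, 166),
    "AGX7_Small_NoSoftmax": (5935, 28),
    "AGX7_Small_Softmax": (5989, 28),
    "AGX7_Generic": (4206, 60),
    "AGX7_Performance": (11540, 163),
    "AGX7_Performance_Giant": (8067, 237),
}


def ddr_mb_per_frame(ddr_mb_per_s, ip_fps):
    """Approximate average DDR traffic (read + write) per inference, in MB.

    ASSUMPTION: the DDR estimate applies at the listed IP throughput.
    """
    return ddr_mb_per_s / ip_fps


# Per-architecture estimate of average DDR traffic per frame.
traffic = {
    arch: ddr_mb_per_frame(ddr, fps) for arch, (ddr, fps) in resnet50_tf.items()
}

for arch, mb in traffic.items():
    print(f"{arch:24s} ~{mb:.1f} MB per frame")
```

Under this assumption, the estimated traffic per frame differs noticeably between architectures for the same graph, so the DDR column is best read together with the throughput column rather than on its own.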