FPGA AI Suite: IP Reference Manual

ID 768974
Date 3/29/2024
Public

2.2. Model Performance

The performance estimator tool (described in the FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Arria® 10: 265 MHz
  • Agilex™ 7: 400 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime hosts used for determining the performance results are as follows:
  • Arria® 10 runtime host: CentOS7 host on an Intel® Xeon® processor E5-1650 @ 3.6 GHz
  • Agilex™ 7 runtime host: SLES12 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz

This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies with the clock speed, the DDR latency and bandwidth, and, depending on the graph, the host CPU speed.

The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
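
To confirm that a project of your own carries the same two assignments, a check along the following lines may help. This is a minimal sketch, not part of the FPGA AI Suite: the helper name missing_assignments and the sample text are hypothetical, and the parser assumes each assignment sits on one line with an unquoted value.

```python
import re

# The two assignments quoted above from dla_build_example_design.py.
REQUIRED = {
    "ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES": "ALWAYS",
    "DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES": "OFF",
}

# Matches lines of the form: set_global_assignment -name <NAME> <VALUE>
ASSIGNMENT = re.compile(r"^\s*set_global_assignment\s+-name\s+(\S+)\s+(\S+)")


def missing_assignments(qsf_text):
    """Return the required names that are absent or set to another value."""
    found = {}
    for line in qsf_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return sorted(
        name for name, value in REQUIRED.items() if found.get(name) != value
    )


# Hypothetical .qsf content used only for illustration.
sample = """\
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
"""
print(missing_assignments(sample))  # prints []
```

In practice you would pass the text of your own project's .qsf file instead of the sample string.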

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

Details - FPGA AI Suite 2024.1

Architecture fMAX ALMs DSPs M20Ks Registers
A10_FP16_Generic 306 MHz 27.8 k 186 498 72 k
A10_FP16_Performance 280 MHz 84.2 k 1162 1482 254 k
A10_Small_NoSoftmax 345 MHz 14.6 k 80 247 40 k
A10_Small_Softmax 325 MHz 15.7 k 90 255 43 k
A10_Generic 299 MHz 30.5 k 202 617 81 k
A10_Performance 275 MHz 57.8 k 650 948 171 k
AGX7_FP16_Generic 600 MHz 32.1 k 186 501 96 k
AGX7_FP16_Performance 600 MHz 99.3 k 1162 1495 331 k
AGX7_Small_NoSoftmax 616 MHz 16.3 k 80 296 54 k
AGX7_Small_Softmax 616 MHz 17.7 k 90 304 60 k
AGX7_Generic 616 MHz 35.1 k 202 759 117 k
AGX7_Performance 581 MHz 62.4 k 650 1240 202 k
AGX7_Performance_Giant 565 MHz 117.9 k 1546 2333 407 k
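
The margin between the achieved fMAX values in the table above and the fMAX values assumed by the performance estimator can be computed directly. The following is a small illustrative sketch that uses only numbers from this section; the function name margin_percent is hypothetical, and the device family is inferred from the architecture name prefix.

```python
# fMAX (MHz) assumed by the performance estimator, from the list in this section.
ASSUMED_MHZ = {"A10": 265, "AGX7": 400}

# Achieved fMAX (MHz), copied from the table above.
ACHIEVED_MHZ = {
    "A10_FP16_Generic": 306,
    "A10_FP16_Performance": 280,
    "A10_Small_NoSoftmax": 345,
    "A10_Small_Softmax": 325,
    "A10_Generic": 299,
    "A10_Performance": 275,
    "AGX7_FP16_Generic": 600,
    "AGX7_FP16_Performance": 600,
    "AGX7_Small_NoSoftmax": 616,
    "AGX7_Small_Softmax": 616,
    "AGX7_Generic": 616,
    "AGX7_Performance": 581,
    "AGX7_Performance_Giant": 565,
}


def margin_percent(architecture):
    """Percentage by which the achieved fMAX exceeds the assumed fMAX."""
    family = architecture.split("_", 1)[0]  # "A10" or "AGX7"
    assumed = ASSUMED_MHZ[family]
    return 100.0 * (ACHIEVED_MHZ[architecture] - assumed) / assumed


for name in ACHIEVED_MHZ:
    print(f"{name}: {margin_percent(name):+.1f}%")
```

For the values listed here, every margin is positive. The smallest is A10_Performance at roughly 3.8%, and the Agilex™ 7 architectures all exceed the 400 MHz assumption by more than 40%.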

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 1181 89 71.2 89.5
A10_FP16_Performance 84.2 k 1162 4685 294 71.2 89.5
A10_Small_NoSoftmax 14.6 k 80 1142 98 69.8 89.1
A10_Small_Softmax 15.7 k 90 1080 93 69.6 89.1
A10_Generic 30.5 k 202 985 131 69.6 89.1
A10_Performance 57.8 k 650 2619 297 70.0 89.0
AGX7_FP16_Generic 32.1 k 186 2262 171 71.2 89.5
AGX7_FP16_Performance 99.3 k 1162 8921 560 71.2 89.5
AGX7_Small_NoSoftmax 16.3 k 80 2788 169 70.9 89.6
AGX7_Small_Softmax 17.7 k 90 2790 169 70.9 89.5
AGX7_Generic 35.1 k 202 3327 257 70.9 89.5
AGX7_Performance 62.4 k 650 6095 390 70.9 89.5
AGX7_Performance_Giant 117.9 k 1546 9130 1511 70.9 89.6
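
As noted earlier in this section, performance depends on DDR latency and bandwidth and on the host CPU as well as on clock speed. One way to see this in the numbers is to compare how the clock and the throughput scale between the Arria® 10 and Agilex™ 7 versions of the same architecture. The sketch below uses fMAX and public/mobilenet-v1-1.0-224 throughput values from the tables above; the function name scaling is hypothetical, and the comparison assumes that the two versions of an architecture are otherwise comparable, which this section does not state.

```python
# (fMAX in MHz, throughput in fps) for public/mobilenet-v1-1.0-224,
# copied from the tables in this section.
RESULTS = {
    "FP16_Generic": {"A10": (306, 89), "AGX7": (600, 171)},
    "FP16_Performance": {"A10": (280, 294), "AGX7": (600, 560)},
    "Generic": {"A10": (299, 131), "AGX7": (616, 257)},
    "Performance": {"A10": (275, 297), "AGX7": (581, 390)},
}


def scaling(variant):
    """Return (clock ratio, throughput ratio) of Agilex 7 over Arria 10."""
    a10_mhz, a10_fps = RESULTS[variant]["A10"]
    agx7_mhz, agx7_fps = RESULTS[variant]["AGX7"]
    return agx7_mhz / a10_mhz, agx7_fps / a10_fps


for variant in RESULTS:
    clock_ratio, fps_ratio = scaling(variant)
    print(f"{variant}: clock x{clock_ratio:.2f}, throughput x{fps_ratio:.2f}")
```

In these four pairs the throughput ratio is below the clock ratio, most visibly for the Performance architecture (about 1.31x the throughput for about 2.11x the clock), which is consistent with factors other than clock speed limiting throughput.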

public/mobilenet-v2

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 1996 81 71.8 89.6
A10_FP16_Performance 84.2 k 1162 3608 193 71.8 89.6
A10_Small_NoSoftmax 14.6 k 80 2417 86 70.1 88.6
A10_Small_Softmax 15.7 k 90 2318 82 70.0 88.6
A10_Generic 30.5 k 202 850 106 70.0 88.6
A10_Performance 57.8 k 650 2090 197 69.6 88.3
AGX7_FP16_Generic 32.1 k 186 3644 148 71.8 89.6
AGX7_FP16_Performance 99.3 k 1162 6968 373 71.8 89.6
AGX7_Small_NoSoftmax 16.3 k 80 4549 140 71.6 89.6
AGX7_Small_Softmax 17.7 k 90 4556 140 71.8 89.4
AGX7_Generic 35.1 k 202 2701 202 71.8 89.4
AGX7_Performance 62.4 k 650 5802 279 71.7 89.4
AGX7_Performance_Giant 117.9 k 1546 6766 1154 71.7 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 2214 66 74.8 91.9
A10_FP16_Performance 84.2 k 1162 4990 165 74.8 91.9
A10_Generic 30.5 k 202 1594 84 73.2 90.9
A10_Performance 57.8 k 650 2983 170 72.4 90.3
AGX7_FP16_Generic 32.1 k 186 4097 122 74.8 91.9
AGX7_FP16_Performance 99.3 k 1162 8808 293 74.8 91.9
AGX7_Generic 35.1 k 202 4111 148 74.7 91.8
AGX7_Performance 62.4 k 650 7314 244 74.7 91.8
AGX7_Performance_Giant 117.9 k 1546 7817 882 74.6 91.8

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 2071 93 75.8 92.1
A10_FP16_Performance 84.2 k 1162 6625 141 75.8 92.1
AGX7_FP16_Generic 32.1 k 186 3797 170 75.8 92.1
AGX7_FP16_Performance 99.3 k 1162 11156 238 75.8 92.1
AGX7_Generic 35.1 k 202 4625 185 72.3 90.7
AGX7_Performance 62.4 k 650 11150 238 72.3 90.5
AGX7_Performance_Giant 117.9 k 1546 8720 370 72.4 90.6

public/resnet-50-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 1560 16 76.8 92.9
A10_FP16_Performance 84.2 k 1162 6595 93 76.8 92.9
A10_Small_NoSoftmax 14.6 k 80 2024 17 76.6 92.7
A10_Small_Softmax 15.7 k 90 1912 16 76.4 92.6
A10_Generic 30.5 k 202 1367 31 76.4 92.6
A10_Performance 57.8 k 650 4294 97 76.5 92.7
AGX7_FP16_Generic 32.1 k 186 3002 32 76.8 92.9
AGX7_FP16_Performance 99.3 k 1162 11555 163 76.8 92.9
AGX7_Small_NoSoftmax 16.3 k 80 5983 28 77.0 92.9
AGX7_Small_Softmax 17.7 k 90 5985 28 77.1 92.9
AGX7_Generic 35.1 k 202 4310 62 77.1 92.9
AGX7_Performance 62.4 k 650 10094 143 76.9 92.9
AGX7_Performance_Giant 117.9 k 1546 8227 242 76.9 92.8

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 1464 20 74.4 91.4
A10_FP16_Performance 84.2 k 1162 6999 113 74.4 91.4
A10_Small_NoSoftmax 14.6 k 80 1419 21 73.8 91.2
A10_Small_Softmax 15.7 k 90 1340 20 73.9 91.0
A10_Generic 30.5 k 202 1372 38 73.9 91.0
A10_Performance 57.8 k 650 4387 118 73.9 91.1
AGX7_FP16_Generic 32.1 k 186 2820 38 74.4 91.4
AGX7_FP16_Performance 99.3 k 1162 11985 193 74.4 91.4
AGX7_Small_NoSoftmax 16.3 k 80 4197 37 74.1 91.4
AGX7_Small_Softmax 17.7 k 90 4197 37 74.2 91.3
AGX7_Generic 35.1 k 202 4602 75 74.2 91.3
AGX7_Performance 62.4 k 650 10435 168 74.0 91.4
AGX7_Performance_Giant 117.9 k 1546 8299 268 74.1 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps]
A10_FP16_Generic 27.8 k 186 430 0.55
A10_FP16_Performance 84.2 k 1162 2147 3.57
AGX7_FP16_Generic 32.1 k 186 812 1.07
AGX7_FP16_Performance 99.3 k 1162 4331 7.20
AGX7_Small_NoSoftmax 16.3 k 80 1133 1.10
AGX7_Small_Softmax 17.7 k 90 1135 1.10
AGX7_Generic 35.1 k 202 1310 2.13
AGX7_Performance 62.4 k 650 3743 6.37
AGX7_Performance_Giant 117.9 k 1546 5672 12.24

public/yolo-v3-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
A10_FP16_Generic 27.8 k 186 718 2.1 62.27 31.58
A10_FP16_Performance 84.2 k 1162 3074 13.5 62.25 31.58
A10_Generic 30.5 k 202 651 4.0 62.07 31.26
A10_Performance 57.8 k 650 1777 11.5 62.25 31.32
AGX7_FP16_Generic 32.1 k 186 1392 4.1 62.27 31.58
AGX7_FP16_Performance 99.3 k 1162 6268 27.6 62.25 31.58
AGX7_Generic 35.1 k 202 1861 8.1 62.28 31.49
AGX7_Performance 62.4 k 650 2695 11.9 62.22 31.47
AGX7_Performance_Giant 117.9 k 1546 5206 31.8 62.25 31.46

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
A10_FP16_Generic 27.8 k 186 548 19 35.79 14.77
A10_FP16_Performance 84.2 k 1162 2280 57 35.81 14.78
A10_Generic 30.5 k 202 773 37 35.76 14.78
A10_Performance 57.8 k 650 1369 43 35.71 14.70
AGX7_FP16_Generic 32.1 k 186 1065 37 35.79 14.77
AGX7_FP16_Performance 99.3 k 1162 4540 113 35.81 14.78
AGX7_Generic 35.1 k 202 2011 68 35.76 14.74
AGX7_Performance 62.4 k 650 1595 40 35.73 14.72
AGX7_Performance_Giant 117.9 k 1546 5234 113 35.81 14.75

public/yolo-v8-nano detection

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
A10_FP16_Performance 84.2 k 1162 3407 53 51.15 36.52
A10_Generic 30.5 k 202 923 20 50.62 36.05
A10_Performance 57.8 k 650 2306 39 50.59 36.03
AGX7_FP16_Performance 99.3 k 1162 6159 96 51.15 36.52
AGX7_Generic 35.1 k 202 2481 51 51.14 36.50
AGX7_Performance 62.4 k 650 6276 100 51.10 36.48

public/yolo-v8-nano classification

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Performance 84.2 k 1162 3306 442 67.92 87.72
A10_Generic 30.5 k 202 572 183 66.06 87.22
A10_Performance 57.8 k 650 993 223 65.94 87.06
AGX7_FP16_Performance 99.3 k 1162 8687 1161 67.92 87.72
AGX7_Generic 35.1 k 202 5604 963 67.96 87.86
AGX7_Performance 62.4 k 650 10355 1384 67.72 87.72

public/squeezenet1.1

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 326 113 58.5 81.1
A10_FP16_Performance 84.2 k 1162 2316 465 58.5 81.1
A10_Small_NoSoftmax 14.6 k 80 371 126 58.9 80.9
A10_Small_Softmax 15.7 k 90 350 119 58.1 81.1
A10_Generic 30.5 k 202 499 274 58.1 81.1
A10_Performance 57.8 k 650 1337 461 58.7 81.1
AGX7_FP16_Generic 32.1 k 186 631 219 58.5 81.1
AGX7_FP16_Performance 99.3 k 1162 4553 915 58.5 81.1
AGX7_Small_NoSoftmax 16.3 k 80 929 221 58.5 81.0
AGX7_Small_Softmax 17.7 k 90 930 222 58.5 81.0
AGX7_Generic 35.1 k 202 1754 545 58.5 81.0
AGX7_Performance 62.4 k 650 2108 424 58.4 81.0
AGX7_Performance_Giant 117.9 k 1546 3732 980 58.3 81.1

public/i3d_rgb_tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 27.8 k 186 218 0.31 65.79 82.89
A10_FP16_Performance 84.2 k 1162 1187 1.92 65.79 82.89
A10_Small_NoSoftmax 14.6 k 80 235 0.32 66.01 83.55
A10_Small_Softmax 15.7 k 90 221 0.31 65.35 83.55
A10_Generic 30.5 k 202 349 0.68 66.23 83.11
A10_Performance 57.8 k 650 1097 1.89 66.67 83.77
AGX7_FP16_Generic 32.1 k 186 438 0.60 65.79 82.89
AGX7_FP16_Performance 99.3 k 1162 2389 3.86 65.79 82.89
AGX7_Small_NoSoftmax 16.3 k 80 491 0.58 65.35 83.11
AGX7_Small_Softmax 17.7 k 90 492 0.58 65.57 83.11
AGX7_Generic 35.1 k 202 745 1.36 65.57 83.11
AGX7_Performance 62.4 k 650 2316 3.74 65.13 83.11
AGX7_Performance_Giant 117.9 k 1546 2685 4.39 65.79 82.89

* DDR is the estimated minimum average read + write bandwidth (that is, reads and writes together require at least this much bandwidth on average). Peak bandwidth is higher.
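
Because the DDR column is an average bandwidth and the batch size is 1, dividing it by the throughput gives a rough estimate of the average DDR traffic per inference. This is derived arithmetic rather than a figure published in this manual, and the function name ddr_mb_per_frame is hypothetical. The sketch below applies it to four rows of the public/mobilenet-v1-1.0-224 table.

```python
def ddr_mb_per_frame(ddr_mb_per_s, throughput_fps):
    """Estimated average DDR traffic per inference, in MB."""
    return ddr_mb_per_s / throughput_fps


# (DDR in MB/s, throughput in fps) from the public/mobilenet-v1-1.0-224 table.
ROWS = {
    "A10_FP16_Generic": (1181, 89),
    "AGX7_FP16_Generic": (2262, 171),
    "A10_FP16_Performance": (4685, 294),
    "AGX7_FP16_Performance": (8921, 560),
}

for name, (ddr, fps) in ROWS.items():
    print(f"{name}: {ddr_mb_per_frame(ddr, fps):.1f} MB per inference")
```

For these rows the estimate comes out at about 13 MB per inference for the FP16_Generic architectures and about 16 MB for the FP16_Performance architectures on both device families, which suggests that the DDR figure scales with throughput for a given architecture and graph.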