Intel® FPGA AI Suite: IP Reference Manual

ID 768974
Date 12/01/2023
Public

2.2. Model Performance

The performance estimator tool (described in the Intel® FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Intel® Arria® 10: 265 MHz
  • Intel Agilex® 7: 400 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the Intel® FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime hosts used for determining the performance results are as follows:
  • Intel® Arria® 10 runtime host: CentOS7 host on an Intel® Xeon® processor E5-1650 @ 3.6 GHz
  • Intel Agilex® 7 runtime host: SLES12 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz

The example design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies with the clock speed, the DDR latency and bandwidth, and, depending on the graph, the host CPU speed.

The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Intel® Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
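
If the same two assignments are wanted in a separate Intel® Quartus® Prime project, they can be appended to that project's .qsf file. The following Python sketch is illustrative only: the helper name ensure_qsf_assignments and the file path in the usage note are hypothetical and are not part of the Intel® FPGA AI Suite. The assignment strings are copied verbatim from the two lines above.

```python
from pathlib import Path

# The two assignments quoted above, copied verbatim from the text.
QSF_LINES = [
    "set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS",
    "set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF",
]


def ensure_qsf_assignments(qsf_path, lines=QSF_LINES):
    """Append any of `lines` missing from the .qsf file and return the lines added.

    Hypothetical helper for illustration; not part of the Intel FPGA AI Suite.
    """
    path = Path(qsf_path)
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [line for line in lines if line not in present]
    if missing:
        # Start on a fresh line in case the file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a") as f:
            f.write(prefix + "\n".join(missing) + "\n")
    return missing
```

Calling ensure_qsf_assignments("my_project.qsf") twice in a row adds the lines on the first call and returns an empty list on the second, so the helper is safe to rerun.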

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

Details - Intel FPGA AI Suite V2023.3

Architecture fMAX ALMs DSPs M20Ks Registers
A10_FP16_Generic 324 MHz 26. k 162 491 68 k
A10_FP16_Performance 276 MHz 80.7 k 1114 1469 244 k
A10_Small_NoSoftmax 346 MHz 14.7 k 80 247 40 k
A10_Small_Softmax 347 MHz 16.1 k 90 255 43 k
A10_Generic 298 MHz 28.7 k 178 610 75 k
A10_Performance 301 MHz 54.1 k 602 935 161 k
AGX7_FP16_Generic 600 MHz 29.3 k 162 510 96 k
AGX7_FP16_Performance 600 MHz 94.3 k 1114 1531 314 k
AGX7_Small_NoSoftmax 616 MHz 16.6 k 80 307 54 k
AGX7_Small_Softmax 618 MHz 17.9 k 90 315 64 k
AGX7_Generic 610 MHz 32.6 k 178 776 117 k
AGX7_Performance 568 MHz 56.5 k 602 1277 194 k
AGX7_Performance_NoPrelu_NoEltwise 585 MHz 87.4 k 1162 2795 298 k
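
The margin between the fMAX values assumed by the performance estimator and the achieved fMAX values can be checked against the table above. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, the achieved values are copied from the table, and the Intel Agilex® 7 assumption is taken as 400 MHz.

```python
# Estimator assumptions quoted earlier in this section, in MHz.
# The Agilex 7 value is read as 400 MHz.
ASSUMED_FMAX_MHZ = {"A10": 265, "AGX7": 400}

# Achieved fMAX in MHz, copied from the V2023.3 details table above.
ACHIEVED_FMAX_MHZ = {
    "A10_FP16_Generic": 324,
    "A10_FP16_Performance": 276,
    "A10_Small_NoSoftmax": 346,
    "A10_Small_Softmax": 347,
    "A10_Generic": 298,
    "A10_Performance": 301,
    "AGX7_FP16_Generic": 600,
    "AGX7_FP16_Performance": 600,
    "AGX7_Small_NoSoftmax": 616,
    "AGX7_Small_Softmax": 618,
    "AGX7_Generic": 610,
    "AGX7_Performance": 568,
    "AGX7_Performance_NoPrelu_NoEltwise": 585,
}


def fmax_margin_percent(architecture, achieved_mhz):
    """Return how far the achieved fMAX sits above the assumed value, in percent."""
    family = architecture.split("_", 1)[0]  # "A10" or "AGX7"
    assumed = ASSUMED_FMAX_MHZ[family]
    return 100.0 * (achieved_mhz - assumed) / assumed


margins = {
    name: fmax_margin_percent(name, mhz)
    for name, mhz in ACHIEVED_FMAX_MHZ.items()
}

# Print the architectures from the smallest margin to the largest.
for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{name:36s} {margin:+6.1f} %")
```

With these table values, every margin is positive. The smallest is A10_FP16_Performance, at roughly 4 % above the assumed 265 MHz.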

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 1237 93 71.2 89.5
A10_FP16_Performance 80.7 k 1114 4661 288 71.2 89.5
A10_Small_NoSoftmax 14.7 k 80 1145 98 69.8 89.1
A10_Small_Softmax 16.1 k 90 1153 98 69.6 89.1
A10_Generic 28.7 k 178 1212 128 69.6 89.1
A10_Performance 54.1 k 602 2881 322 70.0 89.0
AGX7_FP16_Generic 29.3 k 162 2242 169 71.2 89.5
AGX7_FP16_Performance 94.3 k 1114 8954 554 71.2 89.5
AGX7_Small_NoSoftmax 16.6 k 80 2789 169 70.9 89.6
AGX7_Small_Softmax 17.9 k 90 2809 169 70.9 89.5
AGX7_Generic 32.6 k 178 4072 241 70.9 89.5
AGX7_Performance 56.5 k 602 6211 391 70.9 89.5
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 11156 469 70.9 89.5
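
As noted earlier, performance depends on the clock speed as well as on DDR and host characteristics. One way to see this in the mobilenet-v1 results is to compare, for each architecture that exists on both device families, the ratio of achieved fMAX with the ratio of measured throughput. The following Python sketch is illustrative only: the names are hypothetical, the fMAX values are copied from the details table, and the throughput values from the table above. The two runtime hosts also differ, so the ratios are not a pure clock-scaling measurement.

```python
# (achieved fMAX in MHz, mobilenet-v1 throughput in fps) per base architecture,
# copied from the tables in this section.
A10 = {
    "FP16_Generic": (324, 93),
    "FP16_Performance": (276, 288),
    "Small_NoSoftmax": (346, 98),
    "Small_Softmax": (347, 98),
    "Generic": (298, 128),
    "Performance": (301, 322),
}
AGX7 = {
    "FP16_Generic": (600, 169),
    "FP16_Performance": (600, 554),
    "Small_NoSoftmax": (616, 169),
    "Small_Softmax": (618, 169),
    "Generic": (610, 241),
    "Performance": (568, 391),
}


def scaling(base):
    """Return (clock ratio, throughput ratio) going from Arria 10 to Agilex 7."""
    a10_mhz, a10_fps = A10[base]
    agx_mhz, agx_fps = AGX7[base]
    return agx_mhz / a10_mhz, agx_fps / a10_fps


for base in A10:
    clock_ratio, fps_ratio = scaling(base)
    print(f"{base:18s} clock x{clock_ratio:.2f}  throughput x{fps_ratio:.2f}")
```

For these rows the throughput ratio stays below the clock ratio in every case, which is consistent with throughput depending on more than fMAX alone.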

public/mobilenet-v2

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 2081 84 71.8 89.6
A10_FP16_Performance 80.7 k 1114 3644 189 71.8 89.6
A10_Small_NoSoftmax 14.7 k 80 2415 86 70.1 88.6
A10_Small_Softmax 16.1 k 90 2442 86 70.0 88.6
A10_Generic 28.7 k 178 1041 104 70.0 88.6
A10_Performance 54.1 k 602 2316 212 69.6 88.3
AGX7_FP16_Generic 29.3 k 162 3613 146 71.8 89.6
AGX7_FP16_Performance 94.3 k 1114 7100 369 71.8 89.6
AGX7_Small_NoSoftmax 16.6 k 80 4535 139 71.6 89.6
AGX7_Small_Softmax 17.9 k 90 4565 140 71.8 89.4
AGX7_Generic 32.6 k 178 3319 192 71.8 89.4
AGX7_Performance 56.5 k 602 5780 271 71.7 89.4
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 10036 279 71.7 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 2290 68 74.8 91.9
A10_FP16_Performance 80.7 k 1114 5024 161 74.8 91.9
A10_Generic 28.7 k 178 1711 81 73.2 90.9
A10_Performance 54.1 k 602 3294 182 72.4 90.3
AGX7_FP16_Generic 29.3 k 162 4032 119 74.8 91.9
AGX7_FP16_Performance 94.3 k 1114 8969 288 74.8 91.9
AGX7_Generic 32.6 k 178 4509 141 74.7 91.8
AGX7_Performance 56.5 k 602 7312 236 74.7 91.8
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 11641 249 74.7 91.8

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 2139 83 75.8 92.1
A10_FP16_Performance 80.7 k 1114 12734 28 75.8 92.1
AGX7_FP16_Generic 29.3 k 162 3699 143 75.8 92.1
AGX7_FP16_Performance 94.3 k 1114 17615 39 75.8 92.1
AGX7_Generic 32.6 k 178 5095 135 72.3 90.7
AGX7_Performance 56.5 k 602 17561 39 72.3 90.5

public/resnet-50-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 1650 17 76.8 92.9
A10_FP16_Performance 80.7 k 1114 6557 92 76.8 92.9
A10_Small_NoSoftmax 14.7 k 80 2030 17 76.6 92.7
A10_Small_Softmax 16.1 k 90 2037 17 76.4 92.6
A10_Generic 28.7 k 178 1418 31 76.4 92.6
A10_Performance 54.1 k 602 4650 104 76.5 92.7
AGX7_FP16_Generic 29.3 k 162 3003 32 76.8 92.9
AGX7_FP16_Performance 94.3 k 1114 11546 163 76.8 92.9
AGX7_Small_NoSoftmax 16.6 k 80 5983 28 77.0 92.9
AGX7_Small_Softmax 17.9 k 90 6001 28 77.1 92.9
AGX7_Generic 32.6 k 178 4452 60 77.1 92.9
AGX7_Performance 56.5 k 602 10072 142 76.9 92.9
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 13490 205 76.9 92.9

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 1549 21 74.4 91.4
A10_FP16_Performance 80.7 k 1114 6958 111 74.4 91.4
A10_Small_NoSoftmax 14.7 k 80 1423 21 73.8 91.2
A10_Small_Softmax 16.1 k 90 1428 21 73.9 91.0
A10_Generic 28.7 k 178 1434 37 73.9 91.0
A10_Performance 54.1 k 602 4736 127 73.9 91.1
AGX7_FP16_Generic 29.3 k 162 2822 38 74.4 91.4
AGX7_FP16_Performance 94.3 k 1114 11937 191 74.4 91.4
AGX7_Small_NoSoftmax 16.6 k 80 4197 37 74.1 91.4
AGX7_Small_Softmax 17.9 k 90 4213 37 74.2 91.3
AGX7_Generic 32.6 k 178 4775 73 74.2 91.3
AGX7_Performance 56.5 k 602 10361 166 74.0 91.4
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 14296 228 74.0 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps]
A10_FP16_Generic 26. k 162 455 0.58
A10_FP16_Performance 80.7 k 1114 2113 3.51
AGX7_FP16_Generic 29.3 k 162 812 1.07
AGX7_FP16_Performance 94.3 k 1114 4307 7.16
AGX7_Small_NoSoftmax 16.6 k 80 1136 1.10
AGX7_Small_Softmax 17.9 k 90 1139 1.10
AGX7_Generic 32.6 k 178 1298 2.11
AGX7_Performance 56.5 k 602 3670 6.24
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 5908 8.22

public/yolo-v3-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] COCO AP mAP
A10_FP16_Generic 26. k 162 758 2.3 31.58 62.27
A10_FP16_Performance 80.7 k 1114 3026 13.3 31.58 62.25
A10_Generic 28.7 k 178 648 4.0 31.26 62.07
A10_Performance 54.1 k 602 1910 12.4 31.32 62.25
AGX7_FP16_Generic 29.3 k 162 1391 4.1 31.58 62.27
AGX7_FP16_Performance 94.3 k 1114 6248 27.5 31.58 62.25
AGX7_Generic 32.6 k 178 1842 8.0 31.49 62.28
AGX7_Performance 56.5 k 602 2640 11.6 31.47 62.22

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] COCO AP mAP
A10_FP16_Generic 26. k 162 576 20 14.77 35.79
A10_FP16_Performance 80.7 k 1114 2244 56 14.78 35.81
A10_Generic 28.7 k 178 766 36 14.78 35.76
A10_Performance 54.1 k 602 1512 48 14.70 35.71
AGX7_FP16_Generic 29.3 k 162 1074 37 14.77 35.79
AGX7_FP16_Performance 94.3 k 1114 4539 113 14.78 35.81
AGX7_Generic 32.6 k 178 2007 68 14.74 35.76
AGX7_Performance 56.5 k 602 1570 39 14.72 35.73

public/squeezenet1.1

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 1034 116 58.5 81.1
A10_FP16_Performance 80.7 k 1114 7827 278 58.5 81.1
A10_Small_NoSoftmax 14.7 k 80 742 125 58.9 80.9
A10_Small_Softmax 16.1 k 90 749 126 58.1 81.1
A10_Generic 28.7 k 178 12149 62 58.1 81.1
A10_Performance 54.1 k 602 5432 374 58.7 81.1
AGX7_FP16_Generic 29.3 k 162 1861 209 58.5 81.1
AGX7_FP16_Performance 94.3 k 1114 12364 439 58.5 81.1
AGX7_Small_NoSoftmax 16.6 k 80 2143 211 58.5 81.0
AGX7_Small_Softmax 17.9 k 90 2165 212 58.5 81.0
AGX7_Generic 32.6 k 178 17942 46 58.5 81.0
AGX7_Performance 56.5 k 602 9063 321 58.4 81.0
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 14917 265 58.4 81.0

public/i3d_rgb_tf

Architecture ALMs DSPs DDR* [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
A10_FP16_Generic 26. k 162 231 0.33 65.79 82.89
A10_FP16_Performance 80.7 k 1114 1193 1.89 65.79 82.89
A10_Small_NoSoftmax 14.7 k 80 235 0.32 65.57 83.99
A10_Small_Softmax 16.1 k 90 236 0.33 66.01 83.55
A10_Generic 28.7 k 178 347 0.67 66.23 83.11
A10_Performance 54.1 k 602 1200 2.05 66.67 83.77
AGX7_FP16_Generic 29.3 k 162 438 0.60 65.79 82.89
AGX7_FP16_Performance 94.3 k 1114 2434 3.87 65.79 82.89
AGX7_Small_NoSoftmax 16.6 k 80 491 0.58 65.35 82.89
AGX7_Small_Softmax 17.9 k 90 493 0.58 65.57 83.11
AGX7_Generic 32.6 k 178 738 1.33 65.57 83.11
AGX7_Performance 56.5 k 602 2320 3.69 65.13 83.11
AGX7_Performance_NoPrelu_NoEltwise 87.4 k 1162 3857 4.73 65.13 83.11

* DDR is the estimated minimum average read + write bandwidth (that is, reads and writes together require at least this much bandwidth on average). Peak bandwidth is higher.
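
Because the batch size is 1, dividing the DDR column by the throughput column gives a rough estimate of the average DDR traffic per inference. The following Python sketch is illustrative only: the function name is hypothetical and the sample rows are copied from the public/mobilenet-v3-large-1.0-224-tf table. Because the DDR column is an estimated minimum average, the result is best read as a lower-bound estimate rather than a measurement.

```python
def ddr_mb_per_frame(ddr_mb_per_s, throughput_fps):
    """Estimate average DDR traffic per inference in MB (batch size 1)."""
    if throughput_fps <= 0:
        raise ValueError("throughput must be positive")
    return ddr_mb_per_s / throughput_fps


# (architecture, DDR in MB/s, throughput in fps) from the
# public/mobilenet-v3-large-1.0-224-tf table.
ROWS = [
    ("A10_FP16_Generic", 2139, 83),
    ("A10_FP16_Performance", 12734, 28),
    ("AGX7_FP16_Generic", 3699, 143),
    ("AGX7_FP16_Performance", 17615, 39),
    ("AGX7_Generic", 5095, 135),
    ("AGX7_Performance", 17561, 39),
]

for name, ddr, fps in ROWS:
    print(f"{name:24s} {ddr_mb_per_frame(ddr, fps):7.1f} MB/frame")
```

For this model, the Performance architectures come out at several hundred MB per frame, against a few tens of MB per frame for the Generic architectures. That difference lines up with their much lower throughput in the same table.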