Intel® FPGA AI Suite: IP Reference Manual

ID 768974
Date 7/03/2023
Public

2.2. Model Performance

The performance estimator tool (described in the Intel® FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Intel® Arria® 10: 265 MHz
  • Intel Agilex® 7: 400 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the Intel® FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime hosts used for determining the performance results are as follows:
  • Intel® Arria® 10 runtime host: CentOS7 host on an Intel® Xeon® processor E5-1650 @ 3.6 GHz
  • Intel Agilex® 7 runtime host: SLES12 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz

This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies based on the clock speed, the DDR latency and bandwidth, and, depending on the graph, the host CPU speed.

The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Intel® Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
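
To apply the same two options to a different Intel® Quartus® Prime project, one approach is to check that project's settings file for the assignments and append whichever are missing. The following is a minimal sketch under that assumption. The helper names and the file name my_design.qsf are hypothetical and are not part of the Intel® FPGA AI Suite.

```python
from pathlib import Path

# The two assignments quoted in this section, copied verbatim.
REQUIRED_ASSIGNMENTS = [
    "set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS",
    "set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF",
]


def missing_assignments(qsf_text: str) -> list[str]:
    """Return the required assignments that do not appear as a line in qsf_text."""
    # Collapse runs of whitespace so spacing differences do not hide a match.
    present = {" ".join(line.split()) for line in qsf_text.splitlines()}
    return [a for a in REQUIRED_ASSIGNMENTS if a not in present]


def append_missing(qsf_path: Path) -> list[str]:
    """Append any missing assignments to the file and return the lines added."""
    text = qsf_path.read_text()
    missing = missing_assignments(text)
    if missing:
        # Make sure the appended lines start on their own line.
        separator = "" if text.endswith("\n") or not text else "\n"
        qsf_path.write_text(text + separator + "\n".join(missing) + "\n")
    return missing


if __name__ == "__main__":
    qsf = Path("my_design.qsf")  # hypothetical project settings file
    if qsf.exists():
        print("Added:", append_missing(qsf))
```

The sketch only matches the exact assignment text, so a project that sets the same options to different values is reported as missing them and should be reviewed by hand.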

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

Details - Intel FPGA AI Suite V2023.2

| Architecture | fMAX | ALMs | DSPs | M20Ks | Registers |
|---|---|---|---|---|---|
| A10_FP16_Generic | 327 MHz | 25.2 k | 162 | 485 | 66 k |
| A10_FP16_Performance | 296 MHz | 78.1 k | 1114 | 1443 | 239 k |
| A10_Small_NoSoftmax | 348 MHz | 14.5 k | 80 | 247 | 40 k |
| A10_Small_Softmax | 331 MHz | 15.8 k | 90 | 255 | 43 k |
| A10_Generic | 306 MHz | 26.9 k | 178 | 597 | 72 k |
| A10_Performance | 302 MHz | 51.6 k | 602 | 909 | 156 k |
| AGX7_FP16_Generic | 615 MHz | 28.0 k | 162 | 504 | 94 k |
| AGX7_FP16_Performance | 600 MHz | 91.1 k | 1114 | 1505 | 310 k |
| AGX7_Small_NoSoftmax | 615 MHz | 16.1 k | 80 | 307 | 56 k |
| AGX7_Small_Softmax | 615 MHz | 17.3 k | 90 | 315 | 62 k |
| AGX7_Generic | 600 MHz | 30.1 k | 178 | 765 | 108 k |
| AGX7_Performance | 570 MHz | 54.8 k | 602 | 1221 | 179 k |
| AGX7_Performance_NoPrelu_NoEltwise | 595 MHz | 84.0 k | 1162 | 2795 | 295 k |
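
To make the margin between the estimator assumptions and the achieved fMAX concrete, the following sketch computes the headroom for a few rows of the table above. It takes the assumptions stated at the start of this section as 265 MHz for Intel® Arria® 10 and 400 MHz for Intel Agilex® 7. The function name and the choice of rows are illustrative only.

```python
# Assumed fMAX values used by the performance estimator (from this section), in MHz.
ASSUMED_FMAX_MHZ = {"A10": 265, "AGX7": 400}

# Achieved fMAX values copied from the details table, in MHz.
ACHIEVED_FMAX_MHZ = {
    "A10_FP16_Generic": 327,
    "A10_Performance": 302,
    "AGX7_FP16_Generic": 615,
    "AGX7_Performance": 570,
}


def fmax_headroom_percent(architecture: str, achieved_mhz: float) -> float:
    """Return how far the achieved fMAX exceeds the estimator assumption, in percent."""
    family = architecture.split("_", 1)[0]  # "A10" or "AGX7" prefix
    assumed = ASSUMED_FMAX_MHZ[family]
    return (achieved_mhz - assumed) / assumed * 100.0


for name, mhz in ACHIEVED_FMAX_MHZ.items():
    print(f"{name}: {fmax_headroom_percent(name, mhz):.1f}% above the assumption")
```

For example, A10_FP16_Generic at 327 MHz is about 23.4% above the 265 MHz assumption, and AGX7_Performance at 570 MHz is 42.5% above the 400 MHz assumption.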

public/mobilenet-v1-1.0-224

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 1248 | 94 | 71.2 | 89.5 |
| A10_FP16_Performance | 78.1 k | 1114 | 4949 | 306 | 71.2 | 89.5 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 1151 | 99 | 69.8 | 89.1 |
| A10_Small_Softmax | 15.8 k | 90 | 1103 | 94 | 69.6 | 89.0 |
| A10_Generic | 26.9 k | 178 | 1244 | 131 | 69.6 | 89.0 |
| A10_Performance | 51.6 k | 602 | 2890 | 323 | 70.0 | 88.9 |
| AGX7_FP16_Generic | 28.0 k | 162 | 2295 | 173 | 71.2 | 89.5 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 8969 | 555 | 71.2 | 89.5 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 2781 | 168 | 70.8 | 89.6 |
| AGX7_Small_Softmax | 17.3 k | 90 | 2793 | 168 | 70.9 | 89.5 |
| AGX7_Generic | 30.1 k | 178 | 4002 | 237 | 70.9 | 89.5 |
| AGX7_Performance | 54.8 k | 602 | 6034 | 380 | 70.9 | 89.5 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 11923 | 501 | 70.9 | 89.5 |
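
One way to compare architectures of very different sizes is to normalize throughput by DSP count. The sketch below does this for a few rows of the public/mobilenet-v1-1.0-224 table. The metric (frames per second per 100 DSPs) is an illustrative choice, not a figure of merit defined by this manual, and it ignores differences in fMAX, precision, and memory bandwidth.

```python
# (DSPs, throughput in fps) copied from the public/mobilenet-v1-1.0-224 table.
MOBILENET_V1_ROWS = {
    "A10_FP16_Generic": (162, 94),
    "A10_FP16_Performance": (1114, 306),
    "A10_Performance": (602, 323),
    "AGX7_FP16_Performance": (1114, 555),
}


def fps_per_100_dsps(throughput_fps: float, dsps: int) -> float:
    """Return throughput normalized to 100 DSP blocks (illustrative metric)."""
    return throughput_fps / dsps * 100.0


# Print the architectures from most to least throughput per DSP.
ranked = sorted(
    MOBILENET_V1_ROWS.items(),
    key=lambda item: fps_per_100_dsps(item[1][1], item[1][0]),
    reverse=True,
)
for name, (dsps, fps) in ranked:
    print(f"{name}: {fps_per_100_dsps(fps, dsps):.1f} fps per 100 DSPs")
```

On these rows, the small A10_FP16_Generic configuration yields roughly 58 fps per 100 DSPs while A10_FP16_Performance yields roughly 27, which suggests that on this model the larger architecture trades per-DSP efficiency for absolute throughput.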

public/mobilenet-v2

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 2098 | 85 | 71.8 | 89.6 |
| A10_FP16_Performance | 78.1 k | 1114 | 3861 | 201 | 71.7 | 89.6 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 2426 | 86 | 70.1 | 88.6 |
| A10_Small_Softmax | 15.8 k | 90 | 2349 | 83 | 70.0 | 88.7 |
| A10_Generic | 26.9 k | 178 | 1067 | 107 | 70.0 | 88.7 |
| A10_Performance | 51.6 k | 602 | 2324 | 213 | 69.6 | 88.3 |
| AGX7_FP16_Generic | 28.0 k | 162 | 3691 | 150 | 71.8 | 89.6 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 7095 | 369 | 71.7 | 89.6 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 4522 | 139 | 71.7 | 89.6 |
| AGX7_Small_Softmax | 17.3 k | 90 | 4551 | 139 | 71.8 | 89.5 |
| AGX7_Generic | 30.1 k | 178 | 3290 | 190 | 71.8 | 89.5 |
| AGX7_Performance | 54.8 k | 602 | 5725 | 268 | 71.7 | 89.4 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 9669 | 273 | 71.7 | 89.4 |

public/mobilenet-v2-1.4-224

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 2309 | 68 | 74.8 | 91.9 |
| A10_FP16_Performance | 78.1 k | 1114 | 5303 | 170 | 74.9 | 91.8 |
| A10_Generic | 26.9 k | 178 | 1751 | 83 | 73.1 | 90.9 |
| A10_Performance | 51.6 k | 602 | 3304 | 183 | 72.4 | 90.4 |
| AGX7_FP16_Generic | 28.0 k | 162 | 4117 | 122 | 74.8 | 91.9 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 8964 | 288 | 74.9 | 91.8 |
| AGX7_Generic | 30.1 k | 178 | 4456 | 139 | 74.7 | 91.8 |
| AGX7_Performance | 54.8 k | 602 | 7241 | 233 | 74.7 | 91.7 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 11751 | 251 | 74.7 | 91.7 |

public/mobilenet-v3-large-1.0-224-tf

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 2161 | 85 | 75.8 | 92.1 |
| A10_FP16_Performance | 78.1 k | 1114 | 13061 | 29 | 75.8 | 92.1 |
| AGX7_FP16_Generic | 28.0 k | 162 | 3782 | 149 | 75.8 | 92.1 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 17695 | 40 | 75.8 | 92.1 |
| AGX7_Generic | 30.1 k | 178 | 2395 | 64 | 72.3 | 90.7 |
| AGX7_Performance | 54.8 k | 602 | 4205 | 10 | 72.3 | 90.5 |

public/resnet-50-tf

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 1664 | 18 | 76.8 | 92.9 |
| A10_FP16_Performance | 78.1 k | 1114 | 6868 | 97 | 76.8 | 92.9 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 2041 | 17 | 76.6 | 92.7 |
| A10_Small_Softmax | 15.8 k | 90 | 1947 | 16 | 76.4 | 92.6 |
| A10_Generic | 26.9 k | 178 | 1454 | 32 | 76.4 | 92.6 |
| A10_Performance | 51.6 k | 602 | 4654 | 104 | 76.6 | 92.7 |
| AGX7_FP16_Generic | 28.0 k | 162 | 3074 | 32 | 76.8 | 92.9 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 11525 | 163 | 76.8 | 92.9 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 5970 | 28 | 77.0 | 92.9 |
| AGX7_Small_Softmax | 17.3 k | 90 | 5970 | 28 | 77.0 | 92.9 |
| AGX7_Generic | 30.1 k | 178 | 4387 | 60 | 77.0 | 92.9 |
| AGX7_Performance | 54.8 k | 602 | 10136 | 143 | 76.9 | 92.8 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 13722 | 208 | 76.9 | 92.8 |

Resnet50 v1 (Caffe)

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 1563 | 21 | 74.4 | 91.4 |
| A10_FP16_Performance | 78.1 k | 1114 | 7263 | 116 | 74.4 | 91.4 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 1431 | 21 | 73.9 | 91.2 |
| A10_Small_Softmax | 15.8 k | 90 | 1364 | 20 | 73.8 | 91.2 |
| A10_Generic | 26.9 k | 178 | 1471 | 38 | 73.8 | 91.2 |
| A10_Performance | 51.6 k | 602 | 4757 | 128 | 74.2 | 91.2 |
| AGX7_FP16_Generic | 28.0 k | 162 | 2889 | 39 | 74.4 | 91.4 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 12028 | 193 | 74.4 | 91.4 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 4186 | 37 | 74.1 | 91.4 |
| AGX7_Small_Softmax | 17.3 k | 90 | 4189 | 37 | 74.2 | 91.3 |
| AGX7_Generic | 30.1 k | 178 | 4704 | 72 | 74.2 | 91.3 |
| AGX7_Performance | 54.8 k | 602 | 10259 | 164 | 74.0 | 91.3 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 14336 | 228 | 74.0 | 91.3 |

intel/unet-camvid-onnx-0001

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] |
|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 459 | 0.59 |
| A10_FP16_Performance | 78.1 k | 1114 | 2250 | 3.74 |
| AGX7_FP16_Generic | 28.0 k | 162 | 833 | 1.10 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 4335 | 7.21 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 1135 | 1.10 |
| AGX7_Small_Softmax | 17.3 k | 90 | 1134 | 1.10 |
| AGX7_Generic | 30.1 k | 178 | 1283 | 2.09 |
| AGX7_Performance | 54.8 k | 602 | 1894 | 3.22 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 6066 | 8.44 |

public/yolo-v3-tf

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | COCO AP | mAP |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 767 | 2.3 | 31.58 | 62.27 |
| A10_FP16_Performance | 78.1 k | 1114 | 3232 | 14.2 | 31.58 | 62.25 |
| A10_Generic | 26.9 k | 178 | 666 | 4.1 | 31.26 | 62.07 |
| A10_Performance | 51.6 k | 602 | 1932 | 12.5 | 31.32 | 62.25 |
| AGX7_FP16_Generic | 28.0 k | 162 | 1425 | 4.2 | 31.58 | 62.27 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 6262 | 27.5 | 31.58 | 62.25 |
| AGX7_Generic | 30.1 k | 178 | 1818 | 7.9 | 31.48 | 62.21 |
| AGX7_Performance | 54.8 k | 602 | 2649 | 11.7 | 31.47 | 62.22 |

public/yolo-v3-tiny-tf

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | COCO AP | mAP |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 583 | 20 | 14.77 | 35.79 |
| A10_FP16_Performance | 78.1 k | 1114 | 2401 | 60 | 14.78 | 35.81 |
| A10_Generic | 26.9 k | 178 | 790 | 37 | 14.78 | 35.76 |
| A10_Performance | 51.6 k | 602 | 1500 | 48 | 14.70 | 35.71 |
| AGX7_FP16_Generic | 28.0 k | 162 | 1095 | 38 | 14.77 | 35.79 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 4558 | 113 | 14.78 | 35.81 |
| AGX7_Generic | 30.1 k | 178 | 1989 | 67 | 14.74 | 35.76 |
| AGX7_Performance | 54.8 k | 602 | 1569 | 39 | 14.72 | 35.73 |

public/squeezenet1.1

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 1043 | 117 | 58.5 | 81.1 |
| A10_FP16_Performance | 78.1 k | 1114 | 8134 | 289 | 58.5 | 81.1 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 746 | 126 | 58.9 | 81.0 |
| A10_Small_Softmax | 15.8 k | 90 | 716 | 120 | 58.1 | 81.1 |
| A10_Generic | 26.9 k | 178 | 12219 | 62 | 58.1 | 81.1 |
| A10_Performance | 51.6 k | 602 | 5446 | 375 | 58.8 | 81.1 |
| AGX7_FP16_Generic | 28.0 k | 162 | 1904 | 214 | 58.5 | 81.1 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 12682 | 450 | 58.5 | 81.1 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 2140 | 211 | 58.5 | 81.0 |
| AGX7_Small_Softmax | 17.3 k | 90 | 2155 | 211 | 58.5 | 81.0 |
| AGX7_Generic | 30.1 k | 178 | 17918 | 46 | 58.5 | 81.0 |
| AGX7_Performance | 54.8 k | 602 | 9152 | 325 | 58.4 | 81.0 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 14731 | 262 | 58.4 | 81.0 |

public/i3d_rgb_tf

| Architecture | ALMs | DSPs | DDR* [MB/s] | Throughput [fps] | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|---|
| A10_FP16_Generic | 25.2 k | 162 | 117 | 0.17 | 65.79 | 82.89 |
| A10_FP16_Performance | 78.1 k | 1114 | 1268 | 2.01 | 65.79 | 82.89 |
| A10_Small_NoSoftmax | 14.5 k | 80 | 119 | 0.16 | 66.01 | 83.77 |
| A10_Small_Softmax | 15.8 k | 90 | 113 | 0.16 | 66.23 | 83.11 |
| A10_Generic | 26.9 k | 178 | 178 | 0.34 | 66.23 | 83.11 |
| A10_Performance | 51.6 k | 602 | 602 | 1.03 | 66.67 | 83.77 |
| AGX7_FP16_Generic | 28.0 k | 162 | 224 | 0.31 | 65.79 | 82.89 |
| AGX7_FP16_Performance | 91.1 k | 1114 | 2432 | 3.87 | 65.79 | 82.89 |
| AGX7_Small_NoSoftmax | 16.1 k | 80 | 245 | 0.29 | 65.35 | 83.11 |
| AGX7_Small_Softmax | 17.3 k | 90 | 245 | 0.29 | 65.57 | 83.11 |
| AGX7_Generic | 30.1 k | 178 | 364 | 0.66 | 65.57 | 83.11 |
| AGX7_Performance | 54.8 k | 602 | 582 | 0.93 | 65.13 | 83.11 |
| AGX7_Performance_NoPrelu_NoEltwise | 84.0 k | 1162 | 1947 | 2.39 | 65.13 | 83.11 |
* DDR is estimated minimum average read + write (that is, read + write require at least this much bandwidth on average). Peak bandwidth is higher.
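
Because the DDR column is an average bandwidth in MB/s and the throughput column is in frames per second, dividing the two gives a rough estimate of the average DDR traffic per inference. The sketch below works this out for two rows from the tables in this section. The function name is illustrative, and the result inherits the caveat from the note above: it is an estimated minimum average, not a peak figure.

```python
def megabytes_per_frame(ddr_mb_per_s: float, throughput_fps: float) -> float:
    """Return the estimated average DDR traffic (read + write) per inference, in MB."""
    return ddr_mb_per_s / throughput_fps


# (DDR MB/s, throughput fps) copied from the A10_FP16_Generic rows of two tables.
EXAMPLES = {
    "public/mobilenet-v1-1.0-224": (1248, 94),
    "public/resnet-50-tf": (1664, 18),
}

for model, (ddr, fps) in EXAMPLES.items():
    print(f"{model}: about {megabytes_per_frame(ddr, fps):.2f} MB per frame")
```

For A10_FP16_Generic this gives about 13.28 MB per frame for public/mobilenet-v1-1.0-224 and about 92.44 MB per frame for public/resnet-50-tf.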