FPGA AI Suite: Compiler Reference Manual

ID 768972
Date 12/16/2024
Public

3.5.1. Generating an Architecture for Highest Performance

To generate an architecture that is optimized for a graph or set of graphs, the FPGA AI Suite architecture optimizer uses a base architecture and modifies parameters to achieve the highest throughput in frames per second (fps). The best architecture is saved as an architecture description file with a file name based on the architecture parameters.

Some parameters (such as the precision specified in the .arch file) are not modified during optimization. These parameters take their values from the input architecture file specified by the required --march option. Important IP parameters to set manually include arch_precision, num_interleaved_features, and num_interleaved_filters. IP parameters are described in the FPGA AI Suite IP Reference Manual.

Architecture optimization is a slow process. The optimizer stores the intermediate architecture iterations in the arch_gen_reports/ directory.

When you use the --mtarget-fps option, the architecture optimizer might use more than 128 GB of memory. The required memory varies significantly depending on the graph. For example, a ResNet 50-type graph can often require up to 256 GB of memory.

If this memory is not available, the operating system might stop the dla_compiler process, and the user shell prints the message "Killed". For information about how to use the COREDLA_TARGET_FPS_THREAD_LIMIT environment variable to control the resource consumption of the architecture optimizer, refer to the description of --mtarget-fps in Architecture Optimizer Options (dla_compiler Command Options).
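
As a hedged illustration of guarding against the "Killed" outcome, the following sketch checks total RAM before a --mtarget-fps run. It is not part of the FPGA AI Suite: the function names mem_total_gb and has_enough_memory are hypothetical, the check assumes a Linux host with /proc/meminfo, and the 128 GB threshold is simply the figure quoted above.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of dla_compiler).

# mem_total_gb [MEMINFO_FILE]: print total RAM in whole GiB,
# read from a file in Linux /proc/meminfo format.
mem_total_gb() {
    # MemTotal is reported in kB; 1048576 kB = 1 GiB. int() truncates.
    awk '/^MemTotal:/ { print int($2 / 1048576) }' "${1:-/proc/meminfo}"
}

# has_enough_memory REQUIRED_GB [MEMINFO_FILE]: succeed if total RAM >= REQUIRED_GB.
has_enough_memory() {
    total_gb="$(mem_total_gb "${2:-/proc/meminfo}" 2>/dev/null)"
    # An unreadable or missing file counts as 0 GiB, so the check fails safely.
    [ "${total_gb:-0}" -ge "$1" ]
}

# 128 is the figure quoted in the text; some graphs may need a higher threshold.
if ! has_enough_memory 128; then
    echo "Warning: under 128 GB of RAM; a --mtarget-fps run may be killed by the OS." >&2
fi
```

Because the kernel reserves some memory, a machine with 128 GB installed typically reports slightly less than 128 GiB, so this check errs on the side of warning.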

If multiple input graphs are specified by using multiple --network-file options, then the optimizer calculates a weighted objective. For example, if two graphs are specified and have throughput values of fps1 and fps2 then the overall throughput is maximized using the user-specified weights w1 and w2 as follows:

The architecture optimization process uses the compiler, the performance estimator, and the area estimator. Accordingly, its command line options are as follows:

Option Description
--gen-arch [Required] Enable the architecture optimizer.
--mmax-resources=<max_ALMs>,<max_M20K_blocks>,<max_DSP_blocks> [Optional] Sets the maximum number of resources that the output architecture can use (as estimated by the area estimator).

Specify the resources as a comma-delimited sequence of maximum ALMs, M20K blocks, and DSP blocks.

If you do not specify this option, the optimizer places no limit on resources, and the resulting architecture might use more resources than are available on the FPGA device.

--mmax-resources-alm-util=<%_max_ALM_utilization> [Optional] Set this option only if you also set the --mmax-resources option.

This option sets the percentage of the <max_ALMs> value from the --mmax-resources option that the architecture optimizer aims to use for the architecture. The remaining ALMs are left for the Quartus® Prime software to use to improve timing closure.

The default value is 100 (100% utilization).

Designers typically target full-chip logic utilization below 100% to improve routing and timing closure of the design.

As an example, a target of 80% ALM utilization (--mmax-resources-alm-util=80) tells the architecture optimizer that it should use only 80% of the ALMs that were specified by the --mmax-resources option. The other 20% are used by Quartus® Prime software to improve timing closure.

--mtarget-fps [Optional] Sets the minimum frames-per-second (fps) that the output architecture must achieve (as estimated by the performance estimator).

This option significantly increases the run time and memory requirements of the architecture optimizer. The architecture optimizer can use more than 128 GB of RAM when you use this option, so it requires a server-class machine. The required memory varies significantly depending on the graph. For example, a ResNet 50-type graph can often require up to 256 GB of memory.

--interleave-search [Optional] Causes the architecture optimizer to evaluate different legal feature and filter interleave options. This increases the run time of the architecture optimizer by as much as 50%.

Because of this run time penalty, after you find an optimal interleave for a given graph, place the optimized interleave values in the input .arch file and avoid using --interleave-search during any future fine-tuning.

For more information, refer to the "Parameter group: pe_array" topic of the FPGA AI Suite IP Reference Manual.

--gen-min-sb [Optional] Sets the minimum stream buffer size. Specify this option explicitly to reduce the search space for the optimizer.

For more information about stream buffer size, refer to the "Parameter: stream_buffer_depth" section in the "Parameter group: Global Parameters" topic of the FPGA AI Suite IP Reference Manual.

--network-weightings="<network_weight_1> <network_weight_2> <network_weight_3> ... <network_weight_n>" [Optional] Space-delimited specification of network weights when multiple networks are specified. If not specified, then all networks are equally weighted.
--gen-arch-file [Optional] Name of the output architecture (.arch) file.
--max-archsetM-percentage, --arch-limit, --time-limit [Optional] Used when performing a larger search of the optimization space, as described in the output of dla_compiler --help. Not recommended for general use.

The architecture optimizer supports only 1xN and Nx1 interleave, as described in the release notes.

For more information about modifying interleaving, refer to the "Parameters: pe_array/num_interleaved_features, pe_array/num_interleaved_filters" section in the "Parameter Group: pe_array" topic of the FPGA AI Suite IP Reference Manual.

The simplest command format to optimize an Architecture Description (.arch) file for a graph is as follows:
dla_compiler \
   --gen-arch \
   --mmax-resources=<max_ALMs>,<max_M20K_blocks>,<max_DSP_blocks> \
   --network-file <path or paths to graph.xml> \
   --march=<path to input .arch file>
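
The --mmax-resources value in the command above must reach dla_compiler as a single comma-delimited argument. As a minimal sketch, a wrapper script could check the value before starting a long optimizer run. The helper name is_valid_max_resources is hypothetical, and the requirement that each count be a positive integer is an assumption of this sketch, not a documented rule.

```shell
#!/bin/sh
# Hypothetical helper (not part of dla_compiler).
# is_valid_max_resources VALUE: succeed if VALUE is three comma-separated
# positive integers with no spaces, in the order <ALMs>,<M20K blocks>,<DSP blocks>.
is_valid_max_resources() {
    printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*,[1-9][0-9]*,[1-9][0-9]*$'
}

for value in "427200,2713,1518" "427200, 2713, 1518" "427200,2713"; do
    if is_valid_max_resources "$value"; then
        echo "ok:       --mmax-resources=$value"
    else
        echo "rejected: --mmax-resources=$value"
    fi
done
```

The second value is rejected because an unquoted space would make the shell split the option into several arguments. The third is rejected because it is missing the DSP block count.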

Example Command

dla_compiler \
   --gen-arch \
   --mmax-resources=427200,2713,1518 \
   --gen-min-sb=2048 \
   --network-file ResNet50.xml Mobilenet_v1.xml \
   --march=./example_architecture/A10_Performance.arch \
   --mmax-resources-alm-util=75 \
   --fassumed-fmax-core=300 \
   --network-weightings="1 2"
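
The example command pairs a 427200-ALM limit with --mmax-resources-alm-util=75. The following rough sketch shows the arithmetic this implies. The helper name effective_alms is hypothetical, and the integer truncation used here is an assumption; the optimizer's exact rounding is not documented in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of dla_compiler).
# effective_alms MAX_ALMS UTIL_PERCENT: print the ALM count that the optimizer
# would aim for, assuming a plain percentage with integer truncation.
effective_alms() {
    echo $(( $1 * $2 / 100 ))
}

max_alms=427200   # first field of --mmax-resources in the example command
util=75           # value of --mmax-resources-alm-util in the example command

target="$(effective_alms "$max_alms" "$util")"
echo "ALM target:  $target"                    # 320400
echo "ALM reserve: $(( max_alms - target ))"   # 106800
```

With the 80% setting described for --mmax-resources-alm-util, the same limit would give a target of 341760 ALMs.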