FPGA AI Suite: PCIe-based Design Example User Guide

ID 768977
Date 7/31/2024
Public

5.5.1. Inference on Image Classification Graphs

The demonstration application requires the OpenVINO™ device flag to be either HETERO:FPGA,CPU for heterogeneous execution or HETERO:FPGA for FPGA-only execution.

By default, the dla_benchmark demonstration application runs five inference requests (batches) in parallel on the FPGA to achieve optimal system performance. To measure steady-state performance, run multiple batches (using the -niter flag) because the first iteration is significantly slower on FPGA devices.

The dla_benchmark demonstration application also supports multiple graphs in the same execution. You can supply more than one graph or compiled graph as input, separated by commas.

Each graph can either have its own input dataset or share a common dataset with all other graphs. Each graph requires its own ground_truth_file, with the files separated by commas. If some ground_truth_file entries are missing, dla_benchmark continues to run and ignores the missing ones.
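
The comma-separated lists described above can be assembled with a small helper. The following is a minimal sketch, not part of the FPGA AI Suite: the join_by_comma function and the file names are hypothetical and only illustrate how the values for multi-graph arguments might be built.

```shell
# Join all arguments into one comma-separated string.
# The subshell body keeps the IFS change from leaking into the caller.
join_by_comma() (
  IFS=','
  printf '%s\n' "$*"
)

# Hypothetical graph and dataset names, for illustration only.
graph_list=$(join_by_comma model_a.xml model_b.xml)
image_list=$(join_by_comma images_a images_b)

printf '%s %s\n' -m "$graph_list"
printf '%s %s\n' -i "$image_list"
```

The resulting strings have the same shape as the comma-separated -m and -i values used in the multi-graph example later in this section.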

When multi-graph is enabled, the -niter flag represents the number of iterations for each graph, so the total number of iterations becomes -niter × number of graphs.

The dla_benchmark demonstration application switches graphs after submitting -nireq requests. The request queue holds up to -nireq × number of graphs requests. This limit is constrained by the DMA CSR descriptor queue size (64 per hardware instance).
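
The arithmetic in the two preceding paragraphs can be checked before launching a run. The following is a rough sketch under stated assumptions: the function names are hypothetical, a single hardware instance is assumed, and the comparison against 64 is one reading of the descriptor queue constraint rather than a documented check.

```shell
# Assumption: one hardware instance, so one descriptor queue of 64 entries.
DESCRIPTOR_QUEUE_SIZE=64

# Total iterations across all graphs: -niter x number of graphs.
total_iterations() {   # $1 = -niter, $2 = number of graphs
  echo $(( $1 * $2 ))
}

# Upper bound on queued requests: -nireq x number of graphs.
queued_requests() {    # $1 = -nireq, $2 = number of graphs
  echo $(( $1 * $2 ))
}

# Succeeds when the queued requests fit in the descriptor queue.
fits_descriptor_queue() {   # $1 = -nireq, $2 = number of graphs
  [ "$(queued_requests "$1" "$2")" -le "$DESCRIPTOR_QUEUE_SIZE" ]
}

niter=8; nireq=8; graphs=2
echo "total iterations: $(total_iterations "$niter" "$graphs")"
echo "queued requests:  $(queued_requests "$nireq" "$graphs")"
if fits_descriptor_queue "$nireq" "$graphs"; then
  echo "request queue fits within $DESCRIPTOR_QUEUE_SIZE descriptors"
else
  echo "request queue exceeds $DESCRIPTOR_QUEUE_SIZE descriptors" >&2
fi
```

With -niter=8, -nireq=8, and two graphs, the sketch reports 16 total iterations and 16 queued requests, which is within the 64-entry limit.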

The board you use determines the number of instances that you can compile the FPGA AI Suite hardware for:

  • For the Terasic* DE10-Agilex Development Board, you can compile up to four instances with the same architecture on all instances.

Each instance accesses one of the DDR banks on the board and executes the graph independently. This optimization enables multiple batches to run in parallel, limited by the number of DDR banks available. Each inference request created by the demonstration application is assigned to one of the instances in the FPGA plugin.

To ensure that batches are evenly distributed between the instances, you must choose an inference request batch size that is a multiple of the number of FPGA AI Suite instances. For example, with two instances, specify the batch size as six (instead of the OpenVINO™ default of five) to ensure that the experiment meets this requirement.
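
The rounding in this example can be expressed as a small calculation. The following is a minimal sketch with a hypothetical helper name. It rounds a requested value up to the next multiple of the instance count, and is not part of dla_benchmark.

```shell
# Round a requested count up to the nearest multiple of the instance count.
round_up_to_multiple() {   # $1 = requested count, $2 = number of instances
  echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

round_up_to_multiple 5 2   # prints 6: the default of five, adjusted for two instances
round_up_to_multiple 8 4   # prints 8: already a multiple of four
```

A value that is already a multiple of the instance count is returned unchanged, so the helper can be applied unconditionally.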

The following example usage assumes that:
  • A Model Optimizer IR .xml file is in demo/models/public/resnet-50-tf/FP32/
  • An image set is in demo/sample_images/
  • The board is programmed with a bitstream that corresponds to AGX7_Performance.arch

binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
imgdir=$COREDLA_ROOT/demo/sample_images
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
   -b=1 \
   -m $binxml/resnet-50-tf.xml \
   -d=HETERO:FPGA,CPU \
   -i $imgdir \
   -niter=4 \
   -plugins ./plugins.xml \
   -arch_file $COREDLA_ROOT/example_architectures/AGX7_Performance.arch \
   -api=async \
   -groundtruth_loc $imgdir/TF_ground_truth.txt \
   -perf_est \
   -nireq=8 \
   -bgr
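
Because the command above depends on the listed assumptions, it can help to confirm that the files exist before running it. The following is a hedged sketch: missing_paths is a hypothetical helper, not an FPGA AI Suite tool, and it checks only that the paths exist, not that the board is programmed with the matching bitstream.

```shell
# Print each argument that does not exist; return non-zero if any are missing.
missing_paths() {
  status=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      printf 'missing: %s\n' "$p"
      status=1
    fi
  done
  return "$status"
}

# Falls back to the current directory if COREDLA_ROOT is unset.
root="${COREDLA_ROOT:-.}"

missing_paths \
  "$root/demo/models/public/resnet-50-tf/FP32/resnet-50-tf.xml" \
  "$root/demo/sample_images" \
  "$root/example_architectures/AGX7_Performance.arch" \
  || echo "fix the paths above before running dla_benchmark" >&2
```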

The following example shows how the FPGA AI Suite IP can dynamically swap between graphs. This example usage assumes that another Model Optimizer IR .xml file has been placed in demo/models/public/resnet-101-tf/FP32/ and that another image set has been placed in demo/sample_images_rn101/. In this case, dla_benchmark evaluates the classification accuracy of only the first graph (ResNet50) because no ground truth file is provided for the second graph (ResNet101).

binxml1=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
binxml2=$COREDLA_ROOT/demo/models/public/resnet-101-tf/FP32
imgdir1=$COREDLA_ROOT/demo/sample_images
imgdir2=$COREDLA_ROOT/demo/sample_images_rn101
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
   -b=1 \
   -m $binxml1/resnet-50-tf.xml,$binxml2/resnet-101-tf.xml \
   -d=HETERO:FPGA,CPU \
   -i $imgdir1,$imgdir2 \
   -niter=8 \
   -plugins ./plugins.xml \
   -arch_file $COREDLA_ROOT/example_architectures/AGX7_Performance.arch \
   -api=async \
   -groundtruth_loc $imgdir1/TF_ground_truth.txt \
   -perf_est \
   -nireq=8 \
   -bgr