5.6.1. Inference on Image Classification Graphs
The dla_benchmark demonstration application runs five inference requests (batches) in parallel on the FPGA by default to achieve optimal system performance. To measure steady-state performance, run multiple batches (using the -niter flag) because the first iteration is significantly slower on FPGA devices.
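The warm-up effect can be illustrated with a small sketch (illustrative Python, not part of dla_benchmark): average throughput only over iterations after the first.

```python
def steady_state_fps(latencies_ms):
    """Average throughput (iterations/s), excluding the first iteration.

    The first iteration is significantly slower on FPGA devices, so it
    is dropped before averaging. `latencies_ms` holds one latency per
    iteration, in milliseconds; this helper is illustrative only.
    """
    steady = latencies_ms[1:]
    return 1000.0 * len(steady) / sum(steady)

# A slow warm-up iteration followed by steady iterations:
print(steady_state_fps([250.0, 10.0, 10.0, 10.0, 10.0]))  # 100.0
```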
The dla_benchmark demonstration application also supports multiple graphs in the same execution. You can pass more than one graph or compiled graph as input, separated by commas.
Each graph can have its own input dataset or share a common dataset with all other graphs. Each graph requires its own ground_truth_file file; separate the files with commas. If some ground_truth_file files are missing, the dla_benchmark continues to run and ignores the missing ones.
When multi-graph is enabled, the -niter flag specifies the number of iterations for each graph, so the total number of iterations becomes -niter × the number of graphs.
The dla_benchmark demonstration application switches graphs after submitting -nireq requests. The request queue holds up to -nireq × the number of graphs requests. This limit is constrained by the DMA CSR descriptor queue size (64 per hardware instance).
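These limits can be sanity-checked with a small sketch (illustrative Python; the 64-entry limit is the DMA CSR descriptor queue size mentioned above, and the helper names are hypothetical):

```python
DESCRIPTOR_QUEUE_SIZE = 64  # per hardware instance (DMA CSR limit)

def total_iterations(niter: int, num_graphs: int) -> int:
    # With multi-graph enabled, -niter applies to each graph.
    return niter * num_graphs

def queue_depth(nireq: int, num_graphs: int) -> int:
    # The request queue holds up to -nireq requests per graph.
    return nireq * num_graphs

# Example: -niter=5 and -nireq=4 with two graphs.
print(total_iterations(5, 2))  # 10
assert queue_depth(4, 2) <= DESCRIPTOR_QUEUE_SIZE
```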
The board that you use determines the number of instances for which you can compile the Intel® FPGA AI Suite hardware:
- For the Intel® PAC with Intel® Arria® 10 GX FPGA, you can compile up to two instances with the same architecture on all instances.
- For the Terasic* DE10-Agilex Development Board, you can compile up to four instances with the same architecture on all instances.
Each instance accesses one of the two DDR banks and executes its graph independently, which enables batches to run in parallel. Each inference request created by the demonstration application is assigned to one of the instances in the FPGA plugin.
To ensure that batches are evenly distributed between the instances, you must choose an inference request batch size that is a multiple of the number of Intel® FPGA AI Suite instances. For example, with two instances, specify a batch size of six (instead of the OpenVINO™ default of five) to meet this requirement.
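One way to satisfy this constraint is to round the request count up to the next multiple of the instance count (illustrative Python; padded_nireq is a hypothetical helper, not a dla_benchmark option):

```python
def padded_nireq(requested: int, num_instances: int) -> int:
    """Round `requested` up to the nearest multiple of `num_instances`,
    so inference requests divide evenly across the FPGA instances."""
    return -(-requested // num_instances) * num_instances

# With two instances, the default of five requests rounds up to six:
print(padded_nireq(5, 2))  # 6
print(padded_nireq(4, 2))  # 4
```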
The following example usage assumes that a Model Optimizer IR .xml file has been placed in demo/models/public/resnet-50-tf/FP32/, that an image set has been placed into demo/sample_images/, and that the FPGA has been programmed with a bitstream corresponding to A10_Performance.arch.
binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
imgdir=$COREDLA_ROOT/demo/sample_images
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
  -b=1 \
  -m $binxml/resnet-50-tf.xml \
  -d=HETERO:FPGA,CPU \
  -i $imgdir \
  -niter=5 \
  -plugins_xml_file ./plugins.xml \
  -arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
  -api=async \
  -groundtruth_loc $imgdir/TF_ground_truth.txt \
  -perf_est \
  -nireq=4 \
  -bgr
The following example shows how the IP can dynamically swap between graphs. This example usage assumes that another Model Optimizer IR .xml file has been placed in demo/models/public/resnet-101-tf/FP32/ and that another image set has been placed into demo/sample_images_rn101/. In this case, dla_benchmark evaluates the classification accuracy of ResNet-50 only, because no ground truth file is provided for the second graph (ResNet-101).
binxml1=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
binxml2=$COREDLA_ROOT/demo/models/public/resnet-101-tf/FP32
imgdir1=$COREDLA_ROOT/demo/sample_images
imgdir2=$COREDLA_ROOT/demo/sample_images_rn101
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
  -b=1 \
  -m $binxml1/resnet-50-tf.xml,$binxml2/resnet-101-tf.xml \
  -d=HETERO:FPGA,CPU \
  -i $imgdir1,$imgdir2 \
  -niter=5 \
  -plugins_xml_file ./plugins.xml \
  -arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
  -api=async \
  -groundtruth_loc $imgdir1/TF_ground_truth.txt \
  -perf_est \
  -nireq=4 \
  -bgr