2.5.2.1. Parameter Group: Global Parameters

Parameter: `family`

This parameter specifies the target FPGA device family for the architecture.

Legal Values

Table 3. Valid Values for `family` Global Parameter
Value	Description
`A10`	Target Arria® 10 devices.
`AGX5`	Target Agilex™ 5 devices.
`AGX7`	Target Agilex™ 7 devices.
`C10`	Target Cyclone® 10 devices.
`S10`	Target Stratix® 10 devices.

Parameter: `k_vector`

This parameter, also called KVEC, describes the number of filters that the PE Array is able to process simultaneously.

Typically the architecture optimizer is used to set this parameter.

Legal values:

[4-128]

The k_vector value must be a multiple of the c_vector value.
The k_vector value must be divisible by the xbar_k_vector and auxiliary k_vector values.
When you use the depthwise module, the k_vector value must equal the c_vector value.

Parameter: `c_vector`

This parameter, also called CVEC, describes the size of the dot product within each PE in the PE Array.

Typically the architecture optimizer is used to set this parameter.

Legal values:

[4,8,16,32,64]

When you use the depthwise module, the c_vector value must equal the k_vector value.

Parameter: `num_lanes`

This parameter describes how many output-height slices the architecture can compute in parallel.

Using the num_lanes architecture parameter has the following effects:

Setting the num_lanes parameter scales the PE array in the FPGA AI Suite IP by the given number and provides additional parallelism at the cost of more DSPs and area.
The total stream buffer size scales with the num_lanes parameter. Because the feature surface of a graph is divided across multiple lanes, adjust the stream_buffer_depth parameter listed in the .arch file by the inverse of the num_lanes parameter value. For example, a 4-lane architecture with 10k stream buffer depth indicates a 40k total stream buffer size.

When the value of the num_lanes parameter of architecture is greater than 1, the architecture is subject to the following limitations:

All c_vector and k_vector values in the architecture must be the same
The softmax auxiliary module is not supported.

Legal values:: [1,2,4]

Parameter: `arch_precision`

This parameter sets the precision (in bits) of the internal numeric representation used by FPGA AI Suite IP. Lower values increase fps and reduce area, but at the cost of inference accuracy.

Each internal precision option corresponds to a different number of sign and mantissa bits, and uses either two's complement or sign+magnitude. For details, refer to the table in FPGA AI Suite IP Block Configuration.

The FP16 precision significantly increases the size of the resulting IP, but can improve accuracy (particularly in models that have not been retrained for low precision).

All numeric options use block floating point format. In block floating point format, each block of size CVEC shares a common exponent. Both CVEC (c_vector) and arch_precision affect the accuracy of the inference. However, the impact of c_vector is generally small, while the impact of the arch_precision setting is relatively large.

The block floating point format used by the FPGA AI Suite IP is directly compatible with graphs that use INT8 symmetric quantization. INT8 symmetric quantization requires that all operations going from floating point to integer, and vice versa, require only scaling (multiplication or division). When given a graph with INT8 weights, the FPGA AI Suite compiler sets the exponent of the block floating point weights so the original INT8 weights can be used directly as the mantissa. This setting limits the use of INT8 weights to architectures where the mantissa is 8-bits or larger.

The use of INT8 graphs does not significantly affect either the inference speed or the FPGA resource consumption. All inference, regardless of whether block floating point is used, is performed with the same hardware.

In addition to selecting a compatible numeric precision, set the pe_array/enable_scale parameter to true in order to support graphs with INT8 quantization.

The example architectures that are included with the FPGA AI Suite are already set to the recommended arch_precision parameter values for their supported FPGA family. In some cases, it is useful to select a different arch_precision value. FP11 is the lowest precision option, but requires the least number of RAM blocks, and slightly reduces the amount of external memory traffic. The FP12AGX significantly reduces the number of DSPs required to implement the PE array, but logic utilization may increase.

Agilex™ 5 devices implement enhanced DSPs with AI tensor blocks. To take advantage of AI tensor blocks, set the arch_precision value to FP12AGX or FP11 and use interleave values of 12x1.

For more details about the block floating point format, refer to the Low-Precision Networks for Efficient Inference on FPGAs white paper.

Legal values:

FPGA Device Family	Supported `arch_precision` Values
Agilex™ 5	FP11 FP12AGX FP13AGX FP16 (less common)
Agilex™ 7	FP11 FP13AGX FP16 (less common)
Arria® 10	FP11 FP16 (less common)
Cyclone® 10 GX	FP11 FP16 (less common)
Stratix® 10	FP11 FP16 (less common)

Table 4. Multiplication Operations per DSP
Precision	FPGA Device Family
Precision	Arria® 10 Cyclone® 10 GX Stratix® 10	Agilex™ 7	Agilex™ 5
FP11	4	4	20 ⁴
FP12AGX	–	–	20⁴
FP13AGX	–	4	4^**
FP16	2	2	2

INT8 symmetric quantization is enabled by FP12AGX and higher precision options.

The total number of multipliers required by the PE Array will be equal to CVEC * KVEC * num_lanes. Due to quantization, this calculation underpredicts the number of DSPs required when using Agilex™ 5 DSP tensor mode. In addition, the PE Array requires KVEC * num_lanes DSPs to build the FP32 accumulators.

Parameter: `stream_buffer_depth`

This parameter controls the depth of the stream buffer. The stream buffer is used as the on-chip cache for feature (image) data. Larger values increase area (logic and block RAM) but also increase performance.

Typically the architecture optimizer is used to set this parameter.

Legal values:: [2048-262144]

Parameter: `enable_eltwise_mult`

This parameter enables the Elementwise multiplication layer. This layer is required for MobileNetV3.

Parameters: `filter_size_width_max`, `filter_size_height_max`

These parameters determine the maximum size of a convolution filter, which also relates the maximum window size for Average Pool.

The maximum window size for Average Pool is no larger than the value determined by the following formula: $\min (filter_size_width_max, file_size_height_max) - 1$ . In addition, the Average Pool window size may be limited by the filter_scratchpad and filter_depth parameters.

Legal values:: [14,28]

Parameters: `output_image_height_max`, `output_image_width_max`, `output_channels_max`

These parameters control the maximum size of the output tensor.

The default maximum size is 128x128, with up to 8192 channels

Parameter: `enable_debug`

This parameter toggles the FPGA AI Suite debug network to allow forwarding of read requests from the CSR to one of many externally-attached debug-capable modules.

Generally not required for production architectures.

Legal values:: [true,false]

Parameter: `enable_layout_transform`

The parameter enables the dedicated input tensor transform module in the FPGA AI Suite IP. When enabled, the dedicated layout transform hardware transforms the input tensor format and folds the inputs into channels.

You can use the layout transform in streaming and non-streaming configurations of the FPGA AI Suite IP. The transform is particularly useful for doing fast and deterministic tensor preprocessing in hostless applications, or applications where the hard-processor is slow or highly loaded.

However, the layout transform comes with an FPGA area cost that scales mainly with the input data bus width, maximum tensor/stride dimensions, and CVEC. In particular, instances where the value of max_stride_width × max_stride_height × max_channels is greater than the CVEC value consume significant memory resources due to the buffer space required for the overflowing CVEC.

In graphs where the first convolution stride dimensions are unity, no folding can be done, and the layout transform cannot optimize the layout of the input tensor. In such a case, try doing a lighter-weight transformation operation outside of the FPGA AI Suite IP.

When this parameter is enabled, configure the transform with the parametersas described in Parameter Group: layout_transform_params. For information about the layout transformation operation and hardware, refer to Input Layout Transform Hardware.

The hardware layout transform is not supported in SoC designs in streaming-to-memory (S2M) mode. the S2M design uses a lightweight, external transform module.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

FPGA AI Suite: IP Reference Manual