37.3.2. Coefficient Quantization
- Signed or unsigned: to represent negative coefficients, turn on Use signed vertical coefficients. If no coefficient is negative, turn off Use signed vertical coefficients to reduce logic.
- Integer bits: the number of integer bits defines the maximum value that can be represented.
- Fraction bits: the number of fraction bits defines the precision with which the IP can convert floating-point coefficients into the fixed-point format.
The overall bit width of each coefficient is the sum of the integer and fraction bits, plus one extra bit for signed coefficients. When using Lanczos coefficients, Intel recommends the following settings:
- Turn on Use signed vertical coefficients. The Lanczos function for any number of lobes greater than 1 requires negative values, so the coefficients must be signed.
- Use 1 integer bit, because the maximum value required for any Lanczos coefficient is 1.0.
- Use between 6 and 8 fraction bits.
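As a quick check of the bit-width rule above, the following sketch computes the coefficient width, the step size, and the largest magnitude for a given format. The function names are hypothetical and are not part of the IP. The sketch also assumes the conventional fixed-point interpretation in which the step size is 2^-frac; the IP's internal representation is not described here.

```c
#include <assert.h>
#include <stdbool.h>

/* Total bits per coefficient: integer bits + fraction bits,
 * plus one extra bit when the coefficients are signed. */
int coeff_bit_width(int int_bits, int frac_bits, bool is_signed) {
    assert(int_bits >= 0 && frac_bits >= 0);
    return int_bits + frac_bits + (is_signed ? 1 : 0);
}

/* Step between adjacent representable values: 2^-frac_bits
 * (assumes a conventional fixed-point format). */
double coeff_step(int frac_bits) {
    assert(frac_bits >= 0 && frac_bits < 31);
    return 1.0 / (double)(1L << frac_bits);
}

/* Largest magnitude the integer and fraction bits can hold:
 * (2^(int_bits + frac_bits) - 1) / 2^frac_bits
 * (same assumption as coeff_step). */
double coeff_max_magnitude(int int_bits, int frac_bits) {
    assert(int_bits >= 0 && frac_bits >= 0 && int_bits + frac_bits < 31);
    return (double)((1L << (int_bits + frac_bits)) - 1) /
           (double)(1L << frac_bits);
}
```

With the recommended Lanczos settings of 1 integer bit and 8 fraction bits, signed, coeff_bit_width(1, 8, true) returns 10 and the step size is 1/256. With 0 integer bits the largest magnitude would fall just short of 1.0, which is why 1 integer bit is needed to represent a coefficient of exactly 1.0.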
Typically, the filter coefficients produce noninteger floating-point values. To convert each floating-point coefficient into its closest quantized representation in the selected fixed-point format:
- Multiply each coefficient by 2^frac, where frac is the number of fraction bits you select.
- Apply float-to-integer conversion to each scaled coefficient.
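The two steps above can be sketched as follows. The function names are hypothetical. Rounding to the nearest integer is an assumption made here to obtain the closest quantized representation; a plain C cast truncates toward zero instead.

```c
#include <assert.h>

/* Round to the nearest integer, ties away from zero.
 * Assumption: nearest rounding is the intended conversion. */
int round_to_int(float x) {
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Convert floating-point coefficients to fixed point:
 * scale by 2^frac_bits, then convert to integer. */
void quantize_coeffs(const float *coeffs, int *quant_coeff,
                     int taps, int frac_bits) {
    assert(coeffs != 0 && quant_coeff != 0);
    assert(taps > 0 && frac_bits >= 0 && frac_bits < 30);
    const float scale = (float)(1 << frac_bits); /* 2^frac_bits */
    for (int j = 0; j < taps; j++) {
        quant_coeff[j] = round_to_int(coeffs[j] * scale);
    }
}
```

For example, with 8 fraction bits a coefficient of 0.5625 becomes 144 and a coefficient of -0.0625 becomes -16.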
However, the small errors that quantization introduces in the coefficient values can accumulate so that the coefficients in each phase no longer sum to their intended value. Generally, the coefficients in any phase should sum to exactly 1.0. A sum greater than 1.0 increases the overall brightness of the resulting image; a sum less than 1.0 reduces it. You can choose coefficients that sum to more or less than 1.0 if you want a brighter or darker image, but ensure that after quantization they still sum to the value you originally intended. To restore the coefficients to values that sum to the intended value:
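    /* Running difference between the ideal and the quantized coefficients.
       original_float_coeff[j] is assumed to hold the coefficient already
       multiplied by 2^frac, so both terms use the same scale. */
    float quantization_error = 0.0;
    for (int j = 0; j < taps; j++) {
        quantization_error += original_float_coeff[j] - ((float)quant_coeff[j]);
        if (quantization_error < -0.5) {
            /* Quantized values are running too high: step this one down. */
            quant_coeff[j]--;
            quantization_error += 1.0;
        } else if (quantization_error > 0.5) {
            /* Quantized values are running too low: step this one up. */
            quant_coeff[j]++;
            quantization_error -= 1.0;
        }
    }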
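To see the correction restore the sum of a phase, the sketch below combines quantization and error compensation in one pass and adds a helper that sums the result. The function names are hypothetical, nearest rounding is an assumption, and the thresholds of -0.5 and +0.5 are chosen so that the running error stays within half a quantization step.

```c
#include <assert.h>

/* Round to the nearest integer, ties away from zero (assumed rounding mode). */
int round_to_int(float x) {
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Quantize one phase and nudge individual coefficients by one step
 * whenever the accumulated rounding error exceeds half a step. */
void quantize_phase(const float *coeffs, int *quant_coeff,
                    int taps, int frac_bits) {
    assert(coeffs != 0 && quant_coeff != 0);
    assert(taps > 0 && frac_bits >= 0 && frac_bits < 30);
    const float scale = (float)(1 << frac_bits); /* 2^frac_bits */
    float quantization_error = 0.0f;
    for (int j = 0; j < taps; j++) {
        const float scaled = coeffs[j] * scale;
        quant_coeff[j] = round_to_int(scaled);
        quantization_error += scaled - (float)quant_coeff[j];
        if (quantization_error < -0.5f) {
            quant_coeff[j]--;            /* running total too high */
            quantization_error += 1.0f;
        } else if (quantization_error > 0.5f) {
            quant_coeff[j]++;            /* running total too low */
            quantization_error -= 1.0f;
        }
    }
}

/* Sum of the quantized coefficients in one phase. */
int sum_coeffs(const int *quant_coeff, int taps) {
    int sum = 0;
    for (int j = 0; j < taps; j++) {
        sum += quant_coeff[j];
    }
    return sum;
}
```

For three taps of 1/3 each with 8 fraction bits, plain rounding gives 85, 85, 85, which sums to 255. This sketch yields 85, 86, 85, which sums to 256, the fixed-point representation of 1.0.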