5.2.2.1. Multiply and Divide Performance

Nios II Classic Processor Reference Guide

ID 683620

Date 10/28/2016

Version current

Public

The Nios II/f core provides the following hardware multiplier options:

DSP Block—Includes DSP block multipliers available on the target device. This option is available only on Intel FPGAs that have DSP Blocks.
Embedded Multipliers—Includes dedicated embedded multipliers available on the target device. This option is available only on Intel FPGAs that have embedded multipliers.
Logic Elements—Includes hardware multipliers built from logic element (LE) resources.
None—Does not include multiply hardware. In this case, multiply operations are emulated in software.
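To give a feel for what software emulation of a multiply involves, here is a minimal shift-and-add sketch in Python. It is an illustration only, not the actual Nios II emulation routine: the name `soft_mul32` is hypothetical, and the sketch assumes the emulated instruction only needs the low 32 bits of the product, as `mul` does.

```python
MASK32 = 0xFFFFFFFF  # keep values within a 32-bit register


def soft_mul32(a, b):
    """Return the low 32 bits of a * b using only shifts and adds.

    Hypothetical helper for illustration; one loop iteration per
    significant bit of the multiplier b.
    """
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next partial product
        b >>= 1                        # move to the next multiplier bit
    return product


print(soft_mul32(6, 7))
```

A loop like this takes many instructions per multiply, which is why the hardware options listed above can make a large difference for multiply-heavy code.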

The Nios II/f core also provides a hardware divide option that includes LE-based divide circuitry in the ALU.

Including an ALU option improves the performance of one or more arithmetic instructions.

Note: The performance of the embedded multipliers differs, depending on the target FPGA family.

Table 66. Hardware Multiply and Divide Details for the Nios II/f Core
ALU Option	Hardware Details	Cycles per Instruction	Result Latency Cycles	Supported Instructions
No hardware multiply or divide	Multiply and divide instructions generate an exception	–	–	None
Logic elements	ALU includes 32 x 4-bit multiplier	11	+2	`mul`, `muli`
DSP block on Stratix III families	ALU includes 32 x 32-bit multiplier	1	+2	`mul`, `muli`, `mulxss`, `mulxsu`, `mulxuu`
Embedded multipliers on Cyclone III families	ALU includes 32 x 16-bit multiplier	5	+2	`mul`, `muli`
Hardware divide	ALU includes multicycle divide circuit	4 – 66	+2	`div`, `divu`

The cycles per instruction value determines the maximum rate at which the ALU can dispatch instructions and produce each result. The latency value determines when the result becomes available. If there is no data dependency between the results and operands for back-to-back instructions, then the latency does not affect throughput. However, if an instruction depends on the result of an earlier instruction, then the processor stalls through any result latency cycles until the result is ready.
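The relationship between the two columns can be sketched with a small calculation. The Python below is a simplified model, not a cycle-accurate simulator: it assumes a run of back-to-back multiplies with no data dependencies between them, so the ALU dispatches one every "cycles per instruction" and the result latency is paid only once, for the final result. The dictionary keys and the function name `independent_mul_cycles` are made up for this illustration; the numbers are transcribed from Table 66.

```python
# (cycles per instruction, result latency cycles), transcribed from Table 66
MULTIPLY_OPTIONS = {
    "logic_elements": (11, 2),
    "dsp_block_stratix_iii": (1, 2),
    "embedded_multipliers_cyclone_iii": (5, 2),
}


def independent_mul_cycles(option, count):
    """Cycles until the last of `count` independent multiplies has its result.

    Simplified model: dispatch rate is set by cycles per instruction,
    and the result latency only delays the final result.
    """
    if count <= 0:
        return 0
    cycles_per_instruction, latency = MULTIPLY_OPTIONS[option]
    return count * cycles_per_instruction + latency


for name in MULTIPLY_OPTIONS:
    print(name, independent_mul_cycles(name, 100))
```

Under these assumptions, 100 independent multiplies take 102 cycles with the DSP block option, 502 with embedded multipliers, and 1102 with logic elements. The latency term is negligible here; it only matters when a following instruction needs the result.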

In the following code example, a multiply operation (1 cycle per instruction, 2 result latency cycles) is followed immediately by an add operation that uses the result of the multiply. On the Nios II/f core, the `addi` instruction, like most ALU instructions, executes in a single cycle. In this example, however, the `addi` instruction stalls for two additional cycles until the multiply result is available.

mul r1, r2, r3        ; r1 = r2 * r3
addi r1, r1, 100      ; r1 = r1 + 100 (Depends on result of mul)

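In contrast, the following code does not stall the processor, because the two independent `or` instructions execute during the two result latency cycles of the multiply.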



mul r1, r2, r3        ; r1 = r2 * r3
or r5, r5, r6         ; No dependency on previous results
or r7, r7, r8         ; No dependency on previous results
addi r1, r1, 100      ; r1 = r1 + 100 (Depends on result of mul)
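The stall behavior of the two sequences above can be checked with a small timing model. This Python sketch is an approximation under stated assumptions, not a description of the real pipeline: instructions issue in order, `mul` uses the DSP block figures from Table 66 (1 cycle, +2 latency), and `or` and `addi` are assumed to take 1 cycle with no extra result latency. The `Instr` class and `stall_cycles` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    cycles: int = 1    # cycles per instruction
    latency: int = 0   # additional result latency cycles


def stall_cycles(program):
    """Count cycles lost waiting for operands in an in-order sequence."""
    ready = {}       # register -> first cycle its value can be used
    next_issue = 0   # earliest cycle the next instruction could start
    stalls = 0
    for ins in program:
        operands_ready = max((ready.get(r, 0) for r in ins.srcs), default=0)
        start = max(next_issue, operands_ready)
        stalls += start - next_issue
        next_issue = start + ins.cycles
        ready[ins.dest] = next_issue + ins.latency
    return stalls


def mul(dest, a, b):
    # Assumed DSP block timing: 1 cycle per instruction, +2 result latency
    return Instr("mul", dest, (a, b), cycles=1, latency=2)


dependent = [
    mul("r1", "r2", "r3"),
    Instr("addi", "r1", ("r1",)),
]

independent = [
    mul("r1", "r2", "r3"),
    Instr("or", "r5", ("r5", "r6")),
    Instr("or", "r7", ("r7", "r8")),
    Instr("addi", "r1", ("r1",)),
]

print(stall_cycles(dependent), stall_cycles(independent))
```

In this model the first sequence loses two cycles and the second loses none, matching the description above: the two `or` instructions occupy exactly the cycles that `addi` would otherwise spend waiting.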
