Intel® Hyperflex™ Architecture High-Performance Design Handbook

ID 683353
Date 12/08/2023
Public

2.4.1.4.2. Loop Pipelining Demonstration

The following demonstration shows how proper loop pipelining optimizes an accumulator in an example design. In the original implementation, the accumulator multiplies the data input in by x and adds the result to the previous output value out multiplied by y. This demonstration improves performance using these techniques:
  1. Implement separation of forward logic
  2. Retime the loop register
  3. Create the feedback loop equivalence with cascade logic
Figure 54. Original Loop Structure

Original Loop Structure Example Verilog HDL Code

module orig_loop_strct (rstn, clk, in, x, y, out);
   input clk, rstn, in, x, y;
   output out;
   reg    out;
   reg in_reg;

always @ ( posedge clk )
   if ( !rstn ) begin
      in_reg <= 1'b0;
   end else begin
      in_reg <= in;
   end

// Feedback loop: out depends on its own previous value through a multiply and an add
always @ ( posedge clk )
   if ( !rstn ) begin
      out <= 1'b0;
   end else begin
      out <= y*out + x*in_reg;
   end
endmodule //orig_loop_strct

The first stage of optimization rewrites the logic to move as much work as possible out of the feedback loop and into a forward logic block. The Compiler cannot automatically optimize any logic in a feedback loop. Consider the following recommendations in removing logic from the loop:

  • In advance of the loop, evaluate as many decisions and perform as many calculations as possible that do not directly rely on the loop value.
  • Where possible, pass this logic through a register stage before it enters the loop.

After you rewrite the logic, the Compiler can freely retime the logic that you moved to the forward path.

Figure 55. Separation of Forward Logic from the Loop
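
As a rough illustration of this first stage, the sketch below computes the x*in_reg product in the forward path and registers it before it enters the loop, so that only the multiply by y and the add remain between out and its own input. The module name fwd_sep_sketch and the register fwd_reg are hypothetical and do not come from the handbook. The extra register also adds one clock cycle of input-to-output latency compared with orig_loop_strct, which this sketch assumes the surrounding design can tolerate.

```verilog
// Hypothetical sketch: forward logic separated from the feedback loop.
// Not the handbook's code; fwd_reg adds one cycle of latency.
module fwd_sep_sketch (rstn, clk, in, x, y, out);
   input clk, rstn, in, x, y;
   output out;
   reg    out;
   reg in_reg;
   reg fwd_reg;   // hypothetical register holding the forward-path result

// Forward path: no dependence on out, so this logic can be retimed freely
always @ ( posedge clk )
   if ( !rstn ) begin
      in_reg  <= 1'b0;
      fwd_reg <= 1'b0;
   end else begin
      in_reg  <= in;
      fwd_reg <= x*in_reg;
   end

// Feedback loop: only the y multiply and the add remain in the loop
always @ ( posedge clk )
   if ( !rstn ) begin
      out <= 1'b0;
   end else begin
      out <= y*out + fwd_reg;
   end
endmodule //fwd_sep_sketch
```

The point of the split is that everything in the first always block is feed-forward, so additional pipeline registers can be placed there without changing the loop behavior.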

In the next optimization stage, retime the loop register, while ensuring that the design still functions the same as the original loop circuitry.

Figure 56. Retime Loop Register

Finally, further optimize the loop by repeating the first optimization steps with the logic in the highlighted boundary.

Figure 57. Results of Cascade Loop Logic, Hyper-Retimer, and Synthesis Optimizations (Four Level Optimization)
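
The equivalence between the cascaded loop and the original loop can be checked with a short derivation. The symbols u[n] and s[n] are introduced here for illustration only: u[n] stands for the forward term x*in_reg at clock cycle n, and s[n] stands for the value of out_add1 in the four-level code that follows. The derivation assumes that y stays constant while the loop runs, that all registers reset to zero, and that the arithmetic is ordinary fixed-width (modular) arithmetic.

```latex
\begin{align*}
\text{forward term:}\quad u[n] &= x \cdot \mathit{in\_reg}[n] \\
\text{original loop:}\quad \mathit{out}[n+1] &= y\,\mathit{out}[n] + u[n] \\
\text{cascaded loop:}\quad s[n] &= u[n] + y^{4}\, s[n-4] \\
\text{cascade output:}\quad \mathit{out}[n+1] &= s[n] + y\,s[n-1] + y^{2} s[n-2] + y^{3} s[n-3] \\
\text{shifted by one cycle:}\quad y\,\mathit{out}[n] &= y\,s[n-1] + y^{2} s[n-2] + y^{3} s[n-3] + y^{4} s[n-4] \\
\text{difference:}\quad \mathit{out}[n+1] - y\,\mathit{out}[n] &= s[n] - y^{4} s[n-4] = u[n]
\end{align*}
```

The last line is the original recurrence, so under these assumptions both structures produce the same out sequence. The only remaining feedback path is the one through s[n-4], which contains four registers instead of one.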

Four Level Optimization Example Verilog HDL Code

module cll_hypr_rtm_synopt ( rstn, clk, x, y, in, out);
   input rstn, clk, x, y, in;

   output out;
   reg    out;

   reg in_reg;

   wire out_add1;
   wire out_add2;
   wire out_add3;
   wire out_add4;

   reg out_add1_reg1;
   reg out_add1_reg2;
   reg out_add1_reg3;
   reg out_add1_reg4;

always @ ( posedge clk )
   if ( !rstn ) begin
      in_reg <= 0;
   end else begin
      in_reg <= in;
   end

// Four-deep delay line of out_add1; all four registers sit in the cascaded loop
always @ ( posedge clk )
   if ( !rstn ) begin
      out_add1_reg1 <= 0;
      out_add1_reg2 <= 0;
      out_add1_reg3 <= 0;
      out_add1_reg4 <= 0;
   end else begin
      out_add1_reg1 <= out_add1;
      out_add1_reg2 <= out_add1_reg1;
      out_add1_reg3 <= out_add1_reg2;
      out_add1_reg4 <= out_add1_reg3;
   end

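// Loop stage: the only feedback path, through four registers and four multiplies by y
assign out_add1 = x*in_reg  + ((((y*out_add1_reg4)*y)*y)*y);
// Feed-forward cascade: add the taps weighted by y, y*y, and y*y*y outside the loop
assign out_add2 = out_add1 + (y*out_add1_reg1);
assign out_add3 = out_add2 + ((y*out_add1_reg2)*y);
assign out_add4 = out_add3 + (((y*out_add1_reg3)*y)*y);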

always @ ( posedge clk ) begin
   if ( !rstn )
      out <= 0;
   else
      out <= out_add4;
end
endmodule //cll_hypr_rtm_synopt
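
One way to gain confidence in the rewrite is to simulate both modules side by side. The testbench below is a minimal sketch, not part of the handbook: the module name tb_loop_equiv, the task run_case, and the stimulus choices are all hypothetical. It assumes that orig_loop_strct and cll_hypr_rtm_synopt are compiled together with it, and it holds y constant within each run, because the cascaded form only matches the original under that assumption.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench sketch: compares the original and four-level modules.
module tb_loop_equiv;
   reg clk, rstn, in, x, y;
   wire out_orig, out_cll;
   integer i, errors;

   orig_loop_strct     ref_dut (.rstn(rstn), .clk(clk), .in(in), .x(x), .y(y), .out(out_orig));
   cll_hypr_rtm_synopt opt_dut (.rstn(rstn), .clk(clk), .in(in), .x(x), .y(y), .out(out_cll));

   // Free-running clock with a 10 ns period
   initial clk = 1'b0;
   always #5 clk = ~clk;

   // Reset both designs, then drive random in and x while y is held constant
   task run_case (input y_val);
      begin
         rstn = 1'b0; in = 1'b0; x = 1'b0; y = y_val;
         repeat (6) @(negedge clk);
         rstn = 1'b1;
         for (i = 0; i < 200; i = i + 1) begin
            @(negedge clk);
            // Compare on the falling edge, after both outputs have settled
            if (out_orig !== out_cll) begin
               errors = errors + 1;
               $display("Mismatch at time %0t: orig=%b cll=%b", $time, out_orig, out_cll);
            end
            in = $random;
            x  = $random;
         end
      end
   endtask

   initial begin
      errors = 0;
      run_case(1'b0);
      run_case(1'b1);
      if (errors == 0)
         $display("Outputs matched on every compared cycle");
      else
         $display("%0d mismatching cycles", errors);
      $finish;
   end
endmodule //tb_loop_equiv
```

Because every signal in the example design is one bit wide, this stimulus only exercises single-bit arithmetic. A wider version of the design would need correspondingly wider stimulus and a separate check of overflow behavior.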