Intel® Hyperflex™ Architecture High-Performance Design Handbook

ID 683353
Date 12/08/2023
Public

4.1.4. Step 4: Optimize Short Path and Long Path Conditions

After removing asynchronous registers and adding pipeline stages, the Fast Forward Details report suggests that short path and long path conditions limit further optimization. In this example, the longest path limits the fMAX for this specific clock domain. To increase the performance, follow these steps to reduce the length of the longest path for this clock domain.
  1. To view the long path information, click the Critical Chain Details tab in the Fast Forward Details report. Review the structure of the logic around this path, and consider the associated RTL code. This path involves the node module of the node.v file. The critical path relates to the computation of registers data_hi and data_lo, which are part of several comparators.

    The following shows the original RTL for this path:

    always @(*)
      begin : comparator
        if(data_a < data_b) begin
          sel0 = 1'b0; // data_a : lo / data_b : hi
        end else begin
          sel0 = 1'b1; // data_b : lo / data_a : hi
        end
      end
    
    always @(*)
        begin : mux_lo_hi
            case (sel0)
                1'b0 :
                begin
                    if(LOW_MUX == 1)
                        data_lo = data_a;
                    if(HI_MUX == 1)
                        data_hi = data_b;
                end
                1'b1 :
                begin
                    if(LOW_MUX == 1)
                        data_lo = data_b;
                    if(HI_MUX == 1)
                        data_hi = data_a;
                end
                default :
                begin
                    data_lo = {DATA_WIDTH{1'b0}};
                    data_hi = {DATA_WIDTH{1'b0}};
                end
            endcase
        end

    The Compiler infers the following logic from this RTL:

    • A comparator that creates the sel0 signal
    • A pair of muxes that create the data_hi and data_lo signals, as the following figure shows:
    Figure 99. Node Component Connections
  2. Review the pixel_network.v file that instantiates the node module. The instantiation leaves any unused node outputs unconnected, so the LOW_MUX and HI_MUX checks in the code are unnecessary. Rather than inferring muxes, use bitwise logic operations to compute the values of the data_hi and data_lo signals, as the following example shows:
    reg [DATA_WIDTH-1:0] sel0;

    always @(*)
      begin : comparator
        // Replicate the comparison result across the full data width
        if(data_a < data_b) begin
          sel0 = {DATA_WIDTH{1'b0}}; // data_a : lo / data_b : hi
        end else begin
          sel0 = {DATA_WIDTH{1'b1}}; // data_b : lo / data_a : hi
        end

        // AND-OR select: sel0 passes one operand, ~sel0 passes the other
        data_lo = (data_b & sel0) | (data_a & ~sel0);
        data_hi = (data_a & sel0) | (data_b & ~sel0);
      end
  3. Compile the design again and view the Fast Forward Details report. The performance increase is similar to the estimates, and short path and long path conditions no longer limit further performance. After this step, only a logical loop limits further performance.
    Figure 100. Short Path and Long Path Conditions Optimized
    Note: As an alternative to completing the preceding steps, you can open and compile the Median_filter_<version>/Final/median.qpf project file that already includes these changes, and then observe the results.
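
The bitwise rewrite in step 2 can be checked against the mux form with a small simulation. The following is a minimal sketch under stated assumptions, not the actual node module from node.v: the module names node_mux, node_mask, and tb_node_equiv and their port lists are hypothetical, and the mux form assumes that LOW_MUX and HI_MUX are both 1. Only data_a, data_b, data_lo, data_hi, sel0, and DATA_WIDTH come from the RTL shown above.

```verilog
// Hypothetical reference form: a 1-bit select driving two muxes
// (assumes LOW_MUX == 1 and HI_MUX == 1 in the original RTL).
module node_mux #(parameter DATA_WIDTH = 8) (
  input  wire [DATA_WIDTH-1:0] data_a,
  input  wire [DATA_WIDTH-1:0] data_b,
  output wire [DATA_WIDTH-1:0] data_lo,
  output wire [DATA_WIDTH-1:0] data_hi
);
  wire sel0 = !(data_a < data_b);  // 0: data_a is lo, 1: data_b is lo
  assign data_lo = sel0 ? data_b : data_a;
  assign data_hi = sel0 ? data_a : data_b;
endmodule

// Hypothetical mask form: the select is replicated across the data
// width and combined with the operands using AND/OR logic.
module node_mask #(parameter DATA_WIDTH = 8) (
  input  wire [DATA_WIDTH-1:0] data_a,
  input  wire [DATA_WIDTH-1:0] data_b,
  output wire [DATA_WIDTH-1:0] data_lo,
  output wire [DATA_WIDTH-1:0] data_hi
);
  wire [DATA_WIDTH-1:0] sel0 = {DATA_WIDTH{!(data_a < data_b)}};
  assign data_lo = (data_b & sel0) | (data_a & ~sel0);
  assign data_hi = (data_a & sel0) | (data_b & ~sel0);
endmodule

// Self-checking testbench: applies every input pair for a small width
// and counts any mismatch between the two forms.
module tb_node_equiv;
  localparam DATA_WIDTH = 4;
  reg  [DATA_WIDTH-1:0] data_a, data_b;
  wire [DATA_WIDTH-1:0] lo_mux, hi_mux, lo_mask, hi_mask;
  integer a, b, errors;

  node_mux  #(.DATA_WIDTH(DATA_WIDTH)) u_mux
    (.data_a(data_a), .data_b(data_b), .data_lo(lo_mux),  .data_hi(hi_mux));
  node_mask #(.DATA_WIDTH(DATA_WIDTH)) u_mask
    (.data_a(data_a), .data_b(data_b), .data_lo(lo_mask), .data_hi(hi_mask));

  initial begin
    errors = 0;
    for (a = 0; a < (1 << DATA_WIDTH); a = a + 1)
      for (b = 0; b < (1 << DATA_WIDTH); b = b + 1) begin
        data_a = a;
        data_b = b;
        #1;  // let the combinational outputs settle
        if (lo_mux !== lo_mask || hi_mux !== hi_mask)
          errors = errors + 1;
      end
    if (errors == 0) $display("PASS: mask form matches mux form");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

The exhaustive loop is practical only for small DATA_WIDTH values. For wider data, random vectors or a formal equivalence check would be a more reasonable choice. Note that omitting the inversion on either AND term makes data_lo and data_hi identical, which this testbench reports as mismatches.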