Visible to Intel only — GUID: mtr1430270886847
Ixiasoft
Visible to Intel only — GUID: mtr1430270886847
Ixiasoft
6.1. Round Robin Scheduler
The round robin scheduler is a basic functional block. The following example uses a modulus operator to determine the next client for service. The modulus operator is relatively slow and area inefficient because the modulus operator performs division.
Source Code for Round Robin Scheduler
module round_robin_modulo (last, requests, next); parameter CLIENTS = 7; parameter LOG2_CLIENTS = 3; // previous client to be serviced input wire [LOG2_CLIENTS -1:0] last; // Client requests:- input wire [CLIENTS -1:0] requests; // Next client to be serviced: - output reg [LOG2_CLIENTS -1:0] next; //Schedule the next client in a round robin fashion, based on the previous always @* begin integer J, K; begin : find_next next = last; // Default to staying with the previous for (J = 1; J < CLIENTS; J=J+1) begin K = (last + J) % CLIENTS; if (requests[K] == 1'b1) begin next = K[0 +: LOG2_CLIENTS]; disable find_next; end end // of the for-loop end // of 'find_next' end endmodule
The Retiming Summary report identifies insufficient registers limiting Hyper-Retiming on the critical chain. The chain starts from the register that connects to the last input, through the modulus operator implemented using a divider, and continuing to the register that connects to the next output.
The 44 elements in the critical chain above correspond to the circuit diagram below that has 10 levels of logic. The modulus operator contributes significantly to the low performance. Seven of the 10 levels of logic are part of the implementation for the modulus operator.
As Figure 125 shows, Fast Forward compilation estimates a 140% performance improvement from adding two pipeline stages at the module inputs, for retiming through the logic cloud. At this point, the critical chain is a short path/long path and the chain involves the modulus operator.
The divider in the modulus operation is the bottleneck that requires RTL modification. Paths through the divider exist in the critical chain for all steps in the Fast Forward compile. Consider alternate implementations to calculate the next client to service, and avoid the modulus operator. If you switch to an implementation that specifies the number of clients as a power of two, determining the next client to service does not require a modulus operator. When you instantiate the module with fewer than 2n clients, tie the unused request inputs to logic 0.
Source Code for Round Robin Scheduler with Performance Improvement with 2n Client Inputs
module round_robin_modulo (last, requests, next); parameter LOG2_CLIENTS = 3; parameter CLIENTS = 2**LOG2_CLIENTS; // previous client to be serviced input wire [LOG2_CLIENTS -1:0] last; // Client requests:- input wire [CLIENTS -1:0] requests; // Next client to be serviced: - output reg [LOG2_CLIENTS -1:0] next; //Schedule the next client in a round robin fashion, based on the previous always @(next or last or requests) begin integer J, K; begin : find_next next = last; // Default to staying with the previous for (J = 1; J < CLIENTS; J=J+1) begin K = last + J; if (requests[K[0 +: LOG2_CLIENTS]] == 1'b1) begin next = K[0 +: LOG2_CLIENTS]; disable find_next; end end // of the for-loop end // of 'find_next' end endmodule
Even without any Fast Forward optimizations (the Base Performance step), this round robin implementation runs at double the frequency compared with the version without the performance improvement in Source Code for Round Robin Scheduler. Although critical chains in both versions contain only two registers, the critical chain for the 2n version contains only 26 elements, compared to 44 elements in the modulus version.
The 26 elements in the critical chain correspond to the following circuit diagram with only four levels of logic.
By adding three register stages at the input, for retiming through the logic cloud, Fast Forward Compile takes the circuit performance to 1 GHz. Similar to the modulus version, the final critical chain after Fast Forward optimization has a limiting reason of short path/long path, as Figure 132 shows, but the performance is 1.6 times the performance of the modulus version
Removing the modulus operator, and switching to a power-of-two implementation, are small design changes that provide a dramatic performance increase.
- Use natural powers of two for math operations whenever possible
- Explore alternative implementations for seemingly basic functions.
In this example, changing the implementation of the round robin logic provides more performance increase than adding pipeline stages.