2.1.1.2. Speed and Latency

Intel® Hyperflex™ Architecture High-Performance Design Handbook

ID 683353

Date 12/08/2023

Public

2.1.1.2. Speed and Latency

The following table shows how area grows with bus width for several types of circuits. The columns interleave circuit functions (mux, ripple adder, barrel shifter, crossbar) with reference big O growth rates of area as a function of bus width N, ranging from sub-linear, log(N), to super-linear, N*N.

Table 2. Effect of Bus Width on Area
	Circuit Function
Bus Width (N)	log N	Mux	Ripple Add	N*log N	Barrel Shift	Crossbar	N*N
16	4	5	16	64	64	80	256
32	5	11	32	160	160	352	1024
64	6	21	64	384	384	1344	4096
128	7	43	128	896	896	5504	16384
256	8	85	256	2048	2048	21760	65536
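
The columns of Table 2 can be reproduced with simple closed-form models. The following Python sketch is only an illustration, not the handbook's own method: the mux model ceil((N-1)/3) (a tree of 4:1 mux cells), the barrel shift model N*log2(N), and the crossbar model N times the mux area are assumptions that happen to match the published values, and all function names are hypothetical.

```python
def log2_int(n: int) -> int:
    """Exact log2 for a power-of-two bus width."""
    assert n > 0 and n & (n - 1) == 0, "bus width must be a power of two"
    return n.bit_length() - 1


def mux_area(n: int) -> int:
    """Assumed model: number of 4:1 mux cells in an n:1 mux tree, ceil((n-1)/3)."""
    return (n - 1 + 2) // 3


def table_row(n: int) -> dict:
    """Return one row of Table 2 for bus width n under the assumed models."""
    log_n = log2_int(n)
    return {
        "log N": log_n,
        "mux": mux_area(n),
        "ripple add": n,                # one adder cell per bit
        "N*log N": n * log_n,
        "barrel shift": n * log_n,      # log2(n) shift layers, n bits each (assumed)
        "crossbar": n * mux_area(n),    # one n:1 mux per output (assumed)
        "N*N": n * n,
    }


if __name__ == "__main__":
    for width in (16, 32, 64, 128, 256):
        print(width, table_row(width))
```

Under these assumptions the sketch returns the same numbers as Table 2 for every listed bus width, which makes the growth class of each column explicit.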

Typically, circuit components use more than twice the area when the bus width doubles. For a simple circuit like a mux, the area grows linearly or slightly sub-linearly as the bus width increases (in Table 2 the Mux column is roughly N/3), so cutting the bus width of a mux in half provides at best a linear area benefit. A ripple adder grows linearly as the bus width increases.

More complex circuits, like barrel shifters and crossbars, grow super-linearly as bus width increases. If you cut the bus width of a barrel shifter, crossbar, or other complex circuit in half, the resulting area can be significantly less than half of the original, with savings approaching quadratic rates. For components in which all inputs affect all outputs, increasing the bus width can cause quadratic growth. Consequently, if you use a clock speed-up to process the same data on half-width buses, you can expect a design with less than half the original area.
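
To make the area benefit of halving the bus width concrete, the following Python sketch computes the ratio of half-width area to full-width area directly from the values in Table 2. The dictionary layout and the `halving_ratio` helper are hypothetical names chosen for this illustration; only the numbers come from the table.

```python
# Bus widths and per-function area values copied from Table 2.
WIDTHS = (16, 32, 64, 128, 256)
TABLE_2 = {
    "mux":          (5, 11, 21, 43, 85),
    "ripple add":   (16, 32, 64, 128, 256),
    "barrel shift": (64, 160, 384, 896, 2048),
    "crossbar":     (80, 352, 1344, 5504, 21760),
    "N*N":          (256, 1024, 4096, 16384, 65536),
}


def area(function: str, width: int) -> int:
    """Look up the Table 2 area value for a circuit function at a bus width."""
    return TABLE_2[function][WIDTHS.index(width)]


def halving_ratio(function: str, width: int) -> float:
    """Area at width/2 divided by area at width (0.5 means exactly linear)."""
    return area(function, width // 2) / area(function, width)


if __name__ == "__main__":
    for name in TABLE_2:
        print(f"{name:12s} 256 -> 128: {halving_ratio(name, 256):.3f}")
```

A ratio of 0.5 corresponds to linear scaling. Values below 0.5, as for the barrel shift and crossbar columns, mean the half-width circuit uses less than half the area.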

When working with streaming datapaths, the number of registers is a fair approximation of the latency of the pipeline in bits. Reducing the width by half lets you double the number of pipeline stages without negatively impacting latency. In practice, reaching the higher clock speed generally requires significantly less than double the registering, which yields a net latency gain.
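
The latency argument can be checked with a little arithmetic. The following Python sketch compares a wide pipeline against half-width pipelines running at twice the clock rate. The `Pipeline` class and every number in it (64 bits, 4 stages, 250 MHz, and so on) are hypothetical values chosen for illustration and do not come from the handbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    width_bits: int     # datapath width
    stages: int         # number of pipeline register stages
    clock_mhz: float    # clock frequency in MHz

    @property
    def register_bits(self) -> int:
        """Register count, used as the approximation of latency in bits."""
        return self.width_bits * self.stages

    @property
    def throughput_mbps(self) -> float:
        """Bits moved per microsecond (width times clock rate in MHz)."""
        return self.width_bits * self.clock_mhz

    @property
    def latency_ns(self) -> float:
        """Time for one word to traverse all stages."""
        return self.stages * 1000.0 / self.clock_mhz


# Hypothetical baseline: wide and slow.
wide = Pipeline(width_bits=64, stages=4, clock_mhz=250.0)
# Half width, double clock, double stages: the break-even case.
narrow_even = Pipeline(width_bits=32, stages=8, clock_mhz=500.0)
# Half width, double clock, fewer than double the stages: a latency gain.
narrow_gain = Pipeline(width_bits=32, stages=6, clock_mhz=500.0)

if __name__ == "__main__":
    for name, p in (("wide", wide), ("even", narrow_even), ("gain", narrow_gain)):
        print(name, p.register_bits, p.throughput_mbps, p.latency_ns)
```

In this sketch the break-even pipeline matches the baseline in register bits, throughput, and latency, while the variant that needs only six stages keeps the throughput with fewer registers and lower latency.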
