7.3.3. Storing Variables and Arrays in Private Memory

Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176

Date 9/24/2018

Version 18.1

Public

7.3.3. Storing Variables and Arrays in Private Memory

The Intel® FPGA SDK for OpenCL™ Offline Compiler implements private memory using FPGA registers or block RAMs. The offline compiler analyzes the private memory accesses and promotes them to register accesses. It promotes most scalar variables, such as float, int, and char. It also promotes aggregate data types if all accesses to them are constant at compilation time. Typically, private memory is useful for storing single variables or small arrays. Registers are plentiful hardware resources in FPGAs, so it is almost always better to use private memory than other memory types. The kernel can access private memories in parallel, which allows them to provide more bandwidth than any other memory type (that is, global, local, and constant memories).
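
As a rough illustration of the access pattern described above, the following sketch models a single work-item kernel body in plain C so that it can be compiled and checked on a host. It is not taken from the guide. The function name moving_sum4, the window size TAPS, and the data layout are hypothetical. In an actual OpenCL kernel, the function would presumably carry the __kernel qualifier, the pointers would be __global, and the fixed-trip-count inner loops would typically be fully unrolled (for example with #pragma unroll) so that every index into the small array is a compile-time constant.

```c
#include <stddef.h>

#define TAPS 4  /* hypothetical window size, chosen only for illustration */

/* Plain-C model of a single work-item kernel body. In OpenCL C this
 * function would be declared __kernel and the pointers __global
 * (an assumption about how the sketch maps to a real kernel). */
void moving_sum4(const float *in, float *out, size_t n)
{
    /* Small private array. Both inner loops have a fixed trip count, so
     * after full unrolling every index is a compile-time constant, which
     * is the condition the text gives for promoting an aggregate. */
    float window[TAPS] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (size_t i = 0; i < n; ++i) {
        /* Shift the window by one element to make room for the new input. */
        for (int k = TAPS - 1; k > 0; --k) {
            window[k] = window[k - 1];
        }
        window[0] = in[i];

        /* Private scalar accumulator: the kind of variable the text says
         * is promoted in most cases. */
        float sum = 0.0f;
        for (int k = 0; k < TAPS; ++k) {
            sum += window[k];
        }
        out[i] = sum;
    }
}
```

Each output element is the sum of the current input and up to three preceding inputs. The point of the sketch is that the window array is read at four fixed positions per iteration, which is the kind of parallel access that the paragraph above attributes to private memory. If the array were indexed with a value known only at run time, the condition for promoting an aggregate would no longer hold.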

For more information on the implementation of private memory using registers, refer to the Inferring a Register section of the Standard Edition Programming Guide.

Related Information

Inferring a Register
