Intel® Iris® Xe GPU Architecture

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 12/16/2022

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Intel® Iris® Xe GPU Architecture

The Intel^® Iris^® X^e GPU family consists of a series of microarchitectures, ranging from integrated/low power (X^e-LP), to enthusiast/high performance gaming (X^e-HPG), data center/AI (X^e-HP) and high performance computing (X^e-HPC).

Intel® Iris® Xe family

This chapter introduces X^e GPU family microarchitectures and configuration parameters.

X^e-LP Execution Units (EUs)

An Execution Unit (EU) is the smallest thread-level building block of the Intel^® Iris^® X^e-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of a 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF) of 32B wide.

Xe-LP-EU

X^e-LP EU supports diverse data types FP16, INT16 and INT8 for AI applications. The Intel® GPU Compute Throughput Rates (Ops/clock/EU) table compares the the EU throughput rates of X^e-LP vs that of Intel^® Gen 11 GPUs.

Intel^® GPU Compute Throughput Rates (Ops/clock/EU)
	Intel^® Iris^® X^e-LP	Gen 11
FP32	8	8
FP16	16	16
INT32	8	4
INT16	16	8
INT8	32 (DP4A)	NA

X^e-LP Dual Subslices

Each X^e-LP Dual Subslice (DSS) consists of an EU array of 16 EUs, an instruction cache, a local thread dispatcher, Shared Local Memory (SLM), and a data port of 128B/cycle. It is called dual subslice because the hardware can pair two EUs for SIMD16 executions.

The SLM is a 128KB High Bandwidth Memory (HBM) accessible from the EUs in the subslice. One important usage of SLM is to share atomic data and signals among the concurrent work-items executing in a subslice. For this reason, if a kernel’s work-group contains synchronization operations, all work-items of the work-group must be allocated to a single subslice so that they have shared access to the same 128KB SLM. The work-group size must be chosen carefully to maximize the occupancy and utilization of the subslice. In contrast, if a kernel does not access SLM, its work-items can be dispatched across multiple subslices.

The following table summarizes the computing capacity of a subslice.

Subslice computing capacity
GPU Generation	EUs	Threads	Operations
Intel Iris Xe ICX	8
Intel Iris Xe-LP TGL	16

X^e-LP Slice

Each X^e-LP slice consists of six (dual) subslices for a total of 96 EUs, up to 16MB L3 cache, 128B/cycle bandwidth to L3 and 128B/cycle bandwidth to memory.

Xe-LP slice

Intel UHD Architecture Parameters across Generations

The following table summarizes the key architecture parameters in the current released products with Intel UHD Graphics:

Key architecture parameters, Intel UHD Graphics
Generations	Threads per VE/EU	VEs/EUs per X^e-core/SubSlice	X^e-cores/SubSlices	Total Threads	Total Operations
Gen9 (BDW)	7	8	3	168	1344
Intel Iris Xe ICL (Gen11)	7	8	8	448	3584
Intel Iris Xe-LP TGL (Gen12)	7	16	6	672	5376

X^e-Core

Unlike the X^e-LP and prior generations of Intel GPUs that used the Execution Unit (EU) as a compute unit, X^e-HPG and X^e-HPC use the X^e-core. This is similar to an X^e-LP dual subslice.

An X^e-core contains vector and matrix ALUs, which are referred to as vector and matrix engines.

An X^e-core of the X^e-HPC GPU contains 8 vector and 8 matrix engines, alongside a large 512KB L1 cache/SLM. It powers the Ponte Vecchio GPU. Each vector engine is 512 bit wide supporting 16 FP32 SIMD operations with fused FMAs. With 8 vector engines, the X^e-core delivers 512 FP16, 256 FP32 and 256 FP64 operations/cycle. Each matrix engine is 4096 bit wide. With 8 matrix engines, the X^e-core delivers 8192 int8 and 4096 FP16/BF16 operations/cycle. The X^e-core provides 1024B/cycle load/store bandwidth to the memory system.

Xe-core

X^e-Slice

An X^e-slice contains 16 X^e-core for a total of 8MB L1 cache, 16 ray tracing units and 1 hardware context.

Xe-slice

X^e-Stack

An X^e-stack contains up to 4 X^e-slice: 64 X^e-cores, 64 ray tracing units, 4 hardware contexts, 4 HBM2e controllers, 1 media engine, and 8 Xe-Link high speed coherent fabric. It also contains a shared L2 cache.

Xe-stack

X^e-HPC 2-Stack Ponte Vecchio GPU

An X^e-HPC 2-stack Ponte Vecchio GPU consists of 2 stacks:: 8 slices, 128 X^e-cores, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Xe-Links.

Xe-HPC 2-Stack

X^e-HPG GPU

X^e-HPG is the enthusiast or high performance gaming variant of the X^e architecture. The microarchitecture is focused on graphics performance and supports hardware-accelerated ray tracing.

An X^e-core of the X^e-HPG GPU contains 16 vector and 16 matrix engines. It powers the Intel^® Arc GPUs. Each vector engine is 256 bit wide, supporting 8 FP32 SIMD operations with fused FMAs. With 16 vector engines, the X^e-core delivers 256 FP32 operations/cycle. Each matrix engine is 1024 bit wide. With 16 matrix engines, the X^e-core delivers 4096 int8 and 2048 FP16/BF16 operations/cycle. The X^e-core provides 512B/cycle load/store bandwidth to the memory system.

An X^e-HPG GPU consists of 8 X^e-HPG-slice, which contains up to 4 X^e-HPG-cores for a total of 4096 FP32 ALU units/shader cores.

X^e- Intel^® Data Center GPU Flex Series

Intel^® Data Center GPU Flex Series (formerly codenamed Arctic Sound-M) come in two configurations. The 150W option has 32 X^e-cores on a PCIe Gen4 card. The 75W option has two GPUs for 16 X^e-cores (8 X^e-cores per GPU). Both configurations come with 4 X^e media engines, the industry’s first AV1 hardware encoder and accelerator for data center, GDDR6 memory, ray tracing units, and built-in XMX AI acceleration.

Intel^® Data Center GPU Flex Series are derivatives of the X^e-HPG GPUs. An Intel^® Data Center GPU Flex 170 consists of 4 X^e-HPG-slices for a total of 16 X^e-cores with 2048 FP32 ALU units/shader cores.

Intel® Data Center GPU Flex Series

Targeting data center cloud gaming, media streaming and video analytics applications, Intel^® Data Center GPU Flex Series provide hardware accelerated AV1 encoder, delivering a 30% bit-rate improvement without compromising on quality. It supports 8 simultaneous 4K streams or more than 30 1080p streams per card. AI models can be applied to the decoded streams utilizing Intel^® Data Center GPU Flex Series’ X^e-cores.

Media streaming and delivery software stacks lean on Intel^® oneVPL to decode and encode acceleration for all the major codecs including AV1. Media distributors can choose from the two leading media frameworks FFMPEG or GStreamer, both enabled for acceleration with oneVPL on Intel CPUs and GPUs.

In parallel to oneVPL accelerating decoding and encoding of media streams, oneDNN (oneAPI Deep Neural Network library) delivers AI optimized kernels enabled to accelerate inference modes in TensorFlow or PyTorch frameworks, or with the OpenVINO model optimizer and inference engine to further accelerate inference and speed customer deployment of their workloads.

Terminology and Configuration Summary

The following Architecture Terminology Changes table maps legacy GPU terminologies (used in Generation 9 through Generation 12 Intel^® Core^TM architectures) to their new names in the Intel^® Iris^® X^e GPU (Generation 12.7 and newer) architecture paradigm.

Architecture Terminology Changes
Old Term	New Intel Term	Generic Term	New Abbreviation
Execution Unit (EU)	X^e Vector Engine	Vector Engine	XVE
Systolic/”DPAS part of EU”	X^e Matrix eXtension	Matrix Engine	XMX
Subslice (SS) or Dual Subslice (DSS)	X^e-core	NA	XC
Slice	Render Slice / Compute Slice	Slice	SLC
Tile	Stack	Stack	STK

The following Xe Configurations table lists the hardware characteristics across the X^e family GPUs.

X^e Configurations
Architecture	X^e-LP (TGL)	X^e-HPG (Arc A770)	X^e-HPG (Data Center GPU Flex 170)	X^e-HPC (Ponte Vecchio 1 Stack)
Slice count	1	8	4	4
XC (DSS/SS) count	6	32	16	64
XVE (EU) / XC	16	16	16	8
XVE count	96	512	256	512
Threads / XVE	7	8	8	8
Thread count	672	4096	4096	4096
FLOPs / clk - single precision, MAD	1536	8192	8192	16384
FLOPs / clk - double precision, MAD	NA	NA	NA	16384
FLOPs / clk - FP16 DP4AS	NA	65536	65536	262144
GTI bandwidth bytes / unslice-clk	r:128, w:128	r:512, w:512	r:256, w:256	r:1024, w:1024
LL cache size	3.84MB	16MB	8MB	up to 204MB
SLM size
FMAD, SP (ops / XVE / clk)	8	8	8	16
SQRT, SP (ops / XVE / clk)	2	2	2	4

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide