Introduction
The GPU memory hierarchy is increasingly becoming an area of interest for deep learning researchers and practitioners alike. By building an intuition about the memory hierarchy, developers can minimize memory access latency, maximize memory bandwidth, and reduce power consumption, leading to shorter processing times, faster data transfer, and cost-effective compute usage. A thorough understanding of memory architecture will enable developers to achieve peak GPU capabilities at scale.
CUDA Refresher
CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA for programming GPUs.
The execution of a CUDA program begins when the host code (CPU serial code) calls a kernel function. This function call launches a grid of threads on a device (GPU) to process different data elements in parallel.
A thread comprises the program's code, the current execution point in the code, as well as the values of its variables and data structures. A group of threads forms a thread block, and a group of thread blocks constitutes the CUDA kernel grid. The software components, threads and thread blocks, correspond directly to their hardware analogs: the CUDA core and the CUDA Streaming Multiprocessor (SM).
All together, these make up the constituent parts of the GPU.
Threads are organized into blocks and blocks are organized into grids. Figure taken from the NVIDIA Technical Blog.
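To make the host/kernel/grid relationship concrete, here is a minimal sketch (a hypothetical vecAdd example, not taken from the sources above): the host allocates data, launches a grid of thread blocks, and each thread uses its block and thread indices to pick one element to process.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index within the grid
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed (unified) memory keeps the example short; explicit cudaMalloc/cudaMemcpy also works.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    // Host code launches a grid of thread blocks on the device.
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```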
Figure taken from NVIDIA H100 White Paper.
H100s introduce a new Thread Block Cluster architecture, extending the GPU's physical programming hierarchy to now include Threads, Thread Blocks, Thread Block Clusters, and Grids.
CUDA Memory Types
The memory storage types used by a CUDA device vary in their accessibility and duration. When a CUDA programmer assigns a variable to a specific CUDA memory type, they dictate how the variable is accessed, the speed at which it's accessed, and the extent of its visibility.
Here's a quick overview of the different memory types:
Figure taken from Chapter 5 of the 4th edition of the textbook, Programming Massively Parallel Processors.
Register memory is private to each thread. This means that when that particular thread ends, the data in that register is lost.
Local memory is also private to each thread, but it's slower than register memory.
Shared memory is accessible to all threads in the same block and lasts for the block's lifetime.
Global memory holds data that lasts for the duration of the grid/host. All threads and the host have access to global memory.
Constant memory is read-only and designed for data that does not change for the duration of the kernel's execution.
Texture memory is another read-only memory type ideal for physically adjacent data access. Its use can mitigate memory traffic and increase performance compared to global memory.
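Most of these memory types map to simple declaration qualifiers in CUDA C++. The sketch below is illustrative only (the kernel and variable names are made up for this example); texture memory is omitted since it goes through the separate texture-object API.

```cpp
#include <cuda_runtime.h>

__constant__ float coeffs[4];          // constant memory: read-only during kernel execution
__device__   float globalScale = 1.0f; // global memory: visible to all threads and the host API

__global__ void memoryTypesDemo(const float *in, float *out, int n) {
    __shared__ float tile[256];        // shared memory: visible to all threads in this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 0.0f;                    // local scalar: normally held in a register, private to the thread

    if (i < n) {
        tile[threadIdx.x] = in[i];     // global -> shared
    }
    __syncthreads();                   // make the tile visible to the whole block
    if (i < n) {
        x = tile[threadIdx.x] * coeffs[0] * globalScale;
        out[i] = x;                    // register -> global
    }
}
```

Note that large or dynamically indexed thread-private arrays may be placed in the slower local memory rather than registers.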
GPU Memory Hierarchy
The Speed-Capacity Tradeoff
It is important to understand that, with respect to memory access efficiency, there is a tradeoff between bandwidth and memory capacity. Higher speed is correlated with lower capacity.
Registers
Registers are the fastest memory components on a GPU, comprising the register file that supplies data directly into the CUDA cores. A kernel function uses registers to store variables that are private to the thread and accessed frequently.
Both registers and shared memory are on-chip memories, where variables residing in these memories can be accessed at very high speeds in a parallel manner.
By leveraging registers effectively, data reuse can be maximized and performance can be optimized.
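As a small illustration of this point (a hypothetical kernel, assuming a row-major n x n matrix): accumulating into a thread-private scalar, which the compiler will normally keep in a register, replaces repeated global-memory writes inside the loop with a single write at the end.

```cpp
// Each thread sums one row of an n x n row-major matrix.
__global__ void rowSums(const float *mat, float *sums, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    float acc = 0.0f;                  // thread-private accumulator, kept in a register
    for (int col = 0; col < n; ++col) {
        acc += mat[row * n + col];     // one global read per element...
    }
    sums[row] = acc;                   // ...but only one global write per row
}
```

Writing to sums[row] inside the loop instead would issue n global-memory stores per thread rather than one.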
Cache Levels
Multiple levels of caches exist in modern processors. The distance to the processor is reflected in the way these caches are numbered.
L1 Cache
L1 or level 1 cache is attached to the processor core directly. It functions as a backup storage area when the amount of active data exceeds the capacity of an SM's register file.
L2 Cache
L2 or level 2 cache is larger and often shared across SMs. Unlike the L1 cache(s), there is only one L2 cache.
Constant Cache
Constant cache captures frequently used variables for each kernel, leading to improved performance.
When designing memory systems for massively parallel processors, there will be constant memory variables. Rewriting these variables would be redundant and pointless. Thus, a specialized memory system like the constant cache eliminates the need for computationally costly hardware logic.
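A common pattern, sketched below with a made-up 3-point stencil, is to copy such kernel-wide read-only values into __constant__ memory once from the host with cudaMemcpyToSymbol; every thread then reads them through the constant cache instead of global memory.

```cpp
#include <cuda_runtime.h>

__constant__ float filter[3];          // 3-point stencil weights, identical for every thread

__global__ void applyFilter(const float *in, float *out, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < 1 || x >= width - 1) return;
    // Every thread reads the same weights; the constant cache serves these reads.
    out[x] = filter[0] * in[x - 1] + filter[1] * in[x] + filter[2] * in[x + 1];
}

void setFilter(const float *hostWeights) {
    // The host writes constant memory once, before launching the kernel;
    // on the device it is read-only for the kernel's duration.
    cudaMemcpyToSymbol(filter, hostWeights, 3 * sizeof(float));
}
```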
New Memory Features with H100s
NVIDIA Hopper Streaming Multiprocessor. Figure taken from NVIDIA H100 White Paper.
Hopper, through its H100 line of GPUs, introduced new features to augment its performance compared to previous NVIDIA micro-architectures.
Thread Block Clusters
As mentioned earlier in the article, Thread Block Clusters debuted with H100s, expanding the CUDA programming hierarchy. A Thread Block Cluster allows for greater programmatic control over a larger group of threads than is permissible with a Thread Block on a single SM.
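On the programming-model side, a minimal sketch looks roughly like the following (assuming CUDA 12+ and a compute capability 9.0 target; the __cluster_dims__ attribute and cooperative-groups cluster_group are described in the CUDA Programming Guide, and the kernel itself is hypothetical):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile-time cluster shape: every 2 consecutive thread blocks form one cluster.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float *data) {
    cg::cluster_group cluster = cg::this_cluster();

    // ... each block works on its own portion of data ...

    // Synchronize all threads of all blocks in the cluster,
    // even though those blocks may be scheduled on different SMs.
    cluster.sync();
}
```

Per the CUDA Programming Guide, the cluster shape can also be chosen at launch time through cudaLaunchKernelEx with the cluster-dimension launch attribute instead of the compile-time attribute shown here.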
Asynchronous Execution
The latest advancements in asynchronous execution introduce a Tensor Memory Accelerator (TMA) and an Asynchronous Transaction Barrier into the Hopper architecture.
The Tensor Memory Accelerator (TMA) unit allows for the efficient transfer of large blocks of data between global and shared memory.
The Asynchronous Transaction Barrier allows for the synchronization of CUDA threads and on-chip accelerators, regardless of whether they are physically located on separate SMs.
H100s incorporate both the Asynchronous Barriers introduced with the Ampere GPU architecture and the new Asynchronous Transaction Barriers.
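TMA itself is typically driven through lower-level interfaces or libraries, but the general pattern of asynchronous global-to-shared copies coordinated by an arrive/wait barrier can be sketched with libcudacxx's cuda::barrier and cuda::memcpy_async, as below. This is an illustrative sketch only (it assumes n is a multiple of the tile size and may need a diagnostic suppression for the dynamically initialized shared barrier), not Hopper-specific TMA code.

```cpp
#include <cooperative_groups.h>
#include <cuda/barrier>
namespace cg = cooperative_groups;

__global__ void asyncTileKernel(const float *global_in, float *global_out, int n) {
    __shared__ float tile[256];
    // Block-scoped arrive/wait barrier living in shared memory.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());          // one thread initializes the barrier
    }
    block.sync();

    // Start an asynchronous global -> shared copy for the whole block,
    // then wait on the barrier for its completion (assumes n % 256 == 0).
    cuda::memcpy_async(block, tile, global_in + blockIdx.x * 256,
                       sizeof(float) * 256, bar);
    bar.arrive_and_wait();

    int i = blockIdx.x * 256 + block.thread_rank();
    if (i < n) global_out[i] = 2.0f * tile[block.thread_rank()];
}
```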
Conclusion
Assigning variables to specific CUDA memory types allows a programmer to exercise precise control over their behavior. This designation not only determines how the variable is accessed, but also the speed at which this access occurs. Variables stored in memory types with faster access times, such as registers or shared memory, can be quickly retrieved, accelerating computation. In contrast, variables in slower memory types, such as global memory, are accessed at a slower rate. Additionally, memory type assignment influences the scope of the variable's usage and its interaction with other threads. The assigned memory type governs whether the variable is accessible to a single thread, a block of threads, or all threads within a grid. Finally, H100s, the current SOTA GPU for AI workflows, introduced several new features that influence memory access, such as Thread Block Clusters, the Tensor Memory Accelerator (TMA) unit, and Asynchronous Transaction Barriers.
References
Programming Massively Parallel Processors (4th edition)
Hopper Whitepaper
CUDA Refresher: The CUDA Programming Model | NVIDIA Technical Blog