# **NVIDIA Ampere GA102**

Boyuan Tian, Vincent Wells, Katherine Yun

### **Applications benefit from GPU**




## GPU vs. CPU



Credit : https://sites.google.com/site/daveshshingari/explorations/computer-architecture/gpu-architecture

#### **Heterogeneous architecture**



Credit : https://sites.google.com/site/daveshshingari/explorations/computer-architecture/gpu-architecture

#### **Overview**

• 28.3 billion transistors

• Die area: 628.4 mm<sup>2</sup>

• Samsung 8 nm process



• Graphics Processing Clusters (GPCs)



- Texture Processing Clusters (TPCs)
  - 6 per GPC
  - 41 per GPU (RTX 3090; 42 on the full die)



- Streaming Multiprocessors (SMs)
  - 2 per TPC
  - 82 per GPU (RTX 3090; 84 on the full die)



- Shared resources
  - L0 i-Cache
  - Warp scheduler
  - Dispatch
  - Register files

[Figure: SM block diagram. Four processing blocks, each with an L0 i-cache, warp scheduler, and dispatch unit (32 threads/clk); a register file (16,384 x 32-bit); one FP32/INT32 datapath and one FP32 datapath; a 3rd-generation Tensor Core; four LD/ST units; and an SFU. The blocks share a 128 KB L1 data cache / shared memory, four texture units, and a 2nd-generation RT Core.]

#### • CUDA cores

- Arithmetic operations
- **128 per SM**
- 10,496 per GPU (RTX 3090)

#### • Tensor cores

- Matrix-matrix multiplication
- 4 per SM
- 328 per GPU (RTX 3090)

#### • RT cores

- Ray tracing
- 1 per SM
- 82 per GPU (RTX 3090)

[Figure: datapath array with two INT32 columns and two FP32 columns.]



### **CUDA core**

• Fully pipelined ALUs and FPUs

- Ampere
  - 64 FP32/INT32 + 64 FP32 cores per SM
- Volta, Turing
  - 64 INT32 + 64 FP32 cores per SM
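The datapath split above can be sketched as a small model. This is only an illustration: the class and field names are hypothetical, and it counts lanes per SM per clock without modeling scheduling or instruction mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMDatapaths:
    """Per-SM lane counts (hypothetical model, not an NVIDIA API)."""
    fp32_only: int       # lanes that execute only FP32
    int32_only: int      # lanes that execute only INT32
    fp32_or_int32: int   # lanes that execute either type each clock

    def peak_fp32_lanes(self) -> int:
        # Best case for FP32: the shared lanes all run FP32 work.
        return self.fp32_only + self.fp32_or_int32

    def peak_int32_lanes(self) -> int:
        # Best case for INT32: the shared lanes all run INT32 work.
        return self.int32_only + self.fp32_or_int32


# Lane counts from the slide.
AMPERE = SMDatapaths(fp32_only=64, int32_only=0, fp32_or_int32=64)
TURING = SMDatapaths(fp32_only=64, int32_only=64, fp32_or_int32=0)

# 82 SMs x 128 FP32-capable lanes matches the CUDA core count quoted earlier.
cuda_cores_82_sms = 82 * AMPERE.peak_fp32_lanes()
```

With a pure FP32 workload this model gives Ampere twice the FP32 lanes of Turing per SM; with INT32 work on the shared path, both expose 64 FP32 + 64 INT32 lanes.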

[Figure: one SM processing block. L0 i-cache, warp scheduler, and dispatch unit (32 threads/clk); register file (16,384 x 32-bit); FP32/INT32 and FP32 datapaths; 3rd-generation Tensor Core; four LD/ST units; SFU.]

#### **Tensor core**

- 4 x 4 matrix multiplication
- Multiply-accumulate (MAC) operation:
  - 128 operations in total = 64 multiplications + 64 accumulations

$$
D =
\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}
+
\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}
$$



Credit : Raihan, Md Aamir, Negar Goli, and Tor M. Aamodt. "Modeling deep learning accelerator enabled GPUs." 2019 ISPASS
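To check the operation count, here is a minimal scalar sketch of the 4 x 4 multiply-accumulate D = A x B + C. The function name is hypothetical, and real Tensor Cores perform this in hardware rather than in a loop; the sketch only counts the arithmetic.

```python
def tensor_core_mma_4x4(A, B, C):
    """Compute D = A @ B + C for 4x4 matrices and count the operations."""
    n = 4
    D = [[0] * n for _ in range(n)]
    multiplications = 0
    accumulations = 0
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                  # start from the accumulator input
            for k in range(n):
                acc += A[i][k] * B[k][j]   # one multiply and one accumulate
                multiplications += 1
                accumulations += 1
            D[i][j] = acc
    return D, multiplications, accumulations


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
B = [[4 * i + j for j in range(4)] for i in range(4)]
C = [[1] * 4 for _ in range(4)]
D, mults, accs = tensor_core_mma_4x4(identity, B, C)
```

Each of the 16 output elements needs 4 multiplications and 4 accumulations, which gives the 64 + 64 = 128 operations on the slide.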

#### **Tensor core**

- New feature in Ampere:
  - Sparsity
  - 2x Tensor Core throughput
  - ~2x reduction in weight footprint and bandwidth
  - Almost no loss in inference accuracy
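A minimal sketch of where the ~2x figures come from, assuming the 2:4 fine-grained structured sparsity pattern that Ampere uses (two non-zero weights kept in every group of four). Keeping the two largest magnitudes is an assumption made for illustration, and the function names are hypothetical.

```python
def prune_2_of_4(weights):
    """Keep 2 of every 4 weights; return the kept values and their in-group indices."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Assumption: keep the two largest-magnitude weights in the group.
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(by_magnitude[:2]):
            values.append(group[i])
            indices.append(i)
    return values, indices


def sparse_dot(values, indices, activations):
    """Dot product that reads only the activations selected by the stored indices."""
    total = 0
    for n, (value, index) in enumerate(zip(values, indices)):
        group = n // 2                      # two kept weights per group of four
        total += value * activations[group * 4 + index]
    return total


weights = [1, -5, 0, 3, 2, 2, -7, 0]
values, indices = prune_2_of_4(weights)
result = sparse_dot(values, indices, [1] * 8)
```

Only half of the weights are stored and multiplied, which is the source of the halved footprint and the doubled math throughput; the small index metadata is the extra cost.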



[Figure: Sparse Tensor Core with input activations.]







- Ray tracing:
  - Simulates lighting realistically
  - Physically correct
- Basic ray tracing
- Optimizations
  - Accelerate intersection testing
  - Reduce the mesh search cost







• Ray tracing with RT cores

- Dedicated hardware
  - Box intersection checking
  - Triangle intersection checking
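As a software illustration of box intersection checking, here is the classic slab test for a ray against an axis-aligned bounding box. It is a sketch of the general technique, not a description of how the RT core implements it; the function name is hypothetical.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t * direction (t >= 0) hit the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0 = (lo - o) / d
        t1 = (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of t values that lie inside every slab so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


unit_box = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
hit = ray_hits_box((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), *unit_box)
miss_behind = ray_hits_box((0.0, 0.0, -5.0), (0.0, 0.0, -1.0), *unit_box)
miss_offset = ray_hits_box((3.0, 0.0, -5.0), (0.0, 0.0, 1.0), *unit_box)
```

A ray that misses a box can skip every triangle inside it, which is why cheap box tests come before the more expensive triangle tests.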



- New feature on Ampere
  - RT cores and Tensor cores can run concurrently



# Memory Hierarchy

- 7 Graphics Processing Clusters (GPCs)
  - L2 cache (6144 KB)
  - 12 32-bit memory controllers
    - Each paired with 512 KB of L2 cache
- 84 Streaming Multiprocessors (SMs) on the full GA102 die
  - Combined L1 data cache/shared memory (128 KB)
    - Increased by 33% compared to Turing
    - Configurable based on compute workloads
  - Each SM has 4 processing blocks (partitions)
    - Register file (64 KB)
    - L0 instruction cache
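The sizes above follow from a few multiplications, sketched here as a quick check. The 96 KB Turing figure is not on the slide and is included as an assumption to reproduce the 33% increase.

```python
KB = 1024

# Figures from the slide.
MEMORY_CONTROLLERS = 12
CONTROLLER_WIDTH_BITS = 32
L2_PER_CONTROLLER_KB = 512
SM_COUNT = 84
L1_PER_SM_KB = 128
BLOCKS_PER_SM = 4
REGISTERS_PER_BLOCK = 16_384   # 32-bit registers per processing block
REGISTER_BYTES = 4

# Assumption (not on the slide): Turing has 96 KB of combined L1/shared memory per SM.
TURING_L1_PER_SM_KB = 96

l2_total_kb = MEMORY_CONTROLLERS * L2_PER_CONTROLLER_KB
memory_bus_bits = MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS
register_file_kb = REGISTERS_PER_BLOCK * REGISTER_BYTES // KB
registers_per_sm_kb = BLOCKS_PER_SM * register_file_kb
l1_total_kb = SM_COUNT * L1_PER_SM_KB
l1_increase_pct = round((L1_PER_SM_KB / TURING_L1_PER_SM_KB - 1) * 100)
```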

[Figure: SM block diagram with four processing blocks (each with an L0 i-cache, warp scheduler, dispatch unit, and 16,384 x 32-bit register file), the shared 128 KB L1 data cache / shared memory, four texture units, and the 2nd-generation RT Core.]

# L1 Data Cache/Shared Memory

- SM-level memory
  - Accessible by threads within an SM
- Unified architecture for shared memory, L1 data cache, and texture caching
- Workload-based reconfiguration
  - Up to 128 KB per SM

# L1 Data Cache/Shared Memory Cont'd

- Configurations supported (compute mode)
  - 128 KB L1 + 0 KB shared memory
  - ...
  - 64 KB L1 + 64 KB shared memory
  - 28 KB L1 + 100 KB shared memory
- Graphics workloads and async compute
  - 64 KB L1 data/texture cache (32 KB on Turing)
  - 48 KB shared memory
- Double the shared memory bandwidth of Turing
  - 128 bytes/clock/SM
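A small sketch of workload-based reconfiguration, using only the three compute-mode splits spelled out above (the elided ones are left out). The selection rule and the function name are hypothetical; on real hardware the driver makes this choice.

```python
# (L1 KB, shared memory KB) pairs listed on the slide; elided entries are omitted.
LISTED_CONFIGS_KB = [(128, 0), (64, 64), (28, 100)]


def pick_config(shared_needed_kb):
    """Pick the listed split with the least shared memory that still fits the request."""
    candidates = [cfg for cfg in LISTED_CONFIGS_KB if cfg[1] >= shared_needed_kb]
    if not candidates:
        raise ValueError("request exceeds the largest shared memory configuration")
    # Less shared memory leaves more of the 128 KB for the L1 data cache.
    return min(candidates, key=lambda cfg: cfg[1])


no_shared = pick_config(0)
medium = pick_config(48)
large = pick_config(100)
```

Every listed split sums to the same 128 KB per SM, so asking for more shared memory always comes at the cost of L1 capacity.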

# **GDDR6X Memory**

- New to Ampere-family GPUs
  - Builds on the GDDR6 memory standard from 2018
- Peak memory bandwidth of 936 GB/sec with PAM4 signaling
  - Doubles the I/O data transfer rate
  - Sends two bits on each clock edge (both rising and falling edges, as in DDR)
  - Four voltage levels separated by 250 mV steps
    - Each level encodes one two-bit symbol: 00, 01, 10, 11
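A minimal sketch of PAM4 symbol encoding and of the peak bandwidth arithmetic. The bit-to-level mapping and the absolute voltages are assumptions for illustration (only the 250 mV step is from the slide), and the 19.5 Gbit/s per-pin rate is the RTX 3090 figure, which is not stated above. The 384-bit bus follows from the twelve 32-bit memory controllers.

```python
# Assumed mapping: each 2-bit symbol selects one of four levels, 250 mV apart.
PAM4_LEVELS_MV = {"00": 0, "01": 250, "10": 500, "11": 750}


def pam4_encode(bits):
    """Encode a bit string as PAM4 voltage levels, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS_MV[bits[i:i + 2]] for i in range(0, len(bits), 2)]


def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s: pins x per-pin rate, converted from bits to bytes."""
    return bus_width_bits * gbit_per_s_per_pin / 8


symbols = pam4_encode("00011011")
peak = peak_bandwidth_gb_per_s(bus_width_bits=12 * 32, gbit_per_s_per_pin=19.5)
```

Eight bits fit in four symbols, so the same symbol rate carries twice the data of two-level signaling.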



[Figure: GDDR6 signaling compared with the new GDDR6X signaling: 4-level "PAM4" with 250 mV voltage steps.]



# **RTX IO**

- Gen4 SSDs with up to 7 GB/sec read bandwidth
- CPU-side file systems become a bottleneck when loading game data into memory
- GPU-based lossless decompression
  - Data read from storage stays compressed and is delivered to the GPU for decompression
  - Moves the decompression load from the CPU to the GPU
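A rough sketch of why keeping data compressed on the way to the GPU helps. The 2:1 compression ratio is an assumption, and the model assumes GPU decompression keeps up with the SSD; the function names are hypothetical.

```python
def effective_read_bandwidth(ssd_gb_per_s, compression_ratio):
    """Uncompressed GB delivered per second when the SSD streams compressed data."""
    if compression_ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    return ssd_gb_per_s * compression_ratio


def load_time_seconds(uncompressed_gb, ssd_gb_per_s, compression_ratio):
    """Time to load an asset set, assuming decompression is not the bottleneck."""
    return uncompressed_gb / effective_read_bandwidth(ssd_gb_per_s, compression_ratio)


# 7 GB/sec SSD from the slide, with an assumed 2:1 compression ratio.
effective = effective_read_bandwidth(7, 2)
seconds = load_time_seconds(28, 7, 2)
```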



#### **Memory Hierarchy Overview**



### **Parallelism Support**

- CUDA task graphs
  - Dependency graph of GPU operations
  - Enable a "define-once/run-repeatedly" execution flow
  - Typically expose many independent operations that can run in parallel on the available cores
  - A100 adds hardware features to accelerate traversing a task graph
- MIG can divide a GPU into GPU instances that run in parallel
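The "define-once/run-repeatedly" flow can be sketched in plain Python with the standard-library `graphlib` module. This is an analogy, not the CUDA Graphs API: the `TaskGraph` class and its method names are hypothetical, and operations within a wave run sequentially here, whereas a GPU could run them in parallel.

```python
from graphlib import TopologicalSorter


class TaskGraph:
    """Hypothetical stand-in for a task graph: define once, launch many times."""

    def __init__(self):
        self._deps = {}
        self._ops = {}

    def add_node(self, name, op, deps=()):
        self._ops[name] = op
        self._deps[name] = set(deps)

    def instantiate(self):
        """Resolve dependencies once into waves of mutually independent nodes."""
        sorter = TopologicalSorter(self._deps)
        sorter.prepare()
        waves = []
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            sorter.done(*ready)
        return waves

    def launch(self, waves, state):
        """Replay the precomputed schedule without re-resolving dependencies."""
        for wave in waves:
            for name in wave:
                self._ops[name](state)


graph = TaskGraph()
graph.add_node("load_a", lambda s: s.update(a=s["x"] + 1))
graph.add_node("load_b", lambda s: s.update(b=s["x"] * 2))
graph.add_node("combine", lambda s: s.update(out=s["a"] + s["b"]),
               deps=["load_a", "load_b"])

waves = graph.instantiate()          # define once
results = []
for x in (1, 2, 3):                  # run repeatedly
    state = {"x": x}
    graph.launch(waves, state)
    results.append(state["out"])
```

The dependency analysis happens once in `instantiate`; each launch only replays the schedule, which is the cost the define-once model is meant to amortize.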

#### Multi-Instance GPU

- Multi-Instance GPU (MIG)
  - New feature that allows the GPU to be partitioned into as many as 7 separate GPU instances
  - Each instance has its own path through the entire memory system (on-chip crossbar ports, L2 cache banks, memory controllers, DRAM address buses)
  - Especially useful for Cloud Service Providers

#### Multi-Instance GPU Example



### Multi-GPU

- 3rd Generation NVLink
  - Interconnects multiple GPUs on a node using NVSwitch
    - ~2x faster than the previous generation
  - Allows for up to 600 GB/sec total bandwidth across 12 links on a given A100 GPU
    - ~10x faster than PCIe Gen4
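The ratios above can be checked with a little arithmetic. The per-link rate, the V100 link count, and the PCIe Gen4 x16 figure are assumptions not stated on the slide: 50 GB/sec per NVLink link (both directions combined), 6 links on V100, and roughly 64 GB/sec bidirectional for PCIe Gen4 x16.

```python
# From the slide.
A100_NVLINK_LINKS = 12

# Assumptions (not on the slide).
GB_PER_S_PER_LINK = 50          # bidirectional, per NVLink link
V100_NVLINK_LINKS = 6           # previous-generation NVLink on V100
PCIE_GEN4_X16_GB_PER_S = 64     # approximate bidirectional bandwidth

a100_total = A100_NVLINK_LINKS * GB_PER_S_PER_LINK
v100_total = V100_NVLINK_LINKS * GB_PER_S_PER_LINK
speedup_vs_previous = a100_total / v100_total
speedup_vs_pcie = a100_total / PCIE_GEN4_X16_GB_PER_S
```

Under these assumptions the PCIe ratio works out to a little over 9x, consistent with the rounded ~10x on the slide.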



#### Multi-GPU Example



Note the third-generation NVLink connectivity through NVSwitches.

#### Multi-GPU Performance

DGX A100: 8x A100 using TF32 precision



DGX A100: 8x A100 using INT8 with Structural Sparsity

#### **Multi-Node Parallelism**

- NVIDIA Magnum IO & Mellanox Interconnect Solutions
  - Full support for NVIDIA Magnum IO APIs, which maximize IO performance on multi-GPU, multi-node systems
  - Compatible with Mellanox InfiniBand and Ethernet connections
  - Supports PCIe Gen 4 with SR-IOV, which allows it to support faster network interface cards such as the 200 Gbit/sec Mellanox ConnectX-6 VPI HDR InfiniBand

# Any questions?

#### References

"NVIDIA A100 Tensor Core GPU Architecture." NVIDIA, 2020.

J. Choquette and W. Gandhi, "NVIDIA A100 GPU: Performance & Innovation for GPU Computing," 2020 IEEE Hot Chips 32 Symposium (HCS), Palo Alto, CA, USA, 2020, pp. 1-43, doi: 10.1109/HCS49909.2020.9220622.

Y. Tsai, T. Cojean, and H. Anzt, "Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse Linear Algebra Computations," ArXiv, 2020, abs/2008.08478.

"NVIDIA Ampere GA102 GPU Architecture." NVIDIA, 2020.

"NVIDIA Turing GPU Architecture." NVIDIA, 2018.

"CUDA C++ Programming Guide". NVIDIA, 2020

Raihan, Md Aamir, Negar Goli, and Tor M. Aamodt. "Modeling deep learning accelerator enabled GPUs." 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019.

EECS 573 Microarchitecture Data Parallel Architectures: GPUs, Todd Austin, University of Michigan, 2014

Advanced Computer Architecture, Data-Level Parallel Architectures: GPUs, Paul Kelly, ICL, 2019

# Backup slides

## SM deep dive







#### **Tensor core**

• Mixed-precision Operation



- Ray tracing:
  - Simulates lighting realistically
  - Physically correct
- Basic ray tracing
- Optimizations
  - Accelerate intersection testing
  - Reduce the number of intersection tests per ray
    - Bounding volume hierarchy



# A100 L2 Cache Memory

- The A100 Tensor Core GPU includes 40 MB of L2 cache
  - 6.7x larger than the Tesla V100 L2 cache
  - L2 cache is divided into two partitions to enable higher bandwidth
    - Each partition is divided into 40 L2 cache slices
    - Eight 512 KB L2 slices are associated with each memory controller
- Compute Data Compression
  - Saves up to 4x DRAM read/write bandwidth
  - Saves up to 4x L2 read bandwidth and up to 2x L2 capacity
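The slice counts above multiply out to the stated capacity, as this quick check shows. The 6 MB Tesla V100 L2 size is an assumption not stated on the slide, used only to reproduce the 6.7x ratio.

```python
# Figures from the slide.
PARTITIONS = 2
SLICES_PER_PARTITION = 40
SLICE_KB = 512
SLICES_PER_CONTROLLER = 8

# Assumption (not on the slide): Tesla V100 has 6 MB of L2 cache.
V100_L2_MB = 6

total_slices = PARTITIONS * SLICES_PER_PARTITION
l2_total_mb = total_slices * SLICE_KB // 1024
memory_controllers = total_slices // SLICES_PER_CONTROLLER
ratio_vs_v100 = round(l2_total_mb / V100_L2_MB, 1)
```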



# A100 HBM2 DRAM Subsystem

- Memory stacks are located on the same physical package as the GPU
- A100 GPU includes 40 GB of fast HBM2 DRAM
- HBM2 delivers 1555 GB/sec memory bandwidth
  - With a 1215 MHz (DDR) data rate
  - 1.7x higher than Tesla V100
- Error Correction Code (ECC)
  - Provides higher reliability for compute applications that are sensitive to data corruption
  - Important in large-scale cluster environments
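The bandwidth figure can be reproduced from the clock rate. The 5120-bit memory bus width and the 900 GB/sec Tesla V100 bandwidth are assumptions not stated on the slide.

```python
# From the slide.
MEMORY_CLOCK_MHZ = 1215
TRANSFERS_PER_CLOCK = 2          # DDR: data on both clock edges

# Assumptions (not on the slide).
BUS_WIDTH_BITS = 5120            # A100 HBM2 memory bus width
V100_BANDWIDTH_GB_PER_S = 900    # Tesla V100 memory bandwidth

# MHz x transfers x bits / 8 gives megabytes per second.
bandwidth_mb_per_s = MEMORY_CLOCK_MHZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BITS // 8
bandwidth_gb_per_s = bandwidth_mb_per_s / 1000
ratio_vs_v100 = round(bandwidth_gb_per_s / V100_BANDWIDTH_GB_PER_S, 1)
```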

### Parallelism Support -- Multi-Instance GPU



Example of multiple independent GPU Compute workloads running in parallel using a MIG configuration on an A100 GPU with three GPU Instances and variable numbers of Compute Instances within each GPU Instance.