



# **Techtonic** 2021

Partner

Disrupt

## Understanding the HW / SW Foundation Technologies for Hyperscale AI Research



#### Discussion

- Supercomputer Architecture for Hyperscale AI Research
- Distributed Training for Large-scale NLP Research on Supercomputer
- Next-generation Supercomputer System Architecture

## Supercomputer Architecture for Hyperscale AI Research

## NVIDIA A100 80GB GPU

Highest Performing AI Supercomputing GPU



## SXM FOR MULTI-GPU

Highest Performing AI Supercomputing GPU



- 2X Simulation: Quantum Espresso
- 2X Big Data Analytics: 10 TB Retail Benchmark
- AI Training: DLRM Recommender
- 1.25X MIG Inference: RNN-T Speech Recognition
- 1.25X Energy Efficiency: Shatters 25 GF/W



Speedups Normalized to Number of GPUs | Comparisons to A100 40GB | Measurements performed on DGX A100 servers | Training: DLRM, HugeCTR, Criteo Terabyte Click Logs (TIB) dataset, DGX A100: 146x A100 40GB vs. 8x A100 80GB, Normalized throughput=2.6X | Data Analytics: big data benchmark with RAPIDS (0.16), BlazingSQL (0.16), DASK (2.2.0), 30 analytical retail queries, ETL, ML, NLP, 96x A100 40GB vs. 48x A100 80GB, Normalized throughput=1.9X | HPC: Quantum Espresso - CNT10POR8, 40x A100 40GB vs. 20x A100 80GB, Normalized throughput=1.8X | AI Inference: RNN-T (MLPerf 0.7 Single stream latency), DGX A100: A100 40GB vs. A100 80GB on 1 MIG@10GB when configured for 7 MIG

## PCIE FOR SINGLE GPU

Highest Performing AI Supercomputing GPU

Flexible Deployment Option for Mainstream OEM Servers

Excellent Upgrade Path for V100 32GB PCIE Customers



A100 40GB PCIE and A100 80GB PCIE using GIGABYTE G482-Z52-00 AMD EPYC 7742@2.25GHz 3.4GHz Turbo (Rome) HT Off System memory 512GB @ 3.2 GHz; V100 32GB PCIE using SMC SYS-4029GP-TRT Gold 6240@2GHz 3.3GHz Turbo (Cascade Lake) HT On System memory 384GB @ 7.GHz; Driver R470. Chroma szscl21\_24\_128 Total Time (s) x1 GPU F922 NCCL 2.8.4 | DLRM Training 1 GPU BS 32768; PyTorch FP32/TF32; cuDNN 8.2.0.41; NCCL 2.9.6; DL 21.04 | BERT Large Inference TensorFlow FP32/TF32 BS 8 Sequence Length 384 XLA NGC 21.04 FP32/TF32 | CuFFT – NVLINK FP32; 1583x456384

## **BUT DATA AND MODEL SIZES ARE EXPLODING**

NVIDIA and Microsoft train 530B MT-NLG model using DeepSpeed and Megatron



| Dataset           | Dataset source             | Tokens (billions) | Weight (%) | Epochs |
|-------------------|----------------------------|-------------------|------------|--------|
| Books3            | Pile dataset               | 25.7              | 14.3       | 1.5    |
| OpenWebText2      | Pile dataset               | 14.8              | 19.3       | 3.6    |
| Stack Exchange    | Pile dataset               | 11.6              | 5.7        | 1.4    |
| PubMed Abstracts  | Pile dataset               | 4.4               | 2.9        | 1.8    |
| Wikipedia         | Pile dataset               | 4.2               | 4.8        | 3.2    |
| Gutenberg (PG-19) | Pile dataset               | 2.7               | 0.9        | 0.9    |
| BookCorpus2       | Pile dataset               | 1.5               | 1.0        | 1.8    |
| NIH ExPorter      | Pile dataset               | 0.3               | 0.2        | 1.8    |
| Pile-CC           | Pile dataset               | 49.8              | 9.4        | 0.5    |
| ArXiv             | Pile dataset               | 20.8              | 1.4        | 0.2    |
| GitHub            | Pile dataset               | 24.3              | 1.6        | 0.2    |
| CC-2020-50        | Common Crawl (CC) snapshot | 68.7              | 13.0       | 0.5    |
| CC-2021-04        | Common Crawl (CC) snapshot | 82.6              | 15.7       | 0.5    |
| RealNews          | RealNews                   | 21.9              | 9.0        | 1.1    |
| CC-Stories        | Common Crawl (CC) stories  | 5.3               | 0.9        | 0.5    |

https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

## **ISSUE:** LIMITED MEMORY SIZE IN BIG MODEL TRAINING

#### **NVIDIA Megatron**

| Model<br>size | Hidden size | Number of<br>layers | Number of<br>parameters<br>(billion) | Model-parallel<br>size | Number of<br>GPUs | Batch size | Achieved<br>teraFLOPs per<br>GPU | Percentage of<br>theoretical<br>peak FLOPs | Achieved<br>aggregate<br>petaFLOPs |
|---------------|-------------|---------------------|--------------------------------------|------------------------|-------------------|------------|----------------------------------|--------------------------------------------|------------------------------------|
| 1.7B          | 2304        | 24                  | 1.7                                  | 1                      | 32                | 512        | 137                              | 44%                                        | 4.4                                |
| 3.6B          | 3072        | 30                  | 3.6                                  | 2                      | 64                | 512        | 138                              | 44%                                        | 8.8                                |
| 7.5B          | 4096        | 36                  | 7.5                                  | 4                      | 128               | 512        | 142                              | 46%                                        | 18.2                               |
| 18B           | 6144        | 40                  | 18.4                                 | 8                      | 256               | 1024       | 135                              | 43%                                        | 34.6                               |
| 39B           | 8192        | 48                  | 39.1                                 | 16                     | 512               | 1536       | 138                              | 44%                                        | 70.8                               |
| 76B           | 10240       | 60                  | 76.1                                 | 32                     | 1024              | 1792       | 140                              | 45%                                        | 143.8                              |
| 145B          | 12288       | 80                  | 145.6                                | 64                     | 1536              | 2304       | 148                              | 47%                                        | 227.1                              |
| 310B          | 16384       | 96                  | 310.1                                | 128                    | 1920              | 2160       | 155                              | 50%                                        | 297.4                              |
| 530B          | 20480       | 105                 | 529.6                                | 280                    | 2520              | 2520       | 163                              | 52%                                        | 410.2                              |
| 1T            | 25600       | 128                 | 1008.0                               | 512                    | 3072              | 3072       | 163                              | 52%                                        | 502.0                              |
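The memory limit can be made concrete with a rough estimate. The sketch below assumes mixed-precision Adam training, which keeps about 16 bytes of state per parameter (FP16 weights and gradients plus FP32 master weights, momentum, and variance), and treats an A100 80GB as 80 x 10^9 bytes. Both figures are assumptions for illustration, activation memory is ignored, and the function names are hypothetical.

```python
import math

# Assumption: 2 + 2 bytes (FP16 weights, gradients) + 4 + 4 + 4 bytes
# (FP32 master weights, Adam momentum, Adam variance) per parameter.
BYTES_PER_PARAM = 16
# Assumption: an A100 80GB is treated as exactly 80e9 bytes of usable memory.
GPU_MEMORY_BYTES = 80 * 10**9

def training_state_bytes(params_billion):
    """Bytes needed for weights, gradients, and optimizer state."""
    return params_billion * 1e9 * BYTES_PER_PARAM

def min_gpus_for_state(params_billion):
    """Smallest GPU count whose combined memory holds the training state."""
    return math.ceil(training_state_bytes(params_billion) / GPU_MEMORY_BYTES)

# Parameter counts (in billions) taken from the table above.
for name, params in [("1.7B", 1.7), ("530B", 529.6), ("1T", 1008.0)]:
    terabytes = training_state_bytes(params) / 1e12
    print(f"{name}: {terabytes:.2f} TB of state, at least {min_gpus_for_state(params)} GPUs")
```

Under these assumptions the 530B model needs roughly 8.5 TB for weights and optimizer state alone, so it cannot fit on one GPU and must be partitioned.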

## **DISTRIBUTED TRAINING IS NECESSARY**

- Data Parallelism
- Model Parallelism
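A minimal sketch of data parallelism, assuming a one-parameter linear model with a mean-squared-error loss and equally sized shards (all names are hypothetical): each worker computes a gradient on its shard of the batch, and averaging the local gradients reproduces the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_workers):
    """Each worker computes a gradient on its own shard; the results are averaged."""
    shard = len(xs) // num_workers  # assumes the batch divides evenly
    local_grads = [
        mse_grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Averaging the local gradients is the step an all-reduce performs in practice.
    return sum(local_grads) / num_workers

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(mse_grad(0.5, xs, ys), data_parallel_grad(0.5, xs, ys, num_workers=2))
```

Every worker holds a full copy of the model here. Model parallelism instead splits the model itself, which is what makes models larger than one GPU's memory trainable.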

#### DATA PARALLELISM VS. MODEL PARALLELISM





#### Model Parallelism

Data Parallelism

## DATA COMMUNICATION IN DISTRIBUTED COMPUTING

MPI (Message Passing Interface) - https://www.mpi-forum.org/

- API for sending and receiving messages between tasks or processes
- A way of data communication between distributed processes
- Point-to-point communication & Collective communication
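The difference between point-to-point and collective communication can be sketched without an MPI installation. The toy communicator below is not the MPI API: threads stand in for processes, and `ToyComm`, `allreduce_sum`, and `run_allreduce` are hypothetical names. It shows a collective (all-reduce) built from point-to-point sends and receives.

```python
import queue
import threading

class ToyComm:
    """In-process stand-in for a communicator: one mailbox per (source, destination)."""
    def __init__(self, size):
        self.size = size
        self.boxes = {(s, d): queue.Queue() for s in range(size) for d in range(size)}

    def send(self, value, src, dst):   # point-to-point send
        self.boxes[(src, dst)].put(value)

    def recv(self, src, dst):          # blocking point-to-point receive
        return self.boxes[(src, dst)].get()

def allreduce_sum(comm, rank, value):
    """Collective built from point-to-point calls: reduce to rank 0, then broadcast."""
    if rank == 0:
        total = value
        for src in range(1, comm.size):
            total += comm.recv(src, 0)
        for dst in range(1, comm.size):
            comm.send(total, 0, dst)
        return total
    comm.send(value, rank, 0)
    return comm.recv(0, rank)

def run_allreduce(size):
    """Run one all-reduce in which rank r contributes the value r + 1."""
    comm = ToyComm(size)
    results = [None] * size

    def worker(rank):
        results[rank] = allreduce_sum(comm, rank, rank + 1)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_allreduce(4))
```

Routing everything through rank 0 is the simplest possible scheme and makes rank 0 a bottleneck. Production libraries use ring or tree patterns to spread the traffic.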



## **MESSAGE PASSING IN GPU SYSTEMS**

Collective Communication is Important in Large-scale GPU Clusters



## **INSIDE GPU SERVER – V100 NVLINK INTERCONNECT**

No NVSwitch



**GPU NVLINK Topology** 



## **INSIDE GPU SERVER – A100 NVLINK INTERCONNECT**

#### No NVSwitch in 4-GPU Nodes, NVSwitch in 8-GPU Nodes



NVLink without NVSwitch: 200 GB/s



NVLink with NVSwitch: 600 GB/s
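To see what the two bandwidth figures mean for training, the sketch below uses the standard bandwidth-only cost model for a ring all-reduce, in which each GPU sends and receives 2(N-1)/N times the message size. It assumes the quoted 200 GB/s and 600 GB/s are the rates each GPU can sustain during the collective, ignores latency, and uses a hypothetical 1 GB gradient buffer.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce (latency ignored)."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / bandwidth_bytes_per_s

GRADIENT_BYTES = 1e9  # hypothetical 1 GB gradient buffer

# 4-GPU node without NVSwitch versus 8-GPU node with NVSwitch.
t_4gpu = ring_allreduce_seconds(GRADIENT_BYTES, num_gpus=4, bandwidth_bytes_per_s=200e9)
t_8gpu = ring_allreduce_seconds(GRADIENT_BYTES, num_gpus=8, bandwidth_bytes_per_s=600e9)
print(f"4 GPUs @ 200 GB/s: {t_4gpu * 1e3:.2f} ms")
print(f"8 GPUs @ 600 GB/s: {t_8gpu * 1e3:.2f} ms")
```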

## **INSIDE GPU SERVER – PCIE INTERCONNECT**

Hierarchical Topology with PCIe Switch



## NETWORK INTERCONNECT FOR GPU CLUSTER

200G HDR InfiniBand with Non-blocking Fat-tree Topology



## **GPUDIRECT**

#### Direct Data Communication Between GPU and Peripheral Devices

[Figure: packet receive path with Rivermax: (1) receive incoming packets, (2) DMA packet to host memory, (3) DMA packet from host memory to GPU memory, (4) trigger the GPU kernel. Metrics shown: full copy operations, PCIe transactions, PCIe latency, CPU usage, GPU utilization.]

#### Classic data processing



#### GPUDirect RDMA

## **INTER-GPU COMMUNICATION**

Need to consider heterogeneous environment



## NCCL (NVIDIA COLLECTIVE COMMUNICATION LIBRARY)

Optimized Inter-GPU Communication Library for Large-scale GPU Clusters
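Ring all-reduce is one of the collective algorithms a library such as NCCL can use. The pure-Python simulation below is an illustrative sketch rather than NCCL code: lists stand in for GPU buffers, `ring_allreduce` is a hypothetical name, and the buffer length is assumed to be divisible by the number of ranks.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over n equal-length buffers, one per rank."""
    n = len(buffers)
    chunk = len(buffers[0]) // n  # assumes the length divides evenly
    data = [list(b) for b in buffers]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n to rank r + 1,
    # which adds it to its own copy. After n - 1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2, all-gather: completed chunks travel once more around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload
    return data

gpu_buffers = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
for rank, result in enumerate(ring_allreduce(gpu_buffers)):
    print(rank, result)
```

Each rank only ever talks to its ring neighbor and moves one chunk per step, so the load is spread evenly across links instead of concentrating on one root.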



## NCCL ARCHITECTURE

Optimized for All Platforms

- Topology Detection
- Graph Search
- Graph Connect
- Optimized CUDA Kernels



#### SUMMARY

- Hyperscale AI Research: Supercomputer needed
- Distributed Training: Model Parallelism + Data Parallelism
- Supercomputer: SW platform should understand HW architecture well

## Distributed Training for Large-scale NLP Research on Supercomputer

## **NVIDIA MEGATRON-LM**

Transformer-based Framework for Training Multi-Billion Parameter Language Models

- Optimized for Training Large NLP Models
  - Model Parallel (Tensor / Pipeline Parallel)
  - Data Parallel
  - Multi-Node Training
  - Automatic Mixed Precision (FP16)

Repo: https://github.com/NVIDIA/Megatron-LM



#### MODEL PARALLELISM IN TRANSFORMER-BASED MODEL

- Intra-layer (Tensor) Parallelism
  - Parallel GEMM (General Matrix Multiplication)

- Inter-layer (Pipeline) Parallelism
  - Minibatch splitting and Pipeline bubble





https://github.com/NVIDIA/Megatron-LM

## HOW TENSOR PARALLELISM WORKS

Row-wise Parallel GEMMs



## HOW TENSOR PARALLELISM WORKS

Column-wise Parallel GEMMs



X

 $Y = [Y_1, Y_2]$ 
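Both GEMM splits can be checked on a small example. In the sketch below, plain Python lists stand in for tensors and all function names are hypothetical. The column-wise split computes $Y_i = X A_i$ on each worker and concatenates the results, while the row-wise split computes partial products $X_i A_i$ that must be summed (an all-reduce in a real system).

```python
def matmul(a, b):
    """Plain dense matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_parallel(x, a, parts):
    """Split A by columns: worker i computes Y_i = X A_i; outputs are concatenated."""
    width = len(a[0]) // parts
    partial = [matmul(x, [row[i * width:(i + 1) * width] for row in a]) for i in range(parts)]
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

def row_parallel(x, a, parts):
    """Split A by rows and X by columns: Y is the sum of X_i A_i over workers."""
    height = len(a) // parts
    partial = [
        matmul([row[i * height:(i + 1) * height] for row in x], a[i * height:(i + 1) * height])
        for i in range(parts)
    ]
    # Element-wise sum of the partial outputs (the all-reduce step).
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]
A = [[1, 0], [0, 1], [2, 1], [1, 2]]
print(matmul(X, A))
print(column_parallel(X, A, parts=2))
print(row_parallel(X, A, parts=2))
```

All three calls print the same matrix. The column split needs no communication to produce its shards, whereas the row split needs one reduction.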

## HOW TENSOR PARALLELISM WORKS

How Tensor Parallelism Works in a Linear Layer

Row Parallel Linear Layer



Column Parallel Linear Layer

How Tensor Parallelism Works in Fused MLP
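A sketch of the fused MLP split, following the scheme described for Megatron-LM: the first weight matrix is split by columns and the second by rows, so GeLU can be applied to each shard independently and only one all-reduce is needed in the forward pass. Plain Python lists stand in for tensors, biases and dropout are omitted, and the function names are hypothetical.

```python
import math

def gelu(v):
    """Exact GeLU using the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mlp_serial(x, a, b):
    """Reference: Z = GeLU(X A) B on a single device."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(hidden, b)

def mlp_tensor_parallel(x, a, b, parts):
    """Column-parallel first GEMM, row-parallel second GEMM, one final sum."""
    width = len(a[0]) // parts
    partial = []
    for i in range(parts):
        a_i = [row[i * width:(i + 1) * width] for row in a]  # column shard of A
        b_i = b[i * width:(i + 1) * width]                   # matching row shard of B
        hidden_i = [[gelu(v) for v in row] for row in matmul(x, a_i)]  # no communication
        partial.append(matmul(hidden_i, b_i))
    # The only communication in the forward pass: sum the partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial)]

X = [[0.5, -1.0], [2.0, 0.25]]
A = [[1.0, -0.5, 0.3, 2.0], [0.7, 0.1, -1.2, 0.4]]
B = [[0.2, 1.0], [-0.3, 0.5], [1.1, -0.7], [0.6, 0.9]]
print(mlp_serial(X, A, B))
print(mlp_tensor_parallel(X, A, B, parts=2))
```

Splitting the first matrix by rows instead would require a reduction before GeLU, because GeLU is nonlinear and cannot be applied to partial sums.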





How Tensor Parallelism Works in Fused Self-Attention



[Figure labels: Self-Attention and Attention Dropout; Attention Output]


Fused Self-Attention:






Putting it All Together



#### HOW PIPELINE PARALLELISM WORKS






#### **TENSOR PARALLELISM VS. PIPELINE PARALLELISM IN GF**

#### **Tensor Parallelism**



Communication expensive

Good performance across batch sizes

#### **Pipeline Parallelism**



Communication cheap

Good performance at larger batch sizes (pipeline stall amortized)
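The amortization of the pipeline stall can be quantified with a simple model. The sketch below assumes a GPipe-style schedule in which every microbatch takes the same time on every stage, so that with `p` stages and `m` microbatches the idle (bubble) share of the total time is (p - 1) / (m + p - 1). The function name is hypothetical.

```python
def bubble_fraction(stages, microbatches):
    """Share of pipeline time spent idle under an idealized GPipe-style schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# More microbatches per batch (a larger batch size) shrink the bubble.
for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} microbatches: {bubble_fraction(4, m):.1%} idle")
```

With a single microbatch, three of the four stages are idle at any time. Splitting the batch into many microbatches keeps all stages busy for most of the step.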

## HYPERSCALE LM TRAINING IN MEGATRON-LM

- Model Parallelism: Architecture-dependent NCCL
  - Tensor Parallelism: Intra-node communication using NVLink
  - Pipeline Parallelism: Inter-node communication using InfiniBand
- Data Parallelism
  - Data Sharding for Reducing Training Time
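The three kinds of parallelism multiply together to give the GPU count. The sketch below is illustrative, with hypothetical function and parameter names. It assumes 8 GPUs per node so that a tensor-parallel group stays on NVLink, and it uses the MT-NLG 530B configuration described in the blog post cited earlier (8-way tensor parallelism within a node, 35-way pipeline parallelism across nodes) on the 2520 GPUs listed in the Megatron table.

```python
def parallel_layout(num_gpus, tensor_parallel, pipeline_parallel, gpus_per_node=8):
    """Split a cluster into tensor, pipeline, and data parallel dimensions."""
    if tensor_parallel > gpus_per_node:
        # Tensor parallelism is communication heavy, so keep it inside one node.
        raise ValueError("tensor-parallel group would span nodes and leave NVLink")
    model_parallel = tensor_parallel * pipeline_parallel
    if num_gpus % model_parallel != 0:
        raise ValueError("GPU count must be a multiple of the model-parallel size")
    return {
        "model_parallel": model_parallel,              # GPUs holding one model replica
        "data_parallel": num_gpus // model_parallel,   # number of model replicas
    }

print(parallel_layout(num_gpus=2520, tensor_parallel=8, pipeline_parallel=35))
```

This gives a model-parallel size of 280, matching the table, and 9 data-parallel replicas, each training on its own shard of the data.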






## SCALABILITY IN MEGATRON-LM

#### Almost Linear Scaling Efficiency

| Model<br>size | Hidden size | Number of<br>layers | Number of<br>parameters<br>(billion) | Model-parallel<br>size | Number of<br>GPUs | Batch size | Achieved<br>teraFLOPs per<br>GPU | Percentage of<br>theoretical<br>peak FLOPs | Achieved<br>aggregate<br>petaFLOPs |
|---------------|-------------|---------------------|--------------------------------------|------------------------|-------------------|------------|----------------------------------|--------------------------------------------|------------------------------------|
| 1.7B          | 2304        | 24                  | 1.7                                  | 1                      | 32                | 512        | 137                              | 44%                                        | 4.4                                |
| 3.6B          | 3072        | 30                  | 3.6                                  | 2                      | 64                | 512        | 138                              | 44%                                        | 8.8                                |
| 7.5B          | 4096        | 36                  | 7.5                                  | 4                      | 128               | 512        | 142                              | 46%                                        | 18.2                               |
| 18B           | 6144        | 40                  | 18.4                                 | 8                      | 256               | 1024       | 135                              | 43%                                        | 34.6                               |
| 39B           | 8192        | 48                  | 39.1                                 | 16                     | 512               | 1536       | 138                              | 44%                                        | 70.8                               |
| 76B           | 10240       | 60                  | 76.1                                 | 32                     | 1024              | 1792       | 140                              | 45%                                        | 143.8                              |
| 145B          | 12288       | 80                  | 145.6                                | 64                     | 1536              | 2304       | 148                              | 47%                                        | 227.1                              |
| 310B          | 16384       | 96                  | 310.1                                | 128                    | 1920              | 2160       | 155                              | 50%                                        | 297.4                              |
| 530B          | 20480       | 105                 | 529.6                                | 280                    | 2520              | 2520       | 163                              | 52%                                        | 410.2                              |
| 1T            | 25600       | 128                 | 1008.0                               | 512                    | 3072              | 3072       | 163                              | 52%                                        | 502.0                              |

https://github.com/NVIDIA/Megatron-LM
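The columns of the table are related by simple arithmetic, which the sketch below checks. It assumes a peak of 312 teraFLOPs per A100 (the dense FP16/BF16 Tensor Core figure). The rows are copied from the table and the helper names are hypothetical.

```python
A100_PEAK_TFLOPS = 312  # assumption: A100 dense FP16/BF16 Tensor Core peak

# (model, GPUs, teraFLOPs per GPU, % of peak, aggregate petaFLOPs) from the table.
ROWS = [
    ("1.7B", 32, 137, 44, 4.4), ("3.6B", 64, 138, 44, 8.8),
    ("7.5B", 128, 142, 46, 18.2), ("18B", 256, 135, 43, 34.6),
    ("39B", 512, 138, 44, 70.8), ("76B", 1024, 140, 45, 143.8),
    ("145B", 1536, 148, 47, 227.1), ("310B", 1920, 155, 50, 297.4),
    ("530B", 2520, 163, 52, 410.2), ("1T", 3072, 163, 52, 502.0),
]

def percent_of_peak(tflops_per_gpu):
    return round(100 * tflops_per_gpu / A100_PEAK_TFLOPS)

def aggregate_pflops(gpus, tflops_per_gpu):
    return gpus * tflops_per_gpu / 1000

for model, gpus, tflops, pct, agg in ROWS:
    print(f"{model:>5}: {percent_of_peak(tflops)}% of peak (table: {pct}%), "
          f"{aggregate_pflops(gpus, tflops):.1f} PFLOPs (table: {agg})")
```

Per-GPU throughput stays between 135 and 163 teraFLOPs while the GPU count grows from 32 to 3072, which is what the near-linear scaling claim refers to. The small differences in the aggregate column come from rounding of the per-GPU figures.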

## **Next-generation Supercomputer Architecture**

#### LIMITS OF EXISTING COMPUTER ARCHITECTURE



## **NVIDIA GRACE**

#### Available in 2023



- ARM for Datacenter CPU
- NVLink between CPU and GPU
- LPDDR5x with ECC

#### **EVOLVING DATACENTER COMPUTING ARCHITECTURE**






## **GPUDIRECT STORAGE**

#### WITHOUT GPUDIRECT STORAGE



Low Bandwidth | High Latency | Limited Capacity

#### WITH GPUDIRECT STORAGE

Higher Bandwidth | Lower Latency | O(PB) Capacity | CUDA Programming Model



## REFERENCE

- NVIDIA Megatron: https://github.com/NVIDIA/Megatron-LM
- NVIDIA A100: https://www.nvidia.com/en-us/data-center/a100/
- DGX SuperPOD: https://images.nvidia.com/aem-dam/Solutions/Data-Center/gated-resources/nvidia-dgx-superpod-a100.pdf
- NCCL: https://developer.nvidia.com/nccl
- GPUDirect: https://developer.nvidia.com/gpudirect
- NVIDIA Grace: https://www.nvidia.com/en-us/data-center/grace-cpu/
- MT-NLG: https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
- Microsoft DeepSpeed: https://github.com/microsoft/DeepSpeed

# Thank you