YouTube Transcript:
HOTI 2023 - Day 1: Session 2 - Keynote by Bill Dally (NVIDIA): Accelerator Clusters

Summary

Core Theme

The keynote by Bill Dally emphasizes the critical role of high-performance networking in advancing Artificial Intelligence (AI), particularly in large-scale model training, and outlines current and future technological directions to meet these demands.

Okay,

today we are honored to have Bill Dally.

So I would define, uh, Bill Dally's

keynotes as the yardstick by which we

measure advances in high performance

computing and high performance

which means that every few years you go

and listen, uh, to a Bill Dally keynote, you

know where we are in terms of state of the art.

So this is Bill's, uh, bio, taken from the

web page. Obviously, if we had to go

through this there wouldn't be any keynote,

because there are so many achievements

and and it will take probably the full

hour or more. So today I'm going to

give a very short but very personal

introduction of Bill Dally. So

this introduction is as follows. Believe

it or not, Bill, you changed my life.

So I was a graduate student and I

stumbled on some lecture notes from a

summer school for students that you gave

I believe in Canada in B at the end of

the 80s. Okay. So this was early 90s for

me and then I came into your notes and I

had this uh I mean crystal clear

definition of what the network is. The

network is a topology is routing and is

a flow control. And guess what? The most

important thing is flow control is not

exactly the topology and the routing.

Okay, spoiler alert.

And then when I saw this, I said, who is

this William J. Dally? This guy is

amazing. And then I went into this thing

and I discovered the virtual channels.

Okay. And then I went into some of your

papers. I mean I read this paper I'm not

kidding you maybe 10 or 15 times. Okay.

And uh so the short story is that the

Bill Dally that we know today is not a

coincidence. Okay. So it was

already there. It's just doomed over

there. And without further ado, I think we can

start with the keynote speaker.

Okay. So let me um share my screen.

Thank you for that. That was definitely a

blast from the past. I don't think I've

looked at my uh PhD thesis in several

decades. Um but uh you know it's it's

interesting you know how many things

stay the same and how many things

change. Um you know I uh wrote my first

paper on interconnection networks in the

1970s and then I worked at Caltech um

you know, with Chuck Seitz on the

Cosmic Cube, and after we had the Cosmic

Cube working I was not happy with you

know I was writing programs for I was

not happy with the performance of the

interconnect and so you know spent a lot

of time thinking what an ideal

interconnect would be both in terms of

topology and in terms of flow control

and and that's where the you know

virtual channels and the move to, uh, you know,

low-dimensional torus topologies um came

about. Um so so the real focus then was

on running scientific applications. We

wanted to run big numerical simulations

of physical processes. Um and many

things are the same today but the

problem is different. What we really

want to do today is is is run AI because

AI is everywhere. We see it um you know

revolutionizing all sorts of life.

It's uh already taking a big role in

education. I can see very quickly that

we're going to have individual AI tutors

for every student that knows how to

motivate them, understands how they

learn and can tailor the delivery of

material to, you know, basically both

motivate them and match how they can best

absorb it. It's already making a huge

impact in medicine both in analyzing

images of all sorts as well as in, you

know, mining lots of data to do

diagnosis. Um, all forms of

entertainment are are using AI in one

form or another. Um, things like AI

copilots are already giving a 1.5x

improvement in productivity for

programmers and I expect that that will

um you know go get much larger and in

chip design we are already um you know

applying AI the little cartoon I have

here is for a program we have called

PrefixRL that designs optimum carry

chains by viewing it almost as a video

game where you stick in the look-ahead

element, the little green dot on the

bottom right, and it has

done carry chains that beat any human

design and they're bizarre designs.

Humans would not come up with them. Um,

and so AI is everywhere and um, it's

really kind of fun being a hardware

designer because AI has been enabled by

by hardware and in particular by GPUs.

There are three ingredients that make AI

work. Um, there are algorithms, which I sort

of illustrated by the I don't even see

the mouse. Um, illustrated by the AlexNet

graph on the right here. Um, these

algorithms for the most part have been

around since the 1980s. um deep neural

networks, convolutional neural networks,

and training them with, um, backpropagation

and stochastic gradient

descent have all been around since the

1980s. Um so it takes algorithms, it

takes data, large amounts of data

labeled or unlabeled um and then the

third ingredient is compute. And it

wasn't until we had enough compute that

we could train a large enough model and

on a large enough data set um in a

reasonable amount of time that the AI

revolution really took off you know on

the order of of 10 years ago. Um now

that we have that compute um you can say

that was back in 2012 with AlexNet um

the growth has been phenomenal. um you

know during the time that people were

doing um largely convnets for images

those grew by about two orders of

magnitude in the demand, the petaflop-days

of training um over about a three-year

period now that we're in the large

language model regime um we see you know

three orders of magnitude over over a

three-year period and the little dot in

the upper right here is where I estimate

GPT-4 is, you know, 10 to the six

petaflop-days. You can think of that as a

thousand exaflop-days, a thousand days

on an exaflop machine, to train GPT-4

in 2023.
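As a quick check of the unit conversion in that estimate, here is a minimal sketch. The 10^6 petaflop-day figure is the speaker's own estimate for GPT-4, not a measured value, and the variable and function names are made up for illustration.

```python
# Unit conversion for the training-compute estimate quoted in the talk.
PETA = 10**15
EXA = 10**18


def petaflop_days_to_exaflop_days(pf_days):
    """Convert petaflop-days to exaflop-days (1 exaflop = 1,000 petaflops)."""
    return pf_days * PETA / EXA


# The speaker's estimate for GPT-4 (an estimate, not a published number).
gpt4_pf_days = 10**6
gpt4_ef_days = petaflop_days_to_exaflop_days(gpt4_pf_days)
print(gpt4_ef_days)  # 1000.0, i.e. a thousand days on a one-exaflop machine

# "Three orders of magnitude over a three-year period" works out to
# roughly a 10x increase in training compute per year.
llm_growth_per_year = (10**3) ** (1 / 3)

# "Two orders of magnitude over about three years" (the convnet era)
# is a bit under 5x per year.
convnet_growth_per_year = (10**2) ** (1 / 3)
print(round(llm_growth_per_year, 2), round(convnet_growth_per_year, 2))
```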

Um so um how how does the AI world

differ from the HPC world? Well, people

care about it. if if care is defined by

money. Um the the market for AI training

and inference is expected to be $300

billion in 2026

whereas the you know supercomputer

segment of the HPC market might might

hit 10 billion that year if if we're

lucky. Um so it's 30 times larger. Um

It's dominated by low precision. Um, you

know, FP16 and lower. Um we're

really using mostly FP8 um now. and and

I'm hoping that if we're very clever we

can even drive it down below that um as

opposed to, you know, typically FP64.

The dominant operation is a matrix

multiply and um when you look at how

it's used in these applications it's

limited by both compute and communication.

um you know it's interesting that there

there is an export restriction on our um

A100 and H100 parts now we cannot ship

them um to China. Um but we can ship the

A800 and H800 that have one-third of

the communication bandwidth um to stay

under the um the needs of some of these

most demanding AI applications and also

the applications um have very well

understood persistent traffic

patterns. Now in contrast most HPC

applications are actually memory

bandwidth limited. Um it's too bad that

everybody runs the High Performance LINPACK

benchmark, because it's not like

that, but the applications that people

really care about you know in

hydrodynamics, in radiation transport, in

uh, you know, climate modeling

are all memory bandwidth limited and and

wind up using a tiny fraction of the

compute because they're just saturating

the memory and also because they're

saturating the memory they're typically

not communication limited so they're

not actually stressing the network that

much which doesn't give us a hard enough

problem as network designers. Um but on

the AI side it is communication limited

and we understand the traffic pattern.

It's a large and growing market. So we

can specialize the network um for the

needs of AI. Um so what what does AI

need from the network and so you have to

look at how we get parallelism out of an

AI application. Um for as long as you

can you want to do what's called data

parallelism because it's very simple.

you simply create two copies of the AI

model you're running and um you

basically take your data set and you run

part of the data set on on one copy part

of the data set on the other copy and

they exchange parameters they exchange

the gradients I'll have a little more

detail on that in another slide. Um, there are

times where you can't do that. So for

example um just to hold the parameters

of GPT-4 takes over 20 GPUs just to fit

in the memory um and so there you have

to break the model up and run part of

the model um on one device and part of

the model on the other device. There are

kind of two ways you can slice this. If

you slice the model horizontally um

you're basically taking individual

matrices um and you're basically

decomposing those matrices so that part

of the matrix is on one GPU and part of

the matrix is on another GPU. If you

slice this model vertically um you're

basically taking different layers of the

network and putting them on one GPU and

other layers on the other GPU. And that's

called pipeline parallelism, versus

tensor parallelism when you slice the matrices. Um, and so the pipeline

parallelism and tensor parallelism

together are called model parallelism

because you're parallelizing the model. And

then the data parallelism is you're

parallelizing over the data set that

you're training on. Um, and if you

combine all three of these, you

basically get the most parallelism. And

you need that to train these large

language models. You need to run on

clusters of thousands to tens of

thousands of GPUs. um and it takes you

know 20 GPUs just to hold one copy of

the GPT-4 model. You'll then have, you

know, a hundred of those copies for, you

know, a 2,000-GPU training um regime. So

let's start with with data parallelism.

So each individual, um, you know,

GPU in data parallelism runs the whole

model. It takes a batch of training data

whether it's images or whether it's

tokens um if you're doing um large

language models and you do um the you

know forward and backward pass over that

batch and you compute a set of gradients

which are the changes that you're going

to apply to the parameters and then you

want to apply those changes not just to

your parameters but if you've got a

thousand GPUs all you know running a you

know a thousand different you know

batches um you're going to um I should

say subsets of that batch. Um, you

want to combine those gradients and

apply them all to the parameters at

once. So, if you had a batch of 256

images and you ran it over 128 GPUs,

you'd be running two images on on each

GPU. We usually run much larger batches

over even larger numbers of of GPUs. But

the the operation here that you want

your network to be really good at is

all-reduce, right? Everybody is basically

adding a bunch of numbers, you know,

into this um set of parameters. And so

you want to basically take those, do the

all-reduce for each parameter, and then,

you know, sum, then distribute them um to

everybody. Um for model parallelism

there are a couple ways if you're doing

the tensor parallelism there are a

couple ways you can slice um the tensor.

You can split, you know, X column-wise and

A row-wise, or you can split, oops,

jumped too fast. You can split A

column-wise and X row-wise. And you

really want to do the latter because if

you do it um this way um you wind up

having to do an add um you know

synchronize, make sure these are

both done and then do the add and then do

the GeLU, which is a nonlinear

operation. Um if you um do it the other

way it's completely independent. You can

actually do these two GeLUs and you

just output the data. There's no

synchronization, no global add that's

required. Um so if you look at what you

do need out of the network here um on

the forward pass you need to take this

input X and basically copy it to both

sides. So this is basically you know a a

broadcast. Um and on the output it's an

all-reduce. And remember you're typically

not just doing it two ways. You're

typically doing it you know 10 or 100

ways. Um so it's a big broadcast and a

big all-reduce. Um and then on the back

propagation it's exactly reversed. You

wind up doing a broadcast with G and an

all-reduce with F. Um the other place

you see a lot of communication in neural

networks is in recommender systems and

the communication there is a need to

access these large embedding tables.

It's not unusual to have recommenders

where the aggregate embedding tables are

terabytes. They don't fit on one GPU or

even one CPU node. And so you wind up

having a communication of taking the um

you know the the words you're trying to

look up and running and accessing the

embedding tables for them. there's often

a reduction within the embedding table

and then and then a communication back.
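The two ways of slicing a tensor-parallel layer described above can be sketched on a toy example. This is only an illustration under assumptions that are not in the talk: the layer is taken to be Y = GeLU(X A), two GPUs are simulated with plain Python lists, and the helper names (`matmul`, `gelu`, `all_reduce_sum`) are hypothetical. In the column split, X is kept whole on both GPUs, matching the broadcast of X described for the forward pass.

```python
import math


def matmul(x, a):
    """Plain-Python matrix product of x (m x k) and a (k x n)."""
    return [[sum(x[i][p] * a[p][j] for p in range(len(a)))
             for j in range(len(a[0]))] for i in range(len(x))]


def gelu(m):
    """Element-wise GeLU (tanh approximation), standing in for the nonlinearity."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3)))
             for v in row] for row in m]


def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices, one per simulated GPU."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


X = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.5, -1.0, 2.0]]            # 2 x 4 activations
A = [[0.1, -0.2, 0.3, 0.4],
     [0.5, 0.6, -0.7, 0.8],
     [-0.9, 1.0, 1.1, -1.2],
     [1.3, -1.4, 1.5, 1.6]]            # 4 x 4 weights

reference = gelu(matmul(X, A))         # single-GPU result

# Slicing 1: X by columns, A by rows. Each GPU holds only a partial sum,
# so the partials must be added across GPUs BEFORE the nonlinearity.
x_cols = [[row[:2] for row in X], [row[2:] for row in X]]
a_rows = [A[:2], A[2:]]
partials = [matmul(x_cols[g], a_rows[g]) for g in range(2)]
row_split = gelu(all_reduce_sum(partials))

# Applying GeLU to each partial first gives a different (wrong) answer,
# which is why this slicing forces a synchronization point.
gelu_before_sum = all_reduce_sum([gelu(p) for p in partials])

# Slicing 2: A by columns, X copied to both GPUs. Each GPU owns complete
# output columns, so it applies GeLU on its own with no global add.
a_cols = [[row[:2] for row in A], [row[2:] for row in A]]
shards = [gelu(matmul(X, a_cols[g])) for g in range(2)]
col_split = [shards[0][i] + shards[1][i] for i in range(len(X))]  # join columns
```

Both `row_split` and `col_split` reproduce `reference`, but only the first needs a cross-GPU add in front of the GeLU, while `gelu_before_sum` does not match.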

So how do we meet this need today? Um so

today at NVIDIA we offer something

called a DGX SuperPOD. So let me walk

you through a SuperPOD um starting at

the individual component and working

upward. So the individual GPU is a

Hopper H100. Um the GPU chip itself is

this little rectangle in the middle. Um

these uh six darker rectangles around it

are stacks of HBM3 memory. Um there's an

aggregate 94 gigabytes of HBM3 memory um

with 3.4 terabytes per second of

bandwidth into the GPU. It's an enormous

amount of bandwidth. Um everybody always

thinks these things around on the SXM

module are memory chips. These are

inductors. Delivering the power to this

thing. Um and you know 700 watts at

about 0.7 volts is a kiloamp of current, and

and doing that efficiently is actually

quite a challenging technical problem.

Um there's a lot of neat features on on

the H100 like the transformer engine to

uh you know um facilitate the use of

reduced precision in running modern

transformer models. We actually have

dynamic programming instructions for

bioinformatics. But for the purpose of

this discussion on networking, those

aren't particularly relevant. Um the way

to think about this is it's a component

of our system that delivers four petaflops

of sparse FP8 performance and 900

gigabytes per second of external

bandwidth. That's the bandwidth of the

NVLinks coming out of this card, at

700 watts. Um so the next step up in

building the DGX SuperPOD is we take

eight of these you know stacked here on

the board and we put them in a system

along with four um you know third

generation NVSwitches and um each of

these uh you know GPUs has you know um

18 NVLink channels coming out of it.

They're spread across, you know, five

and four, five and four across the

NVSwitches and then those are connected to

the back panel. Um, so there are actually

18 NVLinks that come out of the back

panel. And so, one way to

think about it, it's an 8:1 taper at

this level from the bandwidth you have

within these eight GPUs and the next

level of of your network hierarchy. Um

this is the wiring diagram and and the

main reason I'm showing you this is just

to show that there are two separate

networks here. Um the um H100s are are

the GPUs here and the NVLinks on the

H100s um connect down to the

NVSwitches. Um, you know, each of

these is four and five four and five

going across. Um so that, you know,

each H100 can have

full bandwidth all 18 of its links

talking to um any of the other H100s

within its cluster.

And um it if if you excluded everybody

else it could actually get all the

bandwidth out the back panel as well.
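The 8:1 taper follows directly from the link counts quoted in the talk. The sketch below only redoes that arithmetic with the speaker's figures (8 GPUs, 18 NVLinks per GPU, 900 GB/s per GPU, 18 NVLinks out of the back panel). The constant names are made up for illustration, and this is not a description of the actual board wiring.

```python
# Figures as quoted in the talk.
GPUS_PER_BOX = 8
LINKS_PER_GPU = 18
GPU_NVLINK_GB_PER_S = 900      # external NVLink bandwidth per GPU
LINKS_OUT_OF_BOX = 18          # NVLinks leaving via the back panel

# Bandwidth of a single NVLink implied by those figures.
per_link_gb_per_s = GPU_NVLINK_GB_PER_S / LINKS_PER_GPU

# Links between the GPUs and the switches inside one box.
links_inside_box = GPUS_PER_BOX * LINKS_PER_GPU

# Taper: bandwidth inside the box versus bandwidth to the next level.
taper = links_inside_box / LINKS_OUT_OF_BOX

bandwidth_inside = GPUS_PER_BOX * GPU_NVLINK_GB_PER_S
bandwidth_out = LINKS_OUT_OF_BOX * per_link_gb_per_s

print(per_link_gb_per_s)                # 50.0 GB/s per link
print(links_inside_box, taper)          # 144 8.0
print(bandwidth_inside, bandwidth_out)  # 7200 900.0
```

Note that the outgoing 900 GB/s equals one GPU's full NVLink bandwidth, which is why a single GPU could use all of the back-panel bandwidth if the other seven were idle.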

But there is an 8-to-1 taper at these

NVSwitches. Um and that goes to the

18 OSFP connections for NVLink out the

back panel. Then the PCIe connections out

of the GPU um go to ConnectX-7

NICs, uh, via a PCIe switch and those then

connect to four OSFP connectors, two

CX7s on each OSFP um to build the

InfiniBand network. Um so there are two

separate networks and we tend to think

of the NVLink network as a

scale-up network and the InfiniBand

network as a scale-out network. Um both

of these networks um support SHARP

acceleration. The NVLink SHARP

acceleration um is such that we wind up

um you know basically taking the N reads

that we would have to do with the read

and reduce on um A100, and on H100 with

SHARP um it's basically one read. We

basically do, you know, N reads to send

the partials, the switch does the sum, and

then we read the result that we want. Um

and the broadcast result works in kind

of a similar way. We have to do only one

write and then we get N writes out of

the NVSwitch. So this is

basically a 2x for the all-reduces,

which are a huge part of the

deep learning um workload. This is a

2x reduction in, uh, demand, or the

way to think about it is a 2x increase in

the effective bandwidth we have. Um, so

we take those same NVSwitches um that

we used in the, uh, SuperPOD and

we put two of them in a pizza box and um

that basically is the next level of

interconnect for NVLink, the scale-up

network. Um and um it winds up having

128 ports coming out um the front um

those are spread across the 32

OSFP cages, and um an enormous amount of

you know 6.4 terabytes per second of

of bandwidth. Um, and that lets you

build up um and out depending on how you

want to do it. Um, each of the boxes

here is eight GPUs, a DGX box.

You have 40 um per rack. Um, and then

you can either connect those up um with

NVLinks up to 256 GPUs, 32 of those

boxes, or InfiniBand up to tens of

thousands. Um, and the InfiniBand

director switches are sort of shown here

in the middle of this particular configuration.
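The NVLink switch and scale-up numbers above can be cross-checked with a few lines of arithmetic. This is a rough consistency check, not a specification: the 50 GB/s per-link value is an assumption derived from the 900 GB/s and 18 links per GPU quoted earlier, and the constant names are hypothetical.

```python
# Figures as quoted in the talk.
PORTS_PER_SWITCH_BOX = 128
OSFP_CAGES = 32
BOXES_IN_NVLINK_DOMAIN = 32
GPUS_PER_BOX = 8

# Assumption: per-link bandwidth implied by 900 GB/s over 18 links per GPU.
ASSUMED_LINK_GB_PER_S = 50

# Aggregate bandwidth of the two-switch "pizza box", in TB/s.
switch_box_tb_per_s = PORTS_PER_SWITCH_BOX * ASSUMED_LINK_GB_PER_S / 1000

# NVLink ports carried by each OSFP cage.
ports_per_cage = PORTS_PER_SWITCH_BOX // OSFP_CAGES

# Size of the largest NVLink (scale-up) domain.
gpus_in_nvlink_domain = BOXES_IN_NVLINK_DOMAIN * GPUS_PER_BOX

print(switch_box_tb_per_s)     # 6.4, matching the quoted 6.4 TB/s
print(ports_per_cage)          # 4
print(gpus_in_nvlink_domain)   # 256
```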

Um, and there's some real reasons to to

want to do this on the NVLink side. Um,

from the programming system point of

view, the NVLink network is a load-store

network. So I can on a given GPU once

you've set the memory maps up, I can do

a load or store operation into the

memory of any GPU on the NVLink network.

Um, and that just simplifies programming

um, compared to having to marshal your

data into message buffers and make the

MPI calls and everything else you have

to do over on the InfiniBand

side. But there's also just a lot more

bandwidth here. Um and you know, in

bisection bandwidth, there's nine times as

much bandwidth on the, um, you know,

NVLink network um and uh 4.5 times as

much reduction bandwidth. Um so you know

whether it's on HPC applications um AI

inference um you know running up to 30x

speedups or AI training up to 10x

speedups, there's big advantages to

that communication bandwidth and

bisection bandwidth um turn into um you

know for weak scaling if we increase the

size of the model in these large

language models you know you can start

at, you know, a billion parameters

and run up to a trillion parameters um you

know across that range we get perfectly

linear speed up as we increase the model

size and increase the number of GPUs.

What's more impressive is we also get

very close to linear strong scaling. If

we hold the model size constant, the GPT-3 175

billion parameter model um and we scale

it from 64 GPUs to 2K GPUs, it's nearly

linear speed up. Um and that that is a

testament to the very low overhead of um

of scaling on these. Um so so just to

sum up um you know the the current state

of affairs um with the DGX SuperPOD is

um we have these you know wonderful GPUs

the H100s, um, each of which has 18

NVLink ports coming out of it. The, um,

you know, the DGX box takes eight of

those and four switches and does an

8-to-1 reduction. You can then hook up to

you know 32 of those together to make a

256-GPU system, um, that looks like a big GPU

and and if you are willing to pay for

the switches you can hold that 8-to-1

taper so that the bisection bandwidth is you

know 1/8 of the uh aggregate bandwidth

out of the GPUs. Um, above that you

hook it up with um InfiniBand switches

and the InfiniBand network and you can

scale up to tens of thousands of GPUs

and the demands of the large

language models require that. We see many

customers training on tens of thousands

of, uh, GPUs, and you know many

of the problems you have, you know, with

a large scientific computer having to

do, you know, checkpoint restart and the

like happen at that scale. Um, as as

well. In fact, many of these machines

would be number one on the TOP500 list

if people would take them away from the

profitable operations that they're doing

long enough to run HPL on them. Um

so um that's where we are today and and

I should say one thing about the DGX

SuperPOD is that you know compared to

sort of you know trying to acquire the

components of the system and put it

together and bring it up um which is the

way most large supercomputing um

acquisitions are done. Everything is is

preconfigured for the DGX SuperPOD. So

if you buy a DGX SuperPOD you buy the DGX

boxes and the NVLink switches and the

InfiniBand switches and you plug it

together it just works. All the software

is already done. It's been debugged and

the bring-up is a day and not a number of

months. So in terms of looking forward,

let's start with the physical layer. And

I'll start with a cartoon of you know

how one of these systems looks to the

logic designer. So you know on the chip,

we've got a bunch of logic that talks to
