The keynote by Bill Dally emphasizes the critical role of high-performance networking in advancing Artificial Intelligence (AI), particularly in large-scale model training, and outlines current and future technological directions to meet these demands.
Okay,
today we are honored to have Bill Dally.
So I would define Bill Dally
keynotes as the yardstick by which we
measure advances in high performance
computing and high performance networking,
which means that every few years you go
and listen to a Bill Dally keynote and you
know where we are in terms of state of the art.
So this is Bill's bio, taken from the
web page. Obviously if we had to go
through this there wouldn't be any keynote
because there are so many achievements
and it would probably take the full
hour or more. So today I'm going to
give a very short but very personal
introduction of Bill Dally. So
this introduction is as follows. Believe
it or not, Bill, you changed my life.
So I was a graduate student and I
stumbled on some lecture notes from a
summer school for students that you gave
I believe in Canada in B at the end of
the 80s. Okay. So this was the early 90s for
me, and then I came across your notes and I
had this, I mean, crystal clear
definition of what the network is. The
network is topology, routing, and
flow control. And guess what? The most
important thing is flow control; it is not
exactly the topology and the routing.
Okay, spoiler alert.
And then when I saw this, I said, who is
this William J. Dally? This guy is
amazing. And then I went into this thing
and I discovered the virtual channels.
Okay. And then I went into some of your
papers. I mean I read this paper I'm not
kidding you maybe 10 or 15 times. Okay.
And so the short story is that the
Bill Dally that we know today is not a
coincidence. Okay. So it was
already there. It's just doomed over
there. And without further ado, I think we can
start with the keynote speaker.
Okay. So let me um share my screen.
Thank you. That was definitely a
blast from the past. I don't think I've
looked at my uh PhD thesis in several
decades. Um but uh you know it's it's
interesting you know how many things
stay the same and how many things
change. Um you know I uh wrote my first
paper on interconnection networks in the
1970s, and then I worked at Caltech
with Chuck Seitz on the
Cosmic Cube. After we had the Cosmic
Cube working and
I was writing programs for it, I was
not happy with the performance of the
interconnect, and so I spent a lot
of time thinking about what an ideal
interconnect would be, both in terms of
topology and in terms of flow control,
and that's where
virtual channels and the move to
low-dimensional torus topologies came
about. So the real focus then was
on running scientific applications. We
wanted to run big numerical simulations
of physical processes. Um and many
things are the same today but the
problem is different. What we really
want to do today is run AI, because
AI is everywhere. We see it
revolutionizing all sorts of life.
It's uh already taking a big role in
education. I can see very quickly that
we're going to have individual AI tutors
for every student that know how to
motivate them, understand how they
learn, and can tailor the delivery of
material both to
motivate them and to match how they can best
absorb it. It's already making a huge
impact in medicine both in analyzing
images of all sorts as well as in, you
know, mining lots of data to do
diagnosis. All forms of
entertainment are using AI in one
form or another. Things like AI
copilots are already giving a 1.5x
improvement in productivity for
programmers, and I expect that will
get much larger. And in
chip design we are already
applying AI. The little cartoon I have
here is for a program we have called
PrefixRL that designs optimum carry
chains by viewing it almost as a video
game where you stick in the look-ahead
element, the little green dot on the
bottom right, and it has
done carry chains that beat any human
design, and they're bizarre designs.
Humans would not come up with them.
And so AI is everywhere, and it's
really kind of fun being a hardware
designer because AI has been enabled by
by hardware and in particular by GPUs.
There are three ingredients that make AI
work. There are algorithms, which I've
illustrated (I don't even see
the mouse) with the AlexNet
graph on the right here. These
algorithms for the most part have been
around since the 1980s. Deep neural
networks, convolutional neural networks,
and training them with
backpropagation and stochastic gradient
descent have all been around since the
1980s. So it takes algorithms, it
takes data, large amounts of data
labeled or unlabeled um and then the
third ingredient is compute. And it
wasn't until we had enough compute that
we could train a large enough model on
a large enough data set in a
reasonable amount of time that the AI
revolution really took off, you know, on
the order of 10 years ago. Now
that we have that compute (you can say
that was back in 2012 with AlexNet),
the growth has been phenomenal.
During the time that people were
doing largely convnets for images,
those grew by about two orders of
magnitude in the demand, the petaflop-days
of training, over about a three-year
period. Now that we're in the large
language model regime, we see
three orders of magnitude over a
three-year period, and the little dot in
the upper right here is where I estimate
GPT-4 is: 10 to the six petaflop-days.
You can think of that as a
thousand exaflop-days, a thousand days
on an exaflop machine, to train GPT-4
in 2023.
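As a quick check on the units quoted above, the following sketch converts petaflop-days into exaflop-days and into a total operation count. The function and variable names are illustrative, and the 10^6 petaflop-day figure is the speaker's estimate, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETA = 10**15
EXA = 10**18


def petaflop_days_to_exaflop_days(pf_days):
    """Convert petaflop-days to exaflop-days (1 exaflop = 1000 petaflops)."""
    return pf_days * PETA / EXA


def petaflop_days_to_flops(pf_days):
    """Total floating-point operations performed in `pf_days` petaflop-days."""
    return pf_days * PETA * SECONDS_PER_DAY


# Speaker's estimate for GPT-4 training compute.
gpt4_pf_days = 10**6

gpt4_ef_days = petaflop_days_to_exaflop_days(gpt4_pf_days)
gpt4_total_flops = petaflop_days_to_flops(gpt4_pf_days)

print(gpt4_ef_days)                # 1000.0 exaflop-days
print(f"{gpt4_total_flops:.2e}")   # 8.64e+25 operations
```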
So how does the AI world
differ from the HPC world? Well, people
care about it, if care is defined by
money. The market for AI training
and inference is expected to be $300
billion in 2026,
whereas the supercomputer
segment of the HPC market might
hit 10 billion that year if we're
lucky. So it's 30 times larger.
It's dominated by low precision,
FP16 and lower. We're
really using mostly FP8 now, and
I'm hoping that if we're very clever we
can even drive it down below that, as
opposed to, you know, typically FP64.
The dominant operation is a matrix
multiply, and when you look at how
it's used in these applications, it's
limited by both compute and communication.
You know, it's interesting that
there is an export restriction on our
A100 and H100 parts now. We cannot ship
them to China. But we can ship the
A800 and H800, which have one-third of
the communication bandwidth, to stay
under the needs of some of these
most demanding AI applications. And also
the applications have very well
understood, persistent traffic
patterns. Now in contrast, most HPC
applications are actually memory
bandwidth limited. It's too bad that
everybody runs the High-Performance
Linpack benchmark, because it's not like
that, but the applications that people
really care about, you know, in
hydrodynamics, in radiation transport,
in climate modeling,
are all memory bandwidth limited and
wind up using a tiny fraction of the
compute because they're just saturating
the memory and also because they're
saturating the memory they're typically
not communication limited, so they're
not actually stressing the network that
much, which doesn't give us a hard enough
problem as network designers. But on
the AI side it is communication limited
and we understand the traffic pattern.
It's a large and growing market. So we
can specialize the network um for the
needs of AI. So what does AI
need from the network? You have to
look at how we get parallelism out of an
AI application. For as long as you
can, you want to do what's called data
parallelism, because it's very simple.
You simply create two copies of the AI
model you're running, and you
basically take your data set and you run
part of the data set on one copy, part
of the data set on the other copy, and
they exchange
the gradients. I'll have a little more
detail on that in another slide. There
are times when you can't do that. So for
example, just to hold the parameters
of GPT-4 takes over 20 GPUs, just to fit
in the memory. And so there you have
to break the model up and run part of
the model on one device and part of
the model on the other device. There
are kind of two ways you can slice this. If
you slice the model horizontally,
you're basically taking individual
matrices and
decomposing those matrices so that part
of the matrix is on one GPU and part of
the matrix is on another GPU. If you
slice this model vertically, you're
basically taking different layers of the
network and putting some on one GPU and
others on the other GPU. And that's
called pipeline parallelism (the vertical
slicing) versus tensor parallelism (the
horizontal slicing). The pipeline
parallelism and tensor parallelism
together are called model parallelism,
because you're parallelizing the model. And
then with data parallelism you're
parallelizing over the data set that
you're training on. And if you
combine all three of these, you
basically get the most parallelism. And
you need that to train these large
language models. You need to run on
clusters of thousands to tens of
thousands of GPUs. It takes, you
know, 20 GPUs just to hold one copy of
the GPT-4 model. You'll then have, you
know, a hundred of those copies for a
2,000-GPU training regime. So
let's start with data parallelism.
So each individual
GPU in data parallelism runs the whole
model. It takes a batch of training data,
whether it's images or whether it's
tokens if you're doing large
language models, and you do the
forward and backward pass over that
batch and you compute a set of gradients,
which are the changes that you're going
to apply to the parameters. And then you
want to apply those changes not just to
your parameters. If you've got a
thousand GPUs all running
a thousand different
batches (I should
say subsets of that batch), you
want to combine those gradients and
apply them all to the parameters at
once. So, if you had a batch of 256
images and you ran it over 128 GPUs,
you'd be running two images on each
GPU. We usually run much larger batches
over even larger numbers of GPUs. But
the operation here that you want
your network to be really good at is
all-reduce, right? Everybody is basically
adding a bunch of numbers
into this set of parameters. And so
you want to basically take those, do the
all-reduce for each parameter, that is,
sum, and then distribute them to
everybody. For model parallelism
there are a couple of ways. If you're doing
the tensor parallelism, there are a
couple of ways you can slice the tensor.
You can split X column-wise and
A row-wise, or you can split (oops,
jumped too fast) you can split A
column-wise and X row-wise. And you
really want to do the latter, because if
you do it this way, you wind up
having to do an add: you
synchronize, make sure these are
both done, then do the add, and then do
the GeLU, which is a nonlinear
operation. If you do it the other
way, it's completely independent. You can
actually do these two GeLUs and you
just output the data. There's no
synchronization, no global add that's
required. So if you look at what you
do need out of the network here, on
the forward pass you need to take this
input X and basically copy it to both
sides. So this is basically a
broadcast. And on the output it's an
all-reduce. And remember, you're typically
not just doing it two ways. You're
typically doing it, you know, 10 or 100
ways. So it's a big broadcast and a
big all-reduce. And then on the back
propagation it's exactly reversed. You
wind up doing a broadcast with G and an
all-reduce with F. The other place
you see a lot of communication in neural
networks is in recommender systems, and
the communication there is a need to
access these large embedding tables.
It's not unusual to have recommenders
where the aggregate embedding tables are
terabytes. They don't fit on one GPU or
even one CPU node. And so you wind up
having a communication of taking
the words you're trying to
look up and accessing the
embedding tables for them. There's often
a reduction within the embedding table
and then a communication back.
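The two ways of slicing the tensor described earlier can be sketched in plain Python for a single layer Y = GeLU(XA) split across two simulated GPUs. This is a minimal illustration, not NVIDIA's implementation: the matrices, helper names and the two-way split are assumptions, and X is simply copied to both sides in the column-split case, matching the broadcast the speaker describes.

```python
import math


def gelu(v):
    """Exact GeLU: v times the standard normal CDF of v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matmul(x, a):
    """Plain-Python product of x (m x k) and a (k x n)."""
    return [[sum(x[i][k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))] for i in range(len(x))]


def elementwise(f, m):
    return [[f(v) for v in row] for row in m]


def close(p, q, tol=1e-9):
    return all(abs(u - v) <= tol for r, s in zip(p, q) for u, v in zip(r, s))


# Illustrative values only.
X = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.5, -1.0, 2.0]]
A = [[0.2, -0.1, 0.4, 0.3],
     [0.5, 0.1, -0.3, 0.2],
     [-0.4, 0.6, 0.1, -0.2],
     [0.3, -0.5, 0.2, 0.1]]

reference = elementwise(gelu, matmul(X, A))

# Split A column-wise and copy X to both GPUs: each GPU applies GeLU to its
# own output columns, and the results are just placed side by side.
y_gpu0 = elementwise(gelu, matmul(X, [row[:2] for row in A]))
y_gpu1 = elementwise(gelu, matmul(X, [row[2:] for row in A]))
column_split = [r0 + r1 for r0, r1 in zip(y_gpu0, y_gpu1)]

# Split X column-wise and A row-wise: each GPU holds only a partial sum, so
# the partials must be added (a synchronization point) before the GeLU.
p_gpu0 = matmul([row[:2] for row in X], A[:2])
p_gpu1 = matmul([row[2:] for row in X], A[2:])
summed = [[u + v for u, v in zip(r0, r1)] for r0, r1 in zip(p_gpu0, p_gpu1)]
row_split = elementwise(gelu, summed)

print(close(column_split, reference), close(row_split, reference))
```

Both splits reproduce the unsplit result, but only the column split lets each GPU run its GeLU without waiting for the other, because GeLU of a sum is not the sum of the GeLUs.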
So how do we meet this need today?
Today at NVIDIA we offer something
called a DGX SuperPOD. So let me walk
you through a SuperPOD, starting at
the individual component and working
upward. So the individual GPU is a
Hopper H100. The GPU chip itself is
this little rectangle in the middle.
These six darker rectangles around it
are stacks of HBM3 memory. There's an
aggregate 94 gigabytes of HBM3 memory
with 3.4 terabytes per second of
bandwidth into the GPU. It's an enormous
amount of bandwidth. Everybody always
thinks these things around the SXM
module are memory chips. These are
inductors, delivering the power to this
thing. And, you know, 700 watts at
about 0.7 volts is a kiloamp of current,
and doing that efficiently is actually
quite a challenging technical problem.
There are a lot of neat features on
the H100, like the transformer engine to
facilitate the use of
reduced precision in running modern
transformer models. We actually have
dynamic programming instructions for
bioinformatics. But for the purpose of
this discussion on networking, those
aren't particularly relevant. The way
to think about this is it's a component
of our system that delivers four petaflops
of sparse FP8 performance and 900
gigabytes per second of external
bandwidth (that's the bandwidth of the
NVLinks coming out of this card) at
700 watts. So the next step up in
building the DGX SuperPOD is we take
eight of these, stacked here on
the board, and we put them in a system
along with four third-generation
NVSwitches. Each of
these GPUs has
18 NVLink channels coming out of it.
They're spread five
and four, five and four across the
NVSwitches, and then those are connected to
the back panel. So there are actually
18 NVLinks that come out of the back
panel. And so one way to
think about it is that it's an 8:1 taper at
this level between the bandwidth you have
within these eight GPUs and the next
level of your network hierarchy.
This is the wiring diagram, and the
main reason I'm showing you this is just
to show that there are two separate
networks here. The H100s are
the GPUs here, and the NVLinks on the
H100s connect down to the
NVSwitches. Each of
these is four and five, four and five
going across, so that
each H100 can have
full bandwidth, all 18 of its links,
talking to any of the other H100s
within its cluster.
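The link counts quoted above can be checked with a few lines of arithmetic. This is a sketch that takes the talk's figures at face value (18 links per GPU, a 5/4/5/4 spread, an 8:1 taper, 900 GB/s per GPU); the constant names are made up for illustration, and the per-switch counts assume every GPU uses the same 5/4/5/4 pattern.

```python
GPUS_PER_BOX = 8
LINKS_PER_GPU = 18
LINKS_TO_EACH_SWITCH = [5, 4, 5, 4]   # one GPU's links across the four NVSwitches
GPU_NVLINK_GBYTES_PER_S = 900         # per-GPU NVLink bandwidth quoted in the talk
TAPER = 8                             # 8:1 taper at the box boundary, as quoted

# The 5/4/5/4 spread must account for all 18 links of a GPU.
assert sum(LINKS_TO_EACH_SWITCH) == LINKS_PER_GPU

per_link_gbytes_per_s = GPU_NVLINK_GBYTES_PER_S / LINKS_PER_GPU
links_inside_box = GPUS_PER_BOX * LINKS_PER_GPU
gpu_links_per_switch = [GPUS_PER_BOX * n for n in LINKS_TO_EACH_SWITCH]
links_leaving_box = links_inside_box // TAPER

print(per_link_gbytes_per_s)   # 50.0 GB/s per NVLink
print(links_inside_box)        # 144 GPU-to-switch links in one box
print(gpu_links_per_switch)    # [40, 32, 40, 32]
print(links_leaving_box)       # 18, matching the back-panel count in the talk
```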
And if you excluded everybody
else, it could actually get all the
bandwidth out the back panel as well.
But there is an 8-to-1 taper at these
NVSwitches, and that goes to the
18 OSFP connections for NVLink out the
back panel. Then the PCIe connections out
of the GPU go to ConnectX-7
NICs via a PCIe switch, and those then
connect to four OSFP connectors, two
CX7s on each OSFP, to build the
InfiniBand network. So there are two
separate networks, and we tend to think
of the NVLink network as a
scale-up network and the InfiniBand
network as a scale-out network. Both
of these networks support SHARP
acceleration. The NVLink SHARP
acceleration is such that,
taking the N reads
that we would have to do with the
read-and-reduce on A100, on an H100 with
SHARP it's basically one read. We
basically do, you know, N reads to send
the partials, the switch does the sum, and
then we read the result that we want.
And the broadcast result works in kind
of a similar way: we have to do only one
write, and then we get N writes out of
the NVSwitch.
Basically, it's a 2x for the
all-reduces, which are a huge part of the
deep learning workload. This is a
2x reduction in demand, or the other
way to think about it is a 2x increase in
the effective bandwidth we have. So
we take those same NVSwitches that
we used in the SuperPOD and
we put two of them in a pizza box, and
that basically is the next level of
interconnect for NVLink, the scale-up
network. It winds up having
128 ports coming out the front;
those are spread across the 32
OSFP cages, and an enormous amount,
you know, 6.4 terabytes per second,
of bandwidth. And that lets you
build up and out depending on how you
want to do it. Each of the boxes
here is eight GPUs, a DGX box.
You have 40 per rack. And then
you can either connect those up with
NVLinks up to 256 GPUs, 32 of those
boxes, or InfiniBand up to tens of
thousands. And the InfiniBand
director switches are sort of shown here
in the middle of this particular configuration.
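The switch-box and domain-size figures above are consistent with the per-GPU numbers given earlier in the talk, as this short sketch shows. It assumes every switch port carries one NVLink at the per-link rate implied by 900 GB/s over 18 links; that assumption and the constant names are illustrative, not taken from a specification.

```python
# Per-link rate implied by the earlier per-GPU figures: 900 GB/s over 18 links.
PER_LINK_GBYTES_PER_S = 900 / 18

SWITCH_BOX_PORTS = 128        # NVLink ports on the front of the switch box
OSFP_CAGES = 32               # cages those ports are spread across
GPUS_PER_DGX_BOX = 8
DGX_BOXES_PER_NVLINK_DOMAIN = 32

switch_box_tbytes_per_s = SWITCH_BOX_PORTS * PER_LINK_GBYTES_PER_S / 1000
ports_per_cage = SWITCH_BOX_PORTS // OSFP_CAGES
gpus_per_nvlink_domain = GPUS_PER_DGX_BOX * DGX_BOXES_PER_NVLINK_DOMAIN

print(switch_box_tbytes_per_s)   # 6.4 TB/s, the figure quoted in the talk
print(ports_per_cage)            # 4 NVLink ports per OSFP cage
print(gpus_per_nvlink_domain)    # 256 GPUs
```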
And there are some real reasons to
want to do this on the NVLink side.
From the programming system point of
view, the NVLink network is a load/store
network. So on a given GPU, once
you've set the memory maps up, I can do
a load or store operation into the
memory of any GPU on the NVLink network.
And that just simplifies programming
compared to having to marshal your
data into message buffers and make the
MPI calls and everything else you have
to do over on the InfiniBand
side. But there's also just a lot more
bandwidth here. In
bisection bandwidth, there's nine times as
much bandwidth on the
NVLink network, and 4.5 times as
much reduction bandwidth. So
whether it's on HPC applications, AI
inference running up to 30x
speedups, or AI training with up to 10x
speedups, there are big advantages to
that. That communication bandwidth and
bisection bandwidth turn into scaling.
For weak scaling, if we increase the
size of the model in these large
language models, you can start
at a billion parameters
and run up to a trillion parameters, and
across that range we get perfectly
linear speedup as we increase the model
size and increase the number of GPUs.
What's more impressive is we also get
very close to linear strong scaling. If
we hold the model size constant, the
GPT-3 175
billion parameter model, and we scale
it from 64 GPUs to 2K GPUs, it's nearly
linear speedup. And that is a
testament to the very low overhead
of scaling on these. So just to
sum up the current state
of affairs with the DGX SuperPOD:
we have these wonderful GPUs,
the H100s, each of which has 18
NVLink ports coming out of it.
The DGX box takes eight of
those and four switches and does an
8-to-1 reduction. You can then hook up to
32 of those together to make a
256-GPU system that looks like a big GPU,
and if you are willing to pay for
the switches, you can hold that 8-to-1
taper so that the bisection bandwidth is
1/8 of the aggregate bandwidth
out of the GPUs. Above that,
you hook it up with InfiniBand switches
and the InfiniBand network, and you can
scale up to tens of thousands of GPUs,
and the demands of the large
language models require that. We see many
customers training on tens of thousands
of GPUs, and many
of the problems you have with
a large scientific computer, having to
do checkpoint restart and the
like, happen at that scale as
well. In fact, many of these machines
would be number one on the Top500 list
if people would take them away from the
profit-making operations that they're doing
long enough to run HPL on them.
So that's where we are today, and
I should say one thing about the DGX
SuperPOD: compared to
trying to acquire the
components of the system and put it
together and bring it up, which is the
way most large supercomputing
acquisitions are done, everything is
preconfigured for the DGX SuperPOD. So
if you buy a DGX SuperPOD, you buy the DGX
boxes and the NVLink switches and the
InfiniBand switches, and you plug it
together and it just works. All the software
is already done. It's been debugged, and
the bring-up is a day and not a number of
months. So in terms of looking forward,
let's start with the physical layer. And
I'll start with a cartoon of
how one of these systems looks to the
logic designer. So on the chip,
we've got a bunch of logic that talks to
other logic over 2 millimeter links.
We've got our links on the
interposer out to the DRAM, which
talks over very short links as
well. And that's one GPU.
And then we may have multiple GPUs, like
in the DGX box. That's on a printed
circuit board where the connections are
now 30 to 50 centimeters,
going tens of terabits per second,
which is a big reduction in
bandwidth, I'll point out, compared to
what you had on the GPU. And the
current electrical interconnect here is
around five picojoules per bit. When we
need to go board to board, we're now
going one to three meters. We've had
another reduction in bandwidth, and
we're still electrical at about five
picojoules per bit. We tend to go optical
on these cabinet-to-cabinet links.
They're 5 to 100 meters, which is too far
for an electrical link unless you repeat
it every meter or so. And the
power is up, but more importantly the
cost goes way up here. The
optical links cost 10 times as much
per unit bandwidth these days as the
electrical links. And this is worrisome
to us because the electrical links
aren't getting a whole lot better with
time. The chart on the left here
shows that transistors, well, you get
more of them as you move from, say,
the 16 nanometer node down to the five
nanometer node, but they're not getting
faster. I mean, the fan-out-of-four
inverter delay has been stuck at about
10 picoseconds since
the 16 nanometer
technology that we used way back on, I
don't remember which GPU that is; Pascal
or something was probably our first 16
nanometer GPU. And then what's
also more worrisome is that as we push
our electrical links (we now have
electrical links going 200 gigabits
per second per pair), the reach gets
reduced, right? We could go 2 meters
when it was 100 gig. We can now only go
one meter at 200 gig. And so it means
that we have to jump over to optical
signaling earlier. It also puts a big
amount of pressure on us to put a lot of
GPUs close together so we can connect
them electrically. And some
customers kind of push back on this,
because we tell them, you know, we want
to build a 200 kilowatt rack, right? And
if you look at the big
supercomputers, that's what Frontier is.
And if you build a 200 to 300 kilowatt
rack, you jam a lot of GPUs into a small
space, you can connect them all
electrically. You can get to that 256
node NVLink network electrically, but
the person just needs to provide
you with cold water right at that rack
and liquid cool the thing.
That's pushing back against sort of a
data center culture where
people like 30 kilowatt racks and they
think 40 is a stretch. Actually, some
of them like even less than 30. Anyway,
if you look at the physical layer,
the figures of merit are power, cost,
density and reach. And then if you
sort of bring all the students
into the classroom, you give them the
test and you get the report card,
this is what it looks like. And the
real thing to do here is to compare the
electrical cable to the active optical
cable. What you see is that
power is about the same, a
little bit more expensive for the AOC,
but not that much. The big issue
here is the cost. The density is
better on the active optical cable and
the reach is way better. But to get that
density and reach, you've got to pay 10x as
much per gigabit per
second. And, you know, we have to
find some way of dealing with that. And
so, one thing we're very excited about
is co-packaged optics using dense
wavelength division multiplexing,
because our estimate is that if it's
successful, it will be even denser than
the current active optical cables, at a
cost (this may be a little bit
optimistic) within a factor of two
of electrical signaling, and a
power substantially less than the
electrical signaling, and a reach
that's the same as the active optical
cable. Now I should point out that
when you look at these figures of merit,
especially the ones of power
and cost, you have to consider the whole
link. I very often talk to
technology providers and they tell me
about this wonderful technology that
they have. It's almost no picojoules per bit,
but they're only looking at the actual
transmitter and receiver. They're not
counting the
serialization and deserialization,
the clock recovery, and the
actual supply laser. So, you have to
look at the entire system. And one
thing I found with some
of our links is sometimes
you wind up with overly enthusiastic
designers who overdesign the link layer.
You have a very reliable
physical layer out here that
doesn't require enormous amounts of
error correction and all sorts of
stuff. But, you know, they like to
overbuild things, and so
you'll have a link that's
consuming, you know, a
tenth of a picojoule per bit and they'll
wind up putting a two picojoule per bit link
layer in front of it, which kind of defeats
the whole thing. So let's talk a
little bit about optical signaling. A
system concept we've been playing with
is to deliver switch cards that
basically have a GPU switch, like
the NVLink switch or one of our
InfiniBand switches, with co-packaged
optics that basically come out in
pigtails, and you can basically
connectorize them out to a front
panel with a bunch of fiber
connectors, fiber ribbon connectors,
and a GPU card with one or maybe a
couple of GPUs on it with co-packaged optics
for the NVLinks, bringing those out to
a connectorized panel. You would then
package these up. So you would have a
GPU rack with all these GPU cards and a
switch rack somewhere with all these
switch cards. And because these are
now optical cables and the
reach is 100 meters, there's not this
huge pressure to build
the 200 kilowatt rack. You
can spread them out a little bit,
although you don't want to do that
gratuitously, because latency is
badness, right? And if you make these
cables long, the latency is going to get
larger. What this looks like at
the next level is a laser comb
source. We're hoping that quantum dot
sources will eventually become
widely available, but right now
these are usually DFB arrays. Then
you have a supply fiber that brings
the comb into the
transmitter chip. And the transmitter
chip has a bunch of ring modulators that
basically either pass or attenuate each
line of the laser comb. And they do
that at a moderate modulation rate.
We're actually looking across the
design spectrum, from having
a few lines with very high modulation
rates to lots of lines with lower
modulation rates. So the modulation rates
we're looking at range from like 25
gigabits per second up to 200 gigabits
per second. Whatever trade-off you
make there, what we want to do is
have one extra line compared
to the data and forward the clock,
because then we don't have to recover
the clock on the far side. Here we
have a ring modulator we use to
pass only a selected line to a receiver,
where we have a diode and a
transimpedance amplifier that
basically converts that signal back
to electrical.
I think I basically said most of what
is on this slide. Now here's sort of
a cross-section picture of what that
looks like. The switch
chip sits on an interposer. On
that interposer we then have a photonic
integrated circuit that has the optical
components, that is,
the rings, the
waveguides and the couplers, and then we
have an electronic integrated circuit
that sits on top of that, which basically
takes the very short reach link
from the switch and drives the
modulators for the rings, and
basically has the transimpedance
amplifiers and serializers and
deserializers. On the receive side,
essentially the same thing happens on
the co-packaged GPU. It'll be the same EIC
and the same PIC, but just a single
GPU driving it rather than
a switch, which will have way more
bandwidth coming out of it.
The power budget we're currently looking
at for an early prototype is about
3.5 picojoules per bit, the bulk of which is
the laser. And I have to
say this is wall plug power, right? So
if you have a 5% efficient laser, you
have to put a bunch of picojoules per bit
in to get not so many coming out.
Then the EIC, that's basically
the serializers, deserializers, modulators
and the like, takes the bulk of the rest.
And the optical part is hugely
efficient: once
you have a comb to modulate and the
electrical signals to drive those rings,
they don't take very much power at all.
I think we can actually do better.
Right now we're budgeting 250 femtojoules
per bit for this electrical link
from the host to the TX and RX, and
I think we're actually going to be much
more likely in the 100 femtojoules per
bit there. Now the link budget
we're playing with: after that 20-to-one,
5% efficient laser, we have about
3 dBm coming out of our laser. We
lose a bunch at each of these couplers
every time we go in and out of a chip,
so that we're down to 2.5
going in here. We lose some in the
transmit chip, so we're coming
out of this transmit chip at minus
4 dBm, and then, a couple of
couplers later, into the receive chip
at almost minus 7. After we get across
the rings, we're at minus 9,
and then at the photodetector input
minus 12, which gives us a 2 dB margin,
but we would like a much bigger margin
than this. One reason is
we'd like to be able to use these with
optical circuit switching, where the
insertion loss of that optical circuit
switch is bigger than 2 dB, right? So
we need to be able to tolerate more loss
on the link. So, we're working on
trying to make every part of this a
little bit tighter and get a little bit
more link margin. In one of our first
prototypes, we're doing a 400 gigabit
per second per fiber, um, which is, um,
25 gigabits per second per um, color per
polarization, eight channels, um, 100
gigahertz channel spacing. We're looking
to scale this up to 800 and 1.6
terabytes per fiber by doing higher
bandwidth and and more channels. Um we
built a number of test chips over the
past couple years starting with um the
the numbers on these are usually the
year that we did them. RPC19 was done in
2019. You know testing micro rings um
you know doing different couplers doing
different um you know wave guides doing
different receive architectures and
we're very close to um you know putting
out a uh you know a full link that we we
hope we'll have operating in the very
near future. So that's a physical layer.
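The link budget quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the power levels are the ones stated in the talk, but the receiver sensitivity of -14 dBm is an assumption inferred from the quoted 2 dB margin, and the 3 dB optical circuit switch loss is a hypothetical figure chosen to show why a larger margin is wanted.

```python
# Optical power (dBm) at each point on the link, as quoted in the talk.
STAGES_DBM = [
    ("laser output", 3.0),
    ("into transmit chip", 2.5),
    ("out of transmit chip", -4.0),
    ("into receive chip", -7.0),
    ("after the rings", -9.0),
    ("photodetector input", -12.0),
]

# Assumption: a 2 dB margin at -12 dBm implies a sensitivity of -14 dBm.
RX_SENSITIVITY_DBM = -14.0


def stage_losses(stages):
    """Return (stage name, loss in dB) for each step between adjacent points."""
    return [
        (name, prev_power - power)
        for (_, prev_power), (name, power) in zip(stages, stages[1:])
    ]


def link_margin(stages, sensitivity_dbm, extra_loss_db=0.0):
    """Margin in dB left at the photodetector after any extra insertion loss."""
    return stages[-1][1] - extra_loss_db - sensitivity_dbm


if __name__ == "__main__":
    for name, loss in stage_losses(STAGES_DBM):
        print(f"{name:22s} loss {loss:4.1f} dB")
    print("margin:", link_margin(STAGES_DBM, RX_SENSITIVITY_DBM), "dB")
    # Hypothetical 3 dB optical circuit switch in the path.
    print("with 3 dB OCS:", link_margin(STAGES_DBM, RX_SENSITIVITY_DBM, 3.0), "dB")
```

With these numbers, any switch whose insertion loss exceeds 2 dB pushes the margin below zero, which is the reason given for tightening every stage.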
Um what do you do with that physical
layer? You build a topology, and it still
amazes me that most people build folded
Clos topologies that are often called fat
trees. But Clos wrote his paper in the
60s, Leiserson wrote his in the 80s, so
sort of a 20-year credit to Clos. And
they're just inefficient because
they sort of assume the worst case.
you have to sort of route up and then
route back down instead of going exactly
where you want to go. Where the whole
idea of a dragonfly network is in the
good case you can do one long hop. You
only need to to ever uh you know go
across one expensive optical link
assuming you can wire the groups
electrically. Um and as a result of that
um using the assumptions that were in
the original um you know paper I'm
trying to remember you know when when
this came out of 2008 um you it's you
know $80 per endpoint rather than60
those constant numbers have changed but
the ratio hasn't um the and the reason
you get that is that it's um you know a
much lower diameter um and you have
essentially you know lower cost for the
same bisection bandwidth. The challenge
here is um it's really easy to route,
you know, a a fat tree. You basically
just route obliviously up, you know, you
can even route randomly up until you get
to a point where you can see your
destination and then you route, you
know, directly down to that destination.
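The up-then-down routing just described can be sketched in a few lines. This is a simplified model, not any particular product's routing: endpoint addresses are assumed to be tuples of base-k digits, the number of levels to climb is taken from the shared address prefix, and the function name `updown_route` is hypothetical.

```python
import random


def common_prefix_len(a, b):
    """Number of leading digits that two endpoint addresses share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def updown_route(src, dst, k, rng=random):
    """Return the hop list for src -> dst in a simplified k-ary tree.

    Addresses are equal-length tuples of base-k digits, most significant first.
    """
    if len(src) != len(dst):
        raise ValueError("addresses must have the same number of digits")
    levels = len(src) - common_prefix_len(src, dst)
    # Up phase: any up port works, so pick one at random (oblivious routing).
    up = [("up", rng.randrange(k)) for _ in range(levels)]
    # Down phase: deterministic, steered by the destination digits.
    down = [("down", d) for d in dst[len(dst) - levels:]]
    return up + down
```

For example, `updown_route((0, 1, 2), (0, 2, 1), k=4)` climbs two levels through randomly chosen ports and then descends through ports 2 and 1. No network state is consulted, which is what makes the fat tree easy to route.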
Um routing a um dragonfly correctly
requires global adaptive routing. There
are otherwise traffic patterns that will
completely bring it to its knees. Um and
it's also very sensitive to congestion.
Um, so you need to have some sort of
congestion avoidance mechanism if you're
going to use um a dragonfly.
Um, now one of the classic problems with
the dragonfly is that building them out
would require recabling, right? If
you had two groups and you wired them
together and then you added a third
group, now you have to take a third of
the connections from each of these two
groups and rewire it to that group. Um,
this problem goes away with optical
circuit switching. It also solves the
problem of partitioning um the dragonfly
up. Um if you have multiple users on it,
you can now give somebody, you know, if
say have 16 groups, give somebody eight
groups, somebody else four, somebody two
and and so on. And then all of their
connections don't have to go through
somebody else's group. They can all be
directly connected to each other. And
you can solve a reliability problem this
way by having a spare group and swapping
a spare group in making the group the
field replaceable unit. So dragonflies
and optical circuit switches are a
really nice combination. So, so
much for topology. Now that we have our
dragonfly, how are we going to route on
it? Now um it turns out that you get
congestion in two ways. One is you get
core congestion. Say when every um
endpoint, every GPU in this group wants
to talk to this group, they all try to
route minimally. They all want to use
this link and that will congest this one
global link. Um, you can also get
endpoint congestion. If everybody in
this group wants to talk to this one endpoint,
um, they're going to congest the
endpoint. Um, and these are two very
different problems, right? The the
problem on the left, um, of having, you
know, one group all going to another
group has to be solved by routing. It
requires global adaptive routing to
spread the load over the global links by
routing some of them non-minimally. The
problem on the right of endpoint
congestion requires source throttling.
Right? There's no way of getting more
over this endpoint than what its
bandwidth is. Right? If you're trying to
send three times that much bandwidth,
it's going to all back up and cause tree
saturation.
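The difference between the two kinds of congestion can be put in numbers. The sketch below is a back-of-the-envelope model under stated assumptions: rates are normalized so that one link or endpoint has bandwidth 1.0, traffic is assumed to spread evenly over the available paths, and both function names are hypothetical.

```python
def global_link_load(senders, rate_per_sender, paths):
    """Load on each global link when traffic spreads evenly over `paths` routes."""
    return senders * rate_per_sender / paths


def throttled_rate(senders, offered_rate, endpoint_bandwidth):
    """Per-sender rate that keeps total demand within the endpoint's bandwidth."""
    fair_share = endpoint_bandwidth / senders
    return min(offered_rate, fair_share)


if __name__ == "__main__":
    # Core congestion: 8 GPUs all routing minimally over one global link.
    print(global_link_load(8, 1.0, paths=1))   # overloaded by 8x
    # Spreading over 8 routes fixes it; routing is enough here.
    print(global_link_load(8, 1.0, paths=8))
    # Endpoint congestion: 3 senders at full rate into one endpoint.
    # No routing helps; each source has to slow to a third.
    print(throttled_rate(3, 1.0, 1.0))
```

The model ignores the fact that a non-minimal route crosses two global links instead of one, so it understates the cost of spreading the load.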
Um so let's talk about the global
adaptive routing um problem first. So
say everybody in this um group, the
orange group here, wants to talk to
everybody in the green group. The simple
solution is one of them takes this link.
He gets to go minimally and then one of
them goes to the red group and one of
them goes to the blue group and and they
take the hop from the red group and the
blue group um to the green group. But
doing this requires making a decision to
route non-minimally. And it turns out
there are very good ways to do that
that ultimately all spring from Arjun
Singh's work in his PhD thesis on
global adaptive routing. The key of
global adaptive routing is it's not the
original
adaptive routing where people would sort
of opportunistically
go to a switch and make a local
decision to take an idle output port.
It's a global decision here, where you're
choosing between routing minimally, as
the one on the right did here, or routing
non-minimally, as the other two did. And
when you route non-minimally, then what
you're really doing is you're basically
reverting to Valiant's algorithm for the
global part of your route by randomly
picking a group to route through.
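One way to picture the minimal versus non-minimal decision is a rule that compares estimated delay on the two candidate routes, taken here as queue depth times hop count. The sketch below is an illustration under that assumption, not the routing used in any specific system: the `queue_depth` dictionary, the function names, and the choice to count only global hops are all hypothetical simplifications.

```python
import random


def choose_route(q_min, hops_min, q_val, hops_val):
    """Pick 'minimal' or 'valiant' by comparing estimated delay (queue x hops)."""
    if q_min * hops_min <= q_val * hops_val:
        return "minimal"
    return "valiant"


def route_packet(src_group, dst_group, num_groups, queue_depth, rng=random):
    """Return the list of groups a packet visits.

    queue_depth[(a, b)] is the occupancy of the global link from group a to b.
    """
    if src_group == dst_group:
        return [src_group]
    # Valiant's algorithm: pick a random intermediate group.
    candidates = [g for g in range(num_groups) if g not in (src_group, dst_group)]
    if not candidates:
        return [src_group, dst_group]
    mid = rng.choice(candidates)
    q_min = queue_depth.get((src_group, dst_group), 0)
    q_val = queue_depth.get((src_group, mid), 0)
    # One global hop when minimal, two when going through an intermediate group.
    if choose_route(q_min, 1, q_val, 2) == "minimal":
        return [src_group, dst_group]
    return [src_group, mid, dst_group]
```

With empty queues every packet goes minimally; once the direct global link backs up, packets start taking the two-hop route through a random group, which spreads the load over the other global links.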
And there are good ways of doing this.
Um, and it can be done in a very
successful manner. At this point
I would say it's known technology. Um,
then we get to the issue of flow control,
and I tend to agree with Fabrizio. I
think all of these aspects, the
topology, the routing, and the flow
control are all important, but the flow
control is often what really determines
um you know what fraction of your
network you really get to use. You'd
like that fraction to be close to 100%.
For a lot of networks, unfortunately,
it's like 30 or 40%. Um so, so here are
some thoughts on flow control. So, the
first thought is, and this sort of
relates to the coupling between
all those elements, coupling between
routing and flow control: I may have a
single GPU with 18 NVLinks coming out
of it and I want to send one really big
flow. And I don't want to serialize
it over one of those 18 NVLinks. I want
to use all 18 NVLinks. So even right
coming out of the GPU, I cannot run a
flow all over the same path. That is not
a realistic thing to do. So if I'm going
to be running a big flow over 18 um
things coming out of here, they're going
to be going different ways, right? when
they get to the first switch, they may
choose to go in different routes and
they're already on different channels.
So, you're not going to keep ordering by
trying to force everything on one route.
So, you know, a single flow can use all
outgoing links and use lots of routes
through here. You need to reorder at the
far end. One thing that you want to do
to keep the reorder buffers small is
keep the worst-case latency low. And a
very simple way of keeping the worst-case
latency low is to have a very
short time to wait. That's like time
to live, but you don't count the
productive time. If you're actually
moving through a switch or over a link,
that doesn't count. What counts is the
time you spend waiting. And if you wait
500 nanoseconds, something's wrong. Drop the
packet. Um, and I'll explain a little
bit later on congestion control how we
avoid just dropping every packet under
under these circumstances. Um, and um,
another thing to realize is I've got
18 links coming out of this GPU.
Um, that's a horrible thing to waste.
That's the first level of my switch.
That should be an 18-way switch. That
should be enough to connect up perhaps,
you know, a reasonable size dragonfly
group um where then um you know the the
you know the switches out of that can
can take that 18-way expansion and
expand it further to get to an even
larger reach. When you start looking at
at what you need in terms of buffers, it
really has to do with with two two
delays. You need a buffer at the
source to handle retransmits, right?
So, if I do drop the packet, I've got to
have the packet back here so I can send
it again when I get the NACK that says
I dropped it, or a timeout. The NACK
tends to be a little more efficient. And
I need a comparable size buffer at the
receiving end to reorder. Um, if you
know, one packet gets through and then
and then another packet of the same
message gets dropped, I need to wait
until um, you know, that it comes in so
they can be received in in order. And
these are going to be based on, you
know, say you have a 200 Gbit per second
link: that link rate times the
end-to-end round-trip time, which I label
here for end-to-end round trip, and
which we'd like to bound at about four
microseconds. That's not much; that's
100 kilobytes for each link.
That's an amount of SRAM much
smaller than the PHY for the links.
That's a reasonable number. Um, the
other buffer that I have is the link buffer,
and the link buffer should not be big.
All that does is delay your discovering
that you have a problem,
because, you know, people are accepting
your packets and just stuffing them in
the buffer. Um if the if the packet
blocks you know there's no point in in
delaying your knowledge that you you've
run into some problem. So here it's
going to be the round-trip time over a
fiber, which is about, you know, 100
nanoseconds. So these are even smaller
buffers, and so even if you have a switch
with a lot of bandwidth, say
128 of these 200 gigabit per second
links, it's still a tiny
amount of SRAM that you really
need in buffering in the switch.
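The buffer sizes quoted above are bandwidth-delay products, and the arithmetic is easy to reproduce. The sketch below uses only the figures from the talk (200 Gbit/s links, a 4 microsecond end-to-end round trip, a 100 nanosecond fiber round trip, 128 ports); the function name is hypothetical.

```python
def bdp_bytes(link_gbps, round_trip_ns):
    """Bandwidth-delay product in bytes for a rate in Gb/s and an RTT in ns."""
    bits = link_gbps * round_trip_ns  # Gb/s x ns = bits
    return bits / 8


if __name__ == "__main__":
    # Retransmit buffer at the source, and reorder buffer at the receiver.
    print("end-to-end buffer per link:", bdp_bytes(200, 4000), "bytes")
    # Link-level buffer sized for one fiber round trip.
    print("link buffer per port:", bdp_bytes(200, 100), "bytes")
    # Total link buffering in a 128-port switch.
    print("128-port switch:", 128 * bdp_bytes(200, 100), "bytes")
```

That works out to 100 kilobytes per link at the endpoints, 2.5 kilobytes per switch port, and 320 kilobytes for the whole 128-port switch.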
so so the key here is um is basically to
make the expected case that you fly over
um links and through switches and never
wait for anything. And you do this by
dropping packets as soon as they start
waiting very long. Now, how do you get
around um not dropping all your packets?
And that really falls under the heading
of of congestion avoidance. Um and this
is remember the case when all these guys
want to send here, right? If you just
let them do that, they wind up
saturating even if they globally
adaptively route saying, "Oh, we're all
going to the same group. Let's go
different ways." They're going to get
here and it's going to back up, and
because you can't deliver that, it's
then going to back up into the
network, a phenomenon called tree
saturation, and slow down everything in
the network. Now the traditional way
people have dealt with congestion
avoidance um has has been um things
things like ECN where you wait till it
happens and then you tell people oh
you've congested the network please slow
down and and you have some kind of you
know you ramp down on on the uh source
nodes that are causing the congestion
and then when the congestion is relieved
they have some way of ramping ramping
their um production back up. The problem
with that is you've already done a lot
of damage. Um so what's much better is
to use a reservation protocol right
where to avoid congesting the network
you you would basically say okay um I
want to use this endpoint I'm going to
send that endpoint a request for
reservation and then he'll say okay
here's your reservation. It's just like
going to a restaurant, right? If you make
a reservation for a table, you get there,
you sit down, you don't wait. If
they can't take you at 6:00, they'll say, "I
can take you at 7:00." You don't bother
leaving your house until, you know, 5
minutes before 7:00, so you get there
at 7:00.
Your packets should be doing the same
thing. Now the the thing you might point
out is that gee I have to send a message
to the destination. Um you know you
might as well piggyback some data on
that. And so a thing that Ted Jiang
came up with in his PhD thesis was um
the speculative reservation protocol
where you basically send the packet um
speculatively. And so the dotted line
here is a speculative packet. And if
he gets to the destination,
basically, you're done. But if
he gets close and his time to
wait runs out and you drop the packet, a
NACK gets sent back, but a forward
header continues on and actually
sends the reservation
back to the source. And this
way, if you get lucky
and you get through the network, you
actually have piggybacked some data on
the request for reservation. If you
get blocked in the middle, that
NACK comes back and then you wait until
your reserved time, and then you're
guaranteed to be received because you
haven't oversubscribed
the output bandwidth at that point in
time. Um, since this
paper was published in in 2012, there
have been a couple other similar u
protocols published that are essentially
based on on reservations. And if you use
a reservation protocol whether it's
speculative or not I mean what the
speculative protocol buys you is in the
case when the network is not loaded you
don't have to pay two round trips to get
there you know one round trip to uh you
know get the reservation and another one
way to deliver the packet and the
remainder of that round trip for the
ACK. Um, now the real problem,
remember, is usually an oversaturated
link into a destination. And if you have
networks with with very fine grain
transactions, what you really want to do
is um um take advantage of the fact that
the last hop before that um link is
where you want to make the reservation.
So if you're going to be
dropping that speculative packet, you're
going to be doing it at the last-hop
switch, and if you do so, then
that last-hop switch makes the
reservation for you, rather than
having to go all the way to the
destination to get that reservation.
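The combination of time to wait, speculative sending, and last-hop reservation can be sketched as a small model. This is a toy illustration under stated assumptions, not the published protocol: the class and function names are hypothetical, the last-hop switch is assumed to keep one reservation table per endpoint link, and speculative packets that do get through are not charged against the table.

```python
TIME_TO_WAIT_NS = 500  # drop threshold quoted in the talk


class ReservationTable:
    """Hands out non-overlapping time slots on one endpoint link."""

    def __init__(self):
        self.next_free_ns = 0

    def reserve(self, now_ns, duration_ns):
        """Return the start time of the earliest slot at or after now_ns."""
        start = max(now_ns, self.next_free_ns)
        self.next_free_ns = start + duration_ns
        return start


def send_speculative(now_ns, queued_wait_ns, duration_ns, table):
    """Model one speculative packet arriving at the last-hop switch.

    queued_wait_ns is the time the packet has spent waiting (not moving).
    Returns ("delivered", time) or ("nack", reserved_start_time).
    """
    if queued_wait_ns <= TIME_TO_WAIT_NS:
        # Got through: the data rode along with the request, nothing more to do.
        return ("delivered", now_ns)
    # Time to wait expired: drop the payload, reserve a slot, NACK the source.
    return ("nack", table.reserve(now_ns, duration_ns))
```

A source that receives `("nack", t)` holds its packet until time `t` and then resends. Because the table never hands out overlapping slots, the resent packet does not have to queue behind other reserved traffic.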
So here here's some experimental data on
how this works in practice. Um this is
an an experiment where um we're running
uniform traffic and then we're running a
hot spot of aggressors overloading
some hotspots by 5x, starting at 2 ×
10^4 cycles into the simulation. Um, and
what you see is um with no congestion
control um your throughput drops to an
abysmal amount of this is sort of the
innocent bystander traffic um you know
getting cut to a quarter of its original
bandwidth. Your latency goes off scale. So
everybody who's suffering here is
sitting in a very long line um and and
things are very bad um and it doesn't
recover. Um, with ECN, one of these
things where you wait for the congestion
to happen and then you recover, you
see that your latencies are going way
up and you say, "Ah, I have congestion.
Let's tell these misbehaving guys who
are oversubscribing the hotspots to slow
down." And you do, and they recover. But
it takes, you know,
40,000 to 50,000 cycles
for that recovery to happen. Actually,
it's more like 100,000
cycles before the latency comes back
down. And, you know, that's not
really tolerable you know in these deep
learning environments where it's the
tail latency you worry about. You you
you worry about the the the latency of
that last packet because there's
synchronization going on here. Everybody
else is is waiting for him to show up.
If you're using the last hop reservation
protocol um it it's not entirely a
non-event. I mean your average latency
goes up and that's because some people
are waiting to deliver to
those 5x oversubscribed nodes. But your
overall throughput doesn't budge. you're
not hurting the innocent bystander
traffic at all. It's just getting
through as if there was no other thing
because you're not tree saturating the
network, discovering that and then
throttling back. Um, so I'm just about
out of time, so let me wrap up here. Um,
AI is everywhere. It's transforming all
aspects of our life. It's very exciting
time technically to see all the, you
know, great applications. Um, this has
been enabled by GPUs and it's currently
being gated by GPUs. There's huge
demand. People would like to build even
larger language models. They'd like to
do all sorts of exciting stuff with AI.
Um, but they're limited. They they the
GPUs are only getting so much faster
each generation. And very little of
that, by the way, is coming from process
technology. Um, and um, you know, we're
having to build larger and larger
clusters to meet that growth in demand.
Um, and you know, one way we can get a
little bit more performance, although
it's a one-time card, is to try to
specialize our networks um, for AI. And
to do this we need to look at how these
AI models are parallelized and
specialize for that. So, you know, for
model parallelism it's all-reduce, right?
You basically compute some parameter
gradients and you want to share those
parameter gradients with every other
copy of the model. That's an
easy one. For tensor parallelism it's
broadcast and all-reduce. And then
there's something called sequence
parallelism that I really didn't have
time to go into, but it really demands a
gather scatter um which which puts real
demands on the network if you're um not
doing it right. Today we have a pretty
effective solution um for AI and in fact
we have a lot of customers who um can
basically just order a DGX SuperPOD off
the shelf and build machines up to
tens of thousands of GPUs using InfiniBand, where
the first sort of 256
are clusters of NVLink networks. And
it's particularly effective to do
sort of model parallelism within those
NVLink networks, where the communication
demands are higher, and then do some
reduction of the bandwidth requirement before
you go out over the InfiniBand, where
the bandwidth tends to be
more expensive. And on both the
NVLink and the InfiniBand networks we
have SHARP collectives, and so things
like all-reduce essentially double our
effective bandwidth.
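One common way to see where a factor of two for all-reduce can come from is to count the bytes each node has to send. The sketch below is a simplified traffic model, not a description of how SHARP is implemented: it assumes a ring all-reduce as the baseline and an idealized switch that performs the reduction, and all function names are hypothetical.

```python
def allreduce_sum(buffers):
    """Reference all-reduce: every participant ends with the element-wise sum."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def ring_bytes_sent(n, size_bytes):
    """Bytes each node sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n - 1) / n * size_bytes


def in_network_bytes_sent(size_bytes):
    """Bytes each node sends when the switch does the reduction."""
    return size_bytes


if __name__ == "__main__":
    print(allreduce_sum([[1, 2], [3, 4], [5, 6]]))
    for n in (8, 64, 256):
        ratio = ring_bytes_sent(n, 1.0) / in_network_bytes_sent(1.0)
        print(n, "nodes: ring sends", round(ratio, 3), "x the in-network traffic")
```

As the number of nodes grows, the ring sends close to twice as many bytes per node as the in-network version, so the same links carry roughly twice the useful all-reduce bandwidth.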
So I gave you some glimpse of the
future, and these are a bunch of things
we're working on in NVIDIA Research. I
should add this disclaimer: this talk is
my own opinion and does not reflect the
opinion of NVIDIA or any future NVIDIA
product plans. I'm personally very
excited about co-packaged optics. We have a
project in NVIDIA Research to do DWDM
co-packaged optics that can really
change that equation, you know, that
bad report card we have for optics right
now, which is mostly bad in the cost
column. We're hoping we can really
drive both the cost and
the power of optical interconnect
quite a bit down by integrating it. Um,
if you do global adaptive routing you
wind up enabling topologies like the
dragonfly um that essentially give you
2x cost performance um twice the
performance for the same cost. Um, but
doing so requires, you know, kind of
being clever and making a decision for
each packet about when it's time to
route globally, basically to misroute, to
route non-minimally to a far group
rather than going to the group that's
your actual destination. Um, I see a lot
of networks today where the worst case
packet delay is quite large. I mean, for
Ethernet networks, it can get into the
milliseconds. And, um, to me, this is a
failure of flow control. And you really
don't want that. You want, you know, an
agile network where a packet either gets
to a destination or you drop it and ask
for a reservation. Um, you should never
have a packet waiting even, you know, a
microsecond in the network. There's
no point to that. You should
either have the the bandwidth reserved
or don't send um don't send the packet.
You know, sending lots of packets and
having them pile up in the middle of the
network is a really clumsy way of
allocating bandwidth. And the way to
do that is to do reservation-based
congestion avoidance, where if your
packet can't get through almost
immediately, you send it to that last
hop, and that last hop
makes a reservation for you,
reserves a certain amount of bandwidth
for you at a point in time, and tells you
when to try again. And then when you try
again you will get right through. So I
think the the future for networking is
really bright. AI has created a huge
demand and a huge demand with very well
understood patterns that can benefit
from a certain amount of network
specialization. So with that, I guess I
have five minutes left if people have
any questions. Let me stop sharing.
Bill, there are some questions in the in
the uh Slack channel. I don't know
whether you do have access. I can
Oh, I'm not on the Slack channel. I was
hoping they'd put them in the Slack. Oh,
it says HotI Slack link. If I click
on this resources thing, do I get that?
Yeah. Or I can read them for you. I
mean, Why don't why don't why don't you
read them? Um
Okay. First of all, we have a
question for CC ICC1.
Um, CeCe is asking about the similarity
between AI/ML training, inference,
blockchain mining, and 5G wireless,
in that they all rely heavily on GEMMs. Okay.
That's an interesting question. I
didn't think blockchain mining
relied on GEMMs. There's a
particular cryptographic calculation
that it's doing. But from the
networking perspective, I think
AI training is the real heavy
network user. Inference
tends to run mostly on one GPU,
although with the big models now, a
single copy of GPT-4
doesn't fit on one GPU. But even then the
communication demands are a lot lower
than for training.
Okay. Then there is another question
from Harsha Cheni and he's asking you
whether you have considered hollow-core
optical fibers, since the higher speed of light
in vacuum gives lower propagation delay.
Oh that's an interesting question.
Right. So instead of getting an index of
you know four or whatever and running at
you know root two speed of light I can
or half the speed of light I could uh I
could run at the speed of light. We
haven't looked at it. It's an
interesting idea. I'll have to talk to
my optical people about that. Yeah, a
question from Sayan which by the way I
can only subscribe is asking about you
call this graph based topologies. What
is your opinion of the practicality of
graph based topologies like jellyfish
uh uh or that it can offer higher
diversity than dragonfly? I mean the way
I would recast this question is that
your interest in low ultra low diameter
networks, right? So they
Yeah. So I I have to confess I I
probably need to keep up with the
literature a little bit more because I
have not heard of what a jellyfish is
other than something in the ocean that
stings you. Um but um I'm always looking for
it is the same for networking.
It's a bunch of connections made
randomly between nodes.
Okay. I I'll have to look that one up. I
uh I'm always looking for better um
better ideas and and trying to feed them
into our product people. So I will try
to find that reference and uh and
understand it but I I don't know what it
is right now.
Yeah. Taylor has another question, on
co-packaged optics channel bandwidth. Do
you see it going beyond 50 Gbit per second NRZ
to PAM4, or will it stay NRZ?
Oh, that's a really good question. So
it's an optimization and and we've been
trying to to look at the trade space
there um about um whether to push
um and actually it's a sort of a
threedimensional trade space because
it's you know what is your baud rate um
how many levels do you have um per color
and how many colors do you have and
there's a sweet spot on the electronics
keeping the baud rate at 50 or
under, because at 50 or under, you can
use some very simple serializers and
deserializers that give you a very low
energy per bit in the
electronics. And that's true
whether you're driving NRZ or PAM4.
And so you'd like to keep
relatively slow to keep the clocking
logic in the electronics low power.
And then the question is, do you drive
more levels, which has some noise and
link margin implications, or do you
stick with NRZ and run more colors? And,
you know, I think the answer there is not
obvious, but I think you do want to stay
relatively low bandwidth to keep the
electronics low power.
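The three-dimensional trade space (baud rate, levels per symbol, number of colors) reduces to a simple product. The sketch below enumerates a few combinations; the candidate baud rates of 25 and 50 are illustrative assumptions read as gigabaud, the 50 cap follows the answer above, and the function names are hypothetical.

```python
import math


def fiber_rate_gbps(baud_gbd, levels, colors, polarizations=1):
    """Aggregate rate: symbol rate x bits per symbol x colors x polarizations."""
    return baud_gbd * math.log2(levels) * colors * polarizations


def options(target_gbps, max_baud_gbd=50, polarizations=1):
    """List (baud, levels, colors) combinations that reach the target rate."""
    found = []
    for levels in (2, 4):            # NRZ has 2 levels, PAM4 has 4
        for baud in (25, 50):        # candidate symbol rates in GBd
            if baud > max_baud_gbd:
                continue
            per_color = baud * math.log2(levels) * polarizations
            colors = math.ceil(target_gbps / per_color)
            found.append((baud, levels, colors))
    return found


if __name__ == "__main__":
    # The prototype described earlier: 25 Gb/s NRZ, 8 colors, 2 polarizations.
    print(fiber_rate_gbps(25, 2, 8, polarizations=2), "Gb/s")
    for baud, levels, colors in options(400):
        print(f"{baud} GBd, {levels} levels, {colors} colors")
```

For a 400 Gb/s target on one polarization, NRZ at 25 GBd needs 16 colors while PAM4 at 50 GBd needs 4, which is the levels-versus-colors choice discussed in the answer.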
There is a question from Nav Nit Raal uh
and it relates to your dragonfly OCS
design that you presented before I
believe, and he is asking whether
you are able to characterize uh flows
like elephant versus mouse flows on OCS.
What is your point of view on that?
Yeah, we haven't looked at at that very
much. Um we we've looked at using the
OCS mostly for incremental buildouts.
you don't have to pull cables when you
add another group. And for field
replacement of of a of a group that has
a bad GPU, so you can take that group
down, replace the GPU, and then and then
bring it back up. Um, you know, in a lot
of these applications, um, it isn't
really flows the way traditional
networking is. It's you're exchanging
parameters, and it's, you know, one
collective operation, an all-reduce
or something like that. And so,
in the sense that each phase of
the deep learning algorithm is something
like you're doing an all-reduce, you're
doing a broadcast, it's not a flow
from a single port to another
port somewhere. Yep. And I think there
are other questions but we are at the
end of the hour and uh obviously this is
a a talk that gives goosebumps to a
network designer in particular the
second half of the talk. So you you
relate to all these problematic and
these discussions and this uh Okay. So
well let's let's give Bill a big round
of virtual applause. So it's a pleasure
to have you at Hot Interconnects
once again.
Okay. Well, thank you very much. I
actually have to run to the airport and
uh I hope I run into you guys in person
one of these days. Bye. Absolutely.