The keynote by Bill Dally emphasizes the critical role of high-performance networking in advancing Artificial Intelligence (AI), particularly in large-scale model training, and outlines current and future technological directions to meet these demands.
Okay,
today we are honored to have Bill Dally.
So I would define uh uh Bill Dally's
keynotes as the yard stick by which we
measure advances in high performance
computing and high performance
which means that every few years you go
and listen uh to a Bill Dally keynote you
know where we are in terms of state of
So this is Bill's uh bio taken from the
web page. Obviously if we had to go
through this there wouldn't be any keynote
because there are so many achievements
and and it will take probably the full
hour or more. So I I today I'm going to
give a very short but very personal
introduction of Bill Dally. So
this introduction is as follows. Believe
it or not, Bill, you changed my life.
So I was a graduate student and I
stumbled on some lecture notes from a
summer school for students that you gave
I believe in Canada in B at the end of
the 80s. Okay. So this was early 90s for
me and then I came into your notes and I
had this uh I mean crystal clear
definition of what the network is. The
network is topology, routing, and
flow control. And guess what? The most
important thing is flow control, it's not
exactly the topology and the routing.
Okay, spoiler alert.
And then when I saw this, I said, who is
this William J. Dally? This guy is
amazing. And then I went into this thing
and I discovered the virtual channels.
Okay. And then I went into some of your
papers. I mean I read this paper I'm not
kidding you maybe 10 or 15 times. Okay.
And uh so the short story is that the
Bill Dally that we know today is not a
coincidence. Okay. So it's it was
already there. It's just doomed over
there. And without further ado I think we can
go on with the keynote speaker.
Okay. So let me um share my screen.
Thank you. That was definitely a
blast from the past. I don't think I've
looked at my uh PhD thesis in several
decades. Um but uh you know it's it's
interesting you know how many things
stay the same and how many things
change. Um you know I uh wrote my first
paper on interconnection networks in the
1970s and then I worked at Caltech um
you know with with Chuck Seitz on on the
Cosmic Cube and after we had the Cosmic
Cube working I was not happy with you
know I was writing programs for I was
not happy with the performance of the
interconnect and so you know spent a lot
of time thinking what an ideal
interconnect would be both in terms of
topology and in terms of flow control
and and that's where the you know
virtual channels and the move to uh you know
low-dimensional torus topologies um came
about. Um so so the real focus then was
on running scientific applications. We
wanted to run big numerical simulations
of physical processes. Um and many
things are the same today but the
problem is different. What we really
want to do today is is is run AI because
AI is everywhere. We see it um you know
revolutionizing all all sorts of life.
It's uh already taking a big role in
education. I can see very quickly that
we're going to have individual AI tutors
for every student that knows how to
motivate them, understands how they
learn and can tailor the delivery of
material to, you know, basically both
motivate them into how they can best
absorb it. It's already making a huge
impact in medicine both in analyzing
images of all sorts as well as in, you
know, mining lots of data to do
diagnosis. Um, all forms of
entertainment are are using AI in one
form or another. um things like AI
copilot is already giving a 1.5x
improvement in productivity for
programmers and I expect that that will
um you know go get much larger and in
chip design we are already um you know
applying AI the little cartoon I have
here is for a program we have called
PrefixRL that designs optimum carry
chains by viewing it almost as a video
game where you stick in the look ahead
element, the little green dot on the
bottom right, and it has
done carry chains that beat any human
design and they're bizarre designs.
Humans would not come up with them. Um,
and so AI is everywhere and um, it's
really kind of fun being a hardware
designer because AI has been enabled by
by hardware and in particular by GPUs.
There are three ingredients that make AI
work. Um, there are algorithms that I sort
of illustrated by the I don't even see
the mouse. Um, illustrated by the AlexNet
graph on the right here. Um, these
algorithms for the most part have been
around since the 1980s. um deep neural
networks, convolutional neural networks,
and training them with um
backpropagation and stochastic gradient
descent have all been around since the
1980s. Um so it takes algorithms, it
takes data, large amounts of data
labeled or unlabeled um and then the
third ingredient is compute. And it
wasn't until we had enough compute that
we could train a large enough model on
a large enough data set um in a
reasonable amount of time that the AI
revolution really took off you know on
the order of of 10 years ago. Um now
that we have that compute um you can say
that was back in 2012 with AlexNet um
the growth has been phenomenal. um you
know during the time that people were
doing um largely convnets for images
those grew by about two orders of
magnitude in the demand, the petaflop-days
of training um over about a three-year
period now that we're in the large
language model regime um we see you know
three orders of magnitude over over a
three-year period and the little dot in
the upper right here is where I estimate
GPT-4 is you know 10 to the 6
petaflop-days. You can think of that as a
thousand exaflop-days, a thousand days
on an exaflop machine, to train GPT-4 in
in 2023.
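The unit conversion in that estimate can be checked with a few lines of arithmetic. This is only a sketch: the 10^6 petaflop-day figure is the speaker's own estimate, and the function and variable names are made up for illustration.

```python
# Unit conversion for the training-compute figures quoted in the talk.
PFLOPS = 1e15            # operations per second at one petaflop/s
EFLOPS = 1e18            # operations per second at one exaflop/s
SECONDS_PER_DAY = 86_400


def petaflop_days_to_ops(pf_days: float) -> float:
    """Total floating-point operations contained in `pf_days` petaflop-days."""
    return pf_days * PFLOPS * SECONDS_PER_DAY


def days_on_machine(pf_days: float, machine_ops_per_s: float) -> float:
    """Days a machine sustaining `machine_ops_per_s` needs for that much work."""
    return petaflop_days_to_ops(pf_days) / (machine_ops_per_s * SECONDS_PER_DAY)


gpt4_pf_days = 1e6                                  # speaker's estimate, not a measurement
total_ops = petaflop_days_to_ops(gpt4_pf_days)      # about 8.64e25 operations
days_exa = days_on_machine(gpt4_pf_days, EFLOPS)    # about 1000 days on an exaflop machine

# Three orders of magnitude over three years is roughly 10x per year.
yearly_growth = 1000 ** (1 / 3)

print(f"{total_ops:.3g} ops, {days_exa:.0f} exaflop-days, {yearly_growth:.1f}x per year")
```

The same helper shows why the number is usually quoted in exaflop-days: at petaflop scale the run would take a million machine-days.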
Um so um how how does the AI world
differ from the HPC world? Well, people
care about it. if if care is defined by
money. Um the the market for AI training
and inference is expected to be $300
billion in 2026
whereas the you know supercomputer
segment of the HPC market might might
hit 10 billion that year if if we're
lucky. Um so it's 30 times larger. Um
it's dominated by low precision. Um you
know you know FP16 and lower. Um we're
really using mostly FP8 um now. and and
I'm hoping that if we're very clever we
can even drive it down below that um as
opposed to, you know, typically FP64 in HPC.
the the dominant operation is a matrix
multiply and um when you look at how
it's used in these applications it's
limited by both compute and communication.
um you know it's interesting that there
there is an export restriction on our um
A100 and H100 parts now we cannot ship
them um to China. Um but we can ship the
A800 and H800 that have one-third of
the communication bandwidth um to stay
under the um the needs of some of these
most demanding AI applications and also
the applications um have very well
understood, persistent traffic
patterns. Now in contrast most HPC
applications are actually memory
bandwidth limited. Um it's too bad that
everybody runs the High Performance Linpack
benchmark, because it's not like
that, but the applications that people
really care about you know in
hydrodynamics, in radiation transport,
in uh you know you know climate modeling
are all memory bandwidth limited and and
wind up using a tiny fraction of the
compute because they're just saturating
the memory and also because they're
saturating the memory they're typically
not communication limited so they're
not actually stressing the network that
much which doesn't give us a hard enough
problem as network designers. Um but on
the AI side it is communication limited
and we understand the traffic pattern.
It's a large and growing market. So we
can specialize the network um for the
needs of AI. Um so what what does AI
need from the network and so you have to
look at how we get parallelism out of an
AI application. Um for as long as you
can you want to do what's called data
parallelism because it's very simple.
you simply create two copies of the AI
model you're running and um you
basically take your data set and you run
part of the data set on on one copy part
of the data set on the other copy and
they exchange parameters they exchange
the gradients I'll have a little more
detail on that in another slide. Um there are
times where you can't do that, so for
example um just to hold the parameters
of GPT4 takes over 20 GPUs just to fit
in the memory um and so there you have
to break the model up and run part of
the model um on one device and part of
the model on the other device. There are
kind of two ways you can slice this. If
you slice the model horizontally um
you're basically taking individual
matrices um and you basically
decomposing those matrices so that part
of the matrix is on one GPU and part of
the matrix is on another GPU. If you
slice this model vertically um you're
basically taking different layers of the
network and putting them on one GPU and
other layers on the other GPU. And that's
called pipeline parallelism versus
tensor parallelism. Um, and so if you
combine all three of these, the pipeline
parallelism and tensor parallelism
together are called model parallelism
because you're parallelizing the model. And
then the data parallelism is you're
parallelizing over the data set that
you're training on. Um, and if you
combine all three of these, you
basically get the most parallelism. And
you need that to train these large
language models. You need to run on
clusters of thousands to tens of
thousands of GPUs. um and it takes you
know 20 GPUs just to hold one copy of
the GPT4 model. You'll then have you
know a hundred of those copies for you
know a 2,000-GPU training um regime. So
let's start with with data parallelism.
So each individual um you know you know
GPU in data parallelism runs the whole
model. It takes a batch of training data
whether it's images or whether it's
tokens um if you're doing um large
language models and you do um the you
know forward and backward pass over that
batch and you compute a set of gradients
which are the changes that you're going
to apply to the parameters and then you
want to apply those changes not just to
your parameters but if you've got a
thousand GPUs all you know running a you
know a thousand different you know
batches um you're going to um I should
say subsets of that batch. Um, you
want to combine those gradients and
apply them all to the parameters at
once. So, if you had a batch of 256
images and you ran it over 128 GPUs,
you'd be running two images on on each
GPU. We usually run much larger batches
over even larger numbers of of GPUs. But
the the operation here that you want
your network to be really good at is
all-reduce, right? Everybody is basically
adding a bunch of numbers, you know,
into this um set of parameters. And so
you want to basically take those, do the
all-reduce for each parameter, and then
you know sum them, then distribute them um to
everybody. Um for model parallelism
there are a couple ways if you're doing
the tensor parallelism there are a
couple ways you can slice um the tensor.
You can split you know X column-wise and
A row-wise or you can split, oops,
jumped too fast. You can split A
column-wise and and X row-wise. And you
really want to do the latter because if
you do it um this way um you wind up
having to do an add um you know
synchronize, make make sure these are
both done, and then do the add and then do
the GeLU, where that's a nonlinear
operation. Um if you um do it the other
way it's completely independent. You can
actually do these two GeLUs and you
just output the data. There's no
synchronization, no global add that's
required. Um so if you look at what you
do need out of the network here um on
the forward pass you need to take this
input X and basically copy it to both
sides. So this is basically you know a a
broadcast. Um and on the output it's an
all-reduce. And remember you're typically
not just doing it two ways. You're
typically doing it you know 10 or 100
ways. Um so it's a big broadcast and a
big all-reduce. Um and then on the
backpropagation it's exactly reversed. You
wind up doing a broadcast with G and an
all-reduce with F. Um the other place
you see a lot of communication in neural
networks is in recommender systems and
the communication there is a need to
access these large embedding tables.
It's not unusual to have recommenders
where the aggregate embedding tables are
terabytes. They don't fit on one GPU or
even one CPU node. And so you wind up
having a communication of taking the um
you know the the words you're trying to
look up and running and accessing the
embedding tables for them. there's often
a reduction within the embedding table
and then and then a communication back.
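The point made earlier about how to slice the tensor for a layer of the form GeLU(XA) can be illustrated with a small pure-Python sketch. It assumes the usual formulation in which splitting A by columns gives every device a full copy of X, while splitting A by rows pairs each row block with a column block of X. The matrices, the two-way split, and all helper names are hypothetical.

```python
import math


def gelu(v):
    # Exact GeLU written with the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matmul(X, A):
    # Plain matrix multiply on lists of lists.
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)] for row in X]


def apply_gelu(M):
    return [[gelu(v) for v in row] for row in M]


X = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.5, -1.0, 2.0]]              # 2 x 4 activations
A = [[0.2, -0.1, 0.4, 0.3],
     [0.5, 0.1, -0.3, 0.2],
     [-0.4, 0.6, 0.1, -0.2],
     [0.3, -0.5, 0.2, 0.1]]              # 4 x 4 weights

reference = apply_gelu(matmul(X, A))     # what a single device would compute

# Split A by columns: each "GPU" receives a copy of X (a broadcast) and can
# apply GeLU to its own output columns with no synchronization.
A_cols = [[row[:2] for row in A], [row[2:] for row in A]]
shards = [apply_gelu(matmul(X, A_k)) for A_k in A_cols]
col_split = [s0 + s1 for s0, s1 in zip(*shards)]      # concatenate the columns

# Split A by rows and X by columns: the partial products must be added
# (an all-reduce) before GeLU, because GeLU is nonlinear.
X_cols = [[row[:2] for row in X], [row[2:] for row in X]]
A_rows = [A[:2], A[2:]]
partials = [matmul(X_k, A_k) for X_k, A_k in zip(X_cols, A_rows)]
summed = [[p + q for p, q in zip(r0, r1)] for r0, r1 in zip(*partials)]
row_split = apply_gelu(summed)

# Skipping the synchronization and applying GeLU to each partial gives the
# wrong answer.
unsynced = [[gelu(p) + gelu(q) for p, q in zip(r0, r1)] for r0, r1 in zip(*partials)]


def close(M, N, tol=1e-12):
    return all(abs(m - n) <= tol for rm, rn in zip(M, N) for m, n in zip(rm, rn))


print(close(col_split, reference), close(row_split, reference), close(unsynced, reference))
```

Both splits reproduce the single-device result, but only the column split of A does so without a sum in front of the nonlinearity, which is the reason given in the talk for preferring it.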
So how do we meet this need today? Um so
today at NVIDIA we offer something
called a DGX SuperPOD. So let me walk
you through a SuperPOD um starting at
the individual component and working
upward. So the individual GPU is a
Hopper H100. Um the GPU chip itself is
this little rectangle in the middle. Um
these uh six darker rectangles around it
are stacks of HBM3 memory. Um there's an
aggregate 94 gigabytes of HBM3 memory um
with 3.4 terabytes per second of
bandwidth into the GPU. It's an enormous
amount of bandwidth. Um everybody always
thinks these things around on the SXM
module are memory chips. These are
inductors delivering the power to this
thing. Um and you know 700 watts at
about 0.7 volts is a kiloamp of current and
and doing that efficiently is actually
quite a challenging technical problem.
Um there's a lot of neat features on on
the H100 like the transformer engine to
uh you know um facilitate the use of
reduced precision in running modern
transformer models. We actually have
dynamic programming instructions for
bioinformatics. But for the purpose of
this discussion on networking, those
aren't particularly relevant. Um the way
to think about this is it's a component
of our system that delivers four petaflops
of sparse FP8 performance and 900
gigabytes per second of external
bandwidth. That's the bandwidth of the
NVLinks coming out of of this card, at
700 watts. Um so the next step up in
building the DGX SuperPOD is we take
eight of these you know stacked here on
the board and we put them in a system
along with four um you know third
generation NVSwitches and um each of
these uh you know GPUs has you know um
18 NVLink channels coming out of it.
they're spread across, you know, five
and four, five and four across the
NVSwitches and then those are connected to
the back panel. Um, so there are actually
18 NVLinks that come out of the back
panel. And and so it's a it's one way to
think about it, it's an 8:1 taper at
this level from the bandwidth you have
within these eight GPUs and the next
level of of your network hierarchy. Um
this is the wiring diagram and and the
main reason I'm showing you this is just
to show that there are two separate
networks here. Um the um H100s are are
the GPUs here and the NVLinks on the
H100s um connect down to the
NVSwitches. Um you know you know each of
these is four and five four and five
going across. Um so that you know the
the the the uh each H100 can have
full bandwidth, all 18 of its links,
talking to um any of the other H100s
within its cluster.
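The link counts quoted for the DGX box can be checked with a little arithmetic. This sketch uses only the figures as stated in the talk (eight GPUs, 18 NVLinks per GPU spread five, four, five, four over four switches, 900 gigabytes per second per GPU, 18 links out the back panel); the constant names are made up for illustration.

```python
# Figures as stated in the talk; names are illustrative only.
GPUS_PER_BOX = 8
LINKS_PER_GPU = 18
LINKS_TO_EACH_SWITCH = [5, 4, 5, 4]   # how one GPU's links spread over the 4 switches
GPU_EXTERNAL_GB_PER_S = 900           # gigabytes per second out of one GPU
BACK_PANEL_LINKS = 18                 # as quoted by the speaker

# Every GPU link lands on exactly one of the four switches.
links_accounted_for = sum(LINKS_TO_EACH_SWITCH)

# Bandwidth of a single NVLink implied by the per-GPU figure.
gb_per_s_per_link = GPU_EXTERNAL_GB_PER_S / LINKS_PER_GPU

# GPU-facing ports needed on each switch, and links inside the box in total.
gpu_ports_per_switch = [n * GPUS_PER_BOX for n in LINKS_TO_EACH_SWITCH]
gpu_links_total = GPUS_PER_BOX * LINKS_PER_GPU

# Ratio of in-box links to links leaving the box: the taper.
taper = gpu_links_total / BACK_PANEL_LINKS

print(links_accounted_for, gb_per_s_per_link, gpu_ports_per_switch, gpu_links_total, taper)
```

With these inputs the 144 links inside the box against 18 leaving it reproduce the 8 to 1 taper the speaker describes.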
And um it if if you excluded everybody
else it could actually get all the
bandwidth out the back panel as well.
But there is an 8 to 1 taper at these
um NVSwitches. Um and that goes to the
18 OSFP connections for NVLink out the
back panel. Then the PCIe connections out
of the GPU um go to ConnectX-7
NICs uh via a PCIe switch and those then
connect to four OSFP connectors, two
CX7s on each OSFP um to build the
InfiniBand network. Um so there's two two
separate networks and we tend to think
of this as, the NVLink network as a
scale-up network and the InfiniBand
network as a scale-out network. Um both
of these networks um support SHARP
acceleration. The NVLink SHARP
acceleration um is such that we wind up
um you know basically taking the N reads
that we would have to do with the read
and reduce on um A100, and on H100 with
SHARP um it's basically one read. We
basically do you know N reads to send
the partials, the switch does the sum, and
then we read the result that we want. Um
and the broadcast result works in kind
of a similar way. We have to do only one
write and then we get N writes out of
the NVSwitch. So this is
basically a 2x for for the
all-reduces, which are a huge part of the
deep learning um workload. This is the
2x reduction in in uh in demand or the
way to think about it is 2x increase in
the effective bandwidth we have. Um, so
we take those same NVSwitches um that
we used in the uh in the SuperPOD and
we put two of them in a pizza box and um
that basically is the next level of
interconnect for the NVLink, the
scale-up network. Um and um it winds up having
128 ports coming out um the front um
with uh those are spread across the 32
OSFP cages and um enormous amount of of
you know 6.4 terabytes per second of
of bandwidth. Um, and that lets you
build up um and out depending on how you
want to do it. Um, each of the boxes
here is a u is eight GPUs, a DGX box.
You have four um per rack. Um, and then
you can either connect those up um with
NVLinks up to 256 GPUs, 32 of those
boxes, or InfiniBand up to tens of
thousands. Um, and the InfiniBand
director switches are sort of shown here
in the middle of this particular configuration.
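The switch-box and pod figures quoted here are mutually consistent, which a short calculation shows. It is a sketch built only from numbers in the talk (128 ports over 32 cages, 900 gigabytes per second over 18 links per GPU, eight GPUs per box, 32 boxes); the names are illustrative.

```python
# Figures as stated in the talk; names are illustrative only.
SWITCH_BOX_PORTS = 128
SWITCH_BOX_CAGES = 32
GPU_EXTERNAL_GB_PER_S = 900
LINKS_PER_GPU = 18
GPUS_PER_BOX = 8
BOXES_PER_NVLINK_DOMAIN = 32

# NVLink ports carried by each cage on the switch box.
ports_per_cage = SWITCH_BOX_PORTS // SWITCH_BOX_CAGES

# Per-link bandwidth implied by the per-GPU figure, in gigabytes per second.
gb_per_s_per_link = GPU_EXTERNAL_GB_PER_S / LINKS_PER_GPU

# Aggregate bandwidth of the switch box, in terabytes per second.
switch_box_tb_per_s = SWITCH_BOX_PORTS * gb_per_s_per_link / 1000

# Largest NVLink-connected system described in the talk.
gpus_per_nvlink_domain = GPUS_PER_BOX * BOXES_PER_NVLINK_DOMAIN

print(ports_per_cage, switch_box_tb_per_s, gpus_per_nvlink_domain)
```

Four ports per cage at 50 gigabytes per second each gives the quoted 6.4 terabytes per second, and 32 boxes of eight give the 256-GPU limit of the NVLink network.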
Um, and there's some real reasons to to
want to do this on the NVLink side. Um,
from the programming system point of
view, the NVLink network is a load-store
network. So I can on a given GPU once
you've set the memory maps up, I can do
a load or store operation into the
memory of any GPU on the NVLink network.
Um, and that just simplifies programming
um, compared to having to marshal your
data into message buffers and make the
MPI calls and everything else you have
to do over on the on the InfiniBand
side. But there's also just a lot more
bandwidth here. Um and you know
bisection bandwidth, there's nine times as
much bandwidth on on the um you know um
NVLink network um and uh 4.5 times as
much reduction bandwidth. Um so you know
whether it's on HPC applications um AI
inference um you know running up to 30x
speedups or AI training up to 10x
speedups there's big advantages to to
that. That communication bandwidth and
bisection bandwidth um turn into um you
know for weak scaling if we increase the
size of the model in these large
language models you know you can start
at you know you know billion parameters
and run up to trillion parameters um you
know across that range we get perfectly
linear speed up as we increase the model
size and increase the number of GPUs.
What's more impressive is we also get
very close to linear strong scaling. If
we hold the model size constant, the GPT-3 175
billion parameter model um and we scale
it from 64 GPUs to 2K GPUs, it's nearly
linear speed up. Um and that that is a
testament to the very low overhead of um
of scaling on these. Um so so just to
sum up um you know the the current state
of affairs um with the DGX SuperPOD is
um we have these you know wonderful GPUs
the H100s um each of which has 18
NVLink ports coming out of it. The um the
you know the DGX box takes eight of
those and four switches and does an 8
to 1 reduction. You can then hook up to
you know 32 of those together to make a
256-GPU um system that looks like a big GPU
and and if you are willing to pay for
the switches you can hold that 8 to1
taper so that the bisection width is you
know 1/8 of the uh aggregate bandwidth
out of out of the GPUs um above that you
you hook it up with um InfiniBand switches
and and the InfiniBand network and you can
scale up to tens of thousands of GPUs
and and it the demands of the large
language models require that we see many
customers training on tens of thousands
of of of u of GPUs and and you know many
of the problems you have, you know, with
a large scientific computer having to
do, you know, checkpoint restart and the
like happen at that scale. Um, as as
well. In fact, many of these machines
would be number one on the top 500 list
if people would take them away from the
profit-making operations that they're doing
long enough to run HPL on them. Um
so um that's where we are today and and
I should say one thing about the DGX
Super Pod is that you know compared to
sort of you know trying to acquire the
components of the system and put it
together and bring it up um which is the
way most large supercomputing um
acquisitions are done. Everything is is
preconfigured for the DGX SuperPOD. So
if you buy a DGX SuperPOD you buy the DGX
boxes and the NVLink switches and the
InfiniBand switches and you plug it
together it just works. All the software
is already done. It's been debugged and
the bring-up is a day and not a number of
months. So in terms of looking forward,
let's start with the physical layer. And
I'll start with a cartoon of you know
how one of these systems looks to the
logic designer. So you know on the chip,
we've got a bunch of logic that talks to
other logic over 2 millimeter links.
We've got our you know um links on the
interposer um out to the DRAM that uh
you know talks over very short links as
well. and and that's you know one GPU
and then we may have multiple GPUs like
on in the DGX box that's on a printed
circuit board where the connections are
now 30 to 50 centimeters
um you're going tens of terabits per second
um which is a a big reduction in
bandwidth I'll point out compared to
what you had um on the GPU um and the
current electrical interconnect here is
around five picojoules per bit. Um when we
need to go board to board we're now
going one to three meters um we've had
another reduction um in in bandwidth and
we're still electrical at about five
picojoules per bit. We tend to go optical um
on these cabinet to cabinet links.
They're 5 to 100 meters which is too far
for an electrical link unless you repeat
it every meter or so. Um and the the
power is up, but more importantly the
cost goes way up here. The the
optical links cost 10 times as much
per unit bandwidth these days um as the
electrical links. And this is worrisome
to us because the electrical links
aren't getting a whole lot better with
time. Um the chart on the left here
shows um that transistors well you get
more of them as you move from say the
the 16 nanometer node down to the five
nanometer node but they're not getting
faster. I mean the fan out of four
inverter delay has been stuck at about
10 picoseconds. Um you know since you
know the this the 16 nanometer um
technology that we used way back on uh I
don't remember which GPU that is Pascal
or something was probably our first 16
nanometer um GPU. Um and then what's
also more worrisome is that as we push
our our electrical links, we now have
electrical links going 200 uh gigabits
per second per pair. Um the reach gets
reduced, right? We could go 2 meters
when it was 100 gig. We can now only go
one meter um at 200 gig. And so it means
that we have to jump over to optical
signaling um earlier. It also puts a big
amount of pressure on us to put a lot of
GPUs close together so we can connect
them electrically. and and some
customers kind of push back on this
because we tell them, you know, we want
to build a 200 kilowatt rack, right? And
that's you look at the big
supercomputers, that's what Frontier is.
And if you build a 200 to 300 kilowatt
rack, you jam a lot of GPUs into a small
space, you can connect them all
electrically. You can get to that 256
node NVLink network electrically, but
the and the person just needs to provide
you with cold water right at that rack
and liquid cool the thing. um
that's pushing back against sort of a
data center culture where you know
people like 30 kilowatt racks and they
think 40 is a stretch. Um actually some
of them like even less than 30. Anyway,
um if you look at at the physical layer
the figures of merit are power, cost,
density and reach. Um and then if you
you know sort of bring all the students
into the classroom, you give them the
test and you get the report card. Um
this is what it looks like. Um and the
real thing to do here is to compare the
electrical cable to the active optical
cable. Um and what you see is that you
know power is about the same and a
little bit more expensive for the AOC
and not not that much. The the big issue
here is the cost. Um the density is
better on the active optical cable and
the reach is way better. But to get that
density and reach you got to pay 10x as
much um you know per per gigabit per
second. Um, and you know, we have to
find some way of dealing with that. And
so, one thing we're very excited about
is co-packaged optics using dense
wavelength division multiplexing
because our estimate is that if it's
successful, it will be even denser than
the current active optical cables at a
cost, this may be a little bit
optimistic, but within a factor of two
of of electrical signaling um, and a
power substantially less than the
electrical signaling um, and a reach
that's the same as the active optical
cable. Um now I should point out that
when you look at these figures of merit
like like especially the ones of power
and cost you have to consider the whole
link. I very often you know talk to
technology providers and they tell me
about this wonderful technology that
they have. It's almost no picojoules per bit
but they're only looking at the actual
transmitter and receiver. They're not
counting that an automatic advance.
you're not counting, you know, the
serialization and deserialization, you
know, the clock recovery um and and the
actual supply laser. So, you have to
look at the entire system. Um and one
thing I found with with, you know, some
of our um links is sometimes, you know,
we you wind up with overly enthusiastic
designers who overdesign the link layer.
You have a very reliable um you know, um
you know, physical layer out here that
doesn't require enormous amounts of
error correction and and all sorts of
stuff. But, you know, they like to
overbuild things and so they'll put, you
know, you'll have a link that's
consuming, you know, um, you know, a
tenth of a picojoule per bit and they'll
wind up putting a two picojoule per bit link
layer in front of it, which kind of defeats
the whole thing. Um, so let's talk a
little bit about optical signaling. A
system concept we've been playing with
is to um deliver switch cards that
basically have a um a a GPU switch like
the NBLink switch or one of our
Infiniban switches with co-ackaged
optics that basically come out in
pigtails and you can basically
connectorize them out um to a front
panel um with a bunch of of fiber
connectors um fiber ribbon connectors
and a GPU card with one or maybe a
couple GPUs on it with co-packaged optics
for the NV links, bringing those out to
um a connectorized panel. You would then
package these up. So you would have a
GPU rack with all these GPU cards and a
switch rack somewhere with all these
switch cards. And because these are um
you know now um optical cables, the
reach is 100 meters, there's not this
huge pressure, you know, to build the
the you know 200 kilowatt rack. Um you
can spread them out a little bit,
although you don't want to do that
gratuitously because latency is is
badness, right? And if you make these
cables long, the latency is going to get
larger. Um what what this looks like at
the next level um is uh a laser comb
source. We're hoping that quantum dot
sources will eventually become um you
know widely available, but right now
these are usually DFB arrays. Um that
you have a supply fiber that brings the
um um um the the comb into the
transmitter chip. And the transmitter
chip has a bunch of ring modulators that
basically either pass or attenuate each
line of the laser comb. Um, and they do
that at a moderate modulation rate.
And we're playing with things um, you
know, we're actually looking across the
design spectrum from having, you know, a
a few lines with very high modulation
rates to lots of lines with lower
modulation rates. So modulation rates
we're looking at range from like 25
gigabits per second up to 200 gigabits
per second. uh whatever trade-off you
make there. Um um what we want to do is
we want to have one extra line compared
to the data and and forward the clock
because then we don't have to recover
the clock on the far side. Um here um we
have a ring modulator we use to
pass only a selected line to a receiver
where we have a um a diode and a
transimpedance amplifier that um you
basically converts that signal back to
to electrical.
Um I think I basically said most of what
is on this slide. Um now here's sort of
a cross-section picture of what that
looks like. um you know on the switch
chip it sits on an interposer a um on
that interposer then we have a photonic
integrated circuit that has the optical
components um that that is the uh you
know um the rings and and uh the the
waveguides and the couplers and then we
have an electronic integrated circuit
that sits on top of that that basically
takes the you know very short reach link
from the switch um and drives the
modulators for the rings and and
basically has the transimpedance
amplifiers and and serializers and
deserializers. Um on the receive side um
essentially the same thing happens on
the co-packaged GPU. It'll be the same EIC
and the same PIC um but just a single
GPU driving it rather than having you
know a switch which will have way more
bandwidth coming out of it. Um
the power budget we're currently looking
at for for an early prototype is about
3.5 picojoules per bit. The bulk of which is
the laser. I mean these um and I have to
say this is wall plug power, right? So
if you have a 5% efficient laser um you
have to put a bunch of picojoules per bit
in to get not so many um coming out. Um
then then the uh EIC um that's basically
the serializers, deserializers, modulators
and the like um takes bulk of the power.
Yeah. And the optical thing is hugely
efficient once you have you know once
you have a a comb to modulate and the
electrical signals to drive those rings
they don't take very much power at all.
Um I think we can actually do better.
Right now we're budgeting um 250 femtojoules
per bit um for this um electrical link
from the uh host to the uh TX and RX and
I think we're actually going to be much
more likely in the in the 100 femtojoules per
bit there. Um now the u the link budget
um we're we're playing with you know
after that you know you know 20 to one
you 5% efficient laser we have about
3dBm coming out of our laser you know we
lose a bunch at each of these couplers
every time we we go in and out of a chip
um so that we're down to you know 2.5
going in here we lose some um in the
transmit chip so we're kind of um coming
out of this transmit chip at minus
4 dBm. Um and then, you know, a couple
couplers later into the receive chip
at at almost minus 7. Um after we get across
the uh the rings, we're at minus 9 and
and then at the photo detector input
minus 12, which gives us a 2dB margin,
but we would like a much bigger margin
than this. And and and one reason is
we'd like to be able to use these with
um optical circuit switching where the
insertion loss of that optical circuit
switch is bigger than 2dB, right? So we
we need to be able to tolerate more loss
on on the link. So, we're working on
trying to make every part of this a
little bit tighter and get a little bit
more link margin. Um, one of our first
prototypes, we're doing a 400 gigabit
per second per fiber, um, which is, um,
25 gigabits per second per um, color per
polarization, eight channels, um, 100
gigahertz channel spacing. We're looking
to scale this up to 800 gigabits and 1.6
terabits per fiber by doing higher
bandwidth and and more channels. Um we
built a number of test chips over the
past couple of years. The
numbers on these are usually the
year that we did them: RPC19 was done in
2019. Testing micro rings,
doing different couplers, doing
different waveguides, doing
different receive architectures. And
we're very close to putting
out a full link that we
hope we'll have operating in the very
near future. So that's the physical layer.
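
The link budget and the per-fiber rate quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, not the speaker's tooling: the stage names and helper functions are made up for illustration, "almost minus 7" is rounded to -7.0, the receiver sensitivity is inferred from the quoted 2 dB margin rather than stated in the talk, and two polarizations is an assumption chosen because 25 x 8 x 2 gives the quoted 400 gigabits per second.

```python
# Optical power along the link in dBm, as quoted in the talk.
# "almost -7" is taken as -7.0 here, which is an approximation.
POWER_DBM = [
    ("laser output", 3.0),
    ("transmit chip input", 2.5),
    ("transmit chip output", -4.0),
    ("receive chip input", -7.0),
    ("after the rings", -9.0),
    ("photodetector input", -12.0),
]
QUOTED_MARGIN_DB = 2.0


def stage_losses(points):
    """Return (segment name, loss in dB) for each pair of consecutive points."""
    return [
        (f"{a_name} -> {b_name}", a_dbm - b_dbm)
        for (a_name, a_dbm), (b_name, b_dbm) in zip(points, points[1:])
    ]


def margin_with_switch(margin_db, insertion_loss_db):
    """Margin left after adding an optical circuit switch to the path."""
    return margin_db - insertion_loss_db


def fiber_gbps(rate_per_channel_gbps, colors, polarizations):
    """Aggregate per-fiber rate from the per-color, per-polarization rate."""
    return rate_per_channel_gbps * colors * polarizations


losses = stage_losses(POWER_DBM)
total_loss_db = sum(loss for _, loss in losses)
# Sensitivity implied by the quoted numbers (an inference, not from the talk).
implied_sensitivity_dbm = POWER_DBM[-1][1] - QUOTED_MARGIN_DB

for name, loss in losses:
    print(f"{name}: {loss:.1f} dB")
print(f"total loss: {total_loss_db:.1f} dB")
print(f"implied receiver sensitivity: {implied_sensitivity_dbm:.1f} dBm")
# A hypothetical 3 dB switch would leave a negative margin.
print(f"margin with 3 dB switch: {margin_with_switch(QUOTED_MARGIN_DB, 3.0):.1f} dB")
print(f"per-fiber rate: {fiber_gbps(25, 8, 2)} Gb/s")
```

The negative margin in the switch case is the point made above: any switch with more than 2 dB of insertion loss does not fit in the current budget.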
What do you do with that physical
layer? You build a topology. And it still
amazes me that most people build folded
Clos topologies, that are often called fat
trees, but Clos wrote his paper in the
60s, Leiserson wrote his in the 80s, so
sort of a 20-year credit to Clos. And
they're just inefficient because
they sort of assume the worst case.
You have to sort of route up and then
route back down instead of going exactly
where you want to go, whereas the whole
idea of a dragonfly network is, in the
good case, you can do one long hop. You
only ever need to go
across one expensive optical link,
assuming you can wire the groups
electrically. And as a result of that,
using the assumptions that were in
the original paper, I'm
trying to remember when
this came out, 2008, it's, you
know $80 per endpoint rather than60
those constant numbers have changed but
the ratio hasn't. And the reason
you get that is that it's a
much lower diameter, and you have
essentially lower cost for the
same bisection bandwidth. The challenge
here is, it's really easy to route
a fat tree. You basically
just route obliviously up, you
can even route randomly up, until you get
to a point where you can see your
destination, and then you route
directly down to that destination.
Routing a dragonfly correctly
requires global adaptive routing. There
are otherwise traffic patterns that will
completely bring it to its knees. And
it's also very sensitive to congestion.
So you need to have some sort of
congestion avoidance mechanism if you're
going to use a dragonfly.
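
To make the routing point concrete, here is a toy sketch of why minimal-only routing can overload one global link and how an adaptive choice spreads the load. The talk does not name a specific algorithm; the comparison rule below (queue depth times hop count) is in the spirit of UGAL-style routing, and everything else is an assumption made for illustration: the function name `route_all`, the one-global-link-per-group-pair model, queues that never drain, and perfect knowledge of queue depths, which a real network does not have.

```python
import random


def route_all(num_packets, num_groups, src=0, dst=1, adaptive=True, seed=0):
    """Count packets per global link when all traffic goes from src to dst.

    Each packet either takes the single minimal global link (src, dst) or
    detours through a random intermediate group, using two global links.
    """
    rng = random.Random(seed)
    intermediates = [g for g in range(num_groups) if g not in (src, dst)]
    load = {}  # (from_group, to_group) -> packets sent on that global link
    for _ in range(num_packets):
        via = rng.choice(intermediates)
        q_min = load.get((src, dst), 0)
        q_non = load.get((src, via), 0)
        # Compare estimated delay: queue depth times number of global hops.
        if adaptive and q_min * 1 > q_non * 2:
            load[(src, via)] = q_non + 1
            load[(via, dst)] = load.get((via, dst), 0) + 1
        else:
            load[(src, dst)] = q_min + 1
    return load


minimal_only = route_all(64, 8, adaptive=False)
adaptive = route_all(64, 8, adaptive=True)
print("worst link, minimal only:", max(minimal_only.values()))
print("worst link, adaptive:", max(adaptive.values()))
```

With minimal-only routing every packet lands on the one link between the two groups. The adaptive rule sends a packet the long way whenever the minimal link's queue looks worse than a two-hop detour, so the worst-case link load drops.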
Now, one of the classic problems with
the dragonfly is that building them out
would require recabling, right? If
you had two groups and you wired them
together and then you added a third
group, now you have to take a third of
the connections from each of these two
groups and rewire it to that group.
This problem goes away with optical
circuit switching. It also solves the
problem of partitioning the dragonfly
up. If you have multiple users on it,
you can now give somebody, you know, if
you have, say, 16 groups, give somebody eight
groups, somebody else four, somebody two
and so on. And then all of their
connections don't have to go through
somebody else's group. They can all be
directly connected to each other. And
you can solve a reliability problem this
way by having a spare group and swapping
a spare group in, making the group the
field replaceable unit. So dragonflies
and optical circuit switches are a
really nice combination. So
much for topology. Now that we have our
dragonfly, how are we going to route on
it? Now, it turns out that you get
congestion in two ways. One is you get
core congestion. Say when every
endpoint, every GPU in this group wants
to talk to this group, they all try to
route minimally. They all want to use
this link and that will congest this one
global link. You can also get
endpoint congestion. If everybody in
this group wants to talk to this one endpoint,
they're going to congest the
endpoint. And these are two very
different problems, right? The
problem on the left, of having
one group all going to another
group, has to be solved by routing. It
requires global adaptive routing to
spread the load over the global links by
routing some of them non-minimally. The
problem on the right, of endpoint
congestion, requires source throttling.
Right? There's no way of getting more
over this endpoint than what its
bandwidth is. Right? If you're trying to
send three times that much bandwidth,
it's going to all back up and cause tree saturation.
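
The endpoint case reduces to simple arithmetic. The sketch below assumes three senders each offering the endpoint's full bandwidth, which matches the "three times that much bandwidth" example. The function names and the proportional scaling policy are illustrative assumptions, not a mechanism described in the talk.

```python
def backlog_growth(rates, endpoint_bw):
    """Rate at which traffic piles up in the network in front of the endpoint."""
    return max(0.0, sum(rates) - endpoint_bw)


def throttled_rates(offered, endpoint_bw):
    """Scale every sender down so the total fits the endpoint bandwidth."""
    total = sum(offered)
    if total <= endpoint_bw:
        return list(offered)
    scale = endpoint_bw / total
    return [rate * scale for rate in offered]


ENDPOINT_BW = 100.0              # arbitrary units
offered = [100.0, 100.0, 100.0]  # three senders, 3x the endpoint bandwidth

print("backlog growth, unthrottled:", backlog_growth(offered, ENDPOINT_BW))
limited = throttled_rates(offered, ENDPOINT_BW)
print("per-sender rate after throttling:", [round(r, 2) for r in limited])
print("backlog growth, throttled:", round(backlog_growth(limited, ENDPOINT_BW), 6))
```

Without throttling, two thirds of the offered traffic has nowhere to go and accumulates in switch buffers upstream. No routing choice helps, because every path ends at the same endpoint.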