Inside AI: AI21 Labs Jamba (YouTube video transcript)
The Transformer architecture has been at the center of generative AI for text generation for the last several years, but researchers have always been looking at what comes next: how can we break through the barriers of Transformers and get even more intelligence, even more performance, at a compute cost that's achievable? Some researchers devised the Mamba architecture. Mamba architectures were super interesting and they performed pretty well, but they weren't quite there. AI21 Labs saw this, combined the Mamba architecture with Transformers and some mixture of experts as well, and came up with a model they called Jamba. I wanted to find out a lot more about Jamba and Mamba, and why not talk about mixture of experts too, so I spoke to Yuval Belur from AI21 Labs here at the AWS Generative AI Loft in San Francisco, and I started off by just asking the question:
What is Jamba?

So, Jamba is a novel architecture that interleaves layers of Transformer, Mamba, and mixture of experts in order to overcome the main problems of the Transformer architecture, which are speed and memory consumption.

Okay, I love this. In that description you've basically just given a whole list of technologies, and I guess most people have heard of some of them, like Transformer architectures. Maybe we can work backwards. What's wrong with the Transformer architecture? That's what we've been using for a while, and a lot of big models have been built from it. What do you see as the challenges there?
Yeah, so Transformers really transformed, pun not intended, the natural language processing industry, because they have such high quality. It really started around 2018 and really picked up, and the whole community, all the research labs, took this architecture and made small improvements here and there, and the quality is unmatched. The way it's built, in every layer, in every Transformer block, we have the attention block, which essentially has connections between every token and every other token in the sequence. That's something which is very, very expressive; it allows you to get really high quality outputs, but it comes with quadratic complexity: you have to keep that matrix both for memory and for inference.

So you're talking about context size here. As the context gets bigger and as the model gets bigger, there's quadratic growth in the cost, and then I guess compute, latency, and everything else. Is that what we're talking about?
Yeah, definitely. With shorter contexts, like with anything in complexity, with shorter inputs it doesn't really matter; the function can be whatever it wants and with a short context you don't notice. But think about where we are now. GPT-3 had a 2K context window. Right now we have models with a 1 million token context window, a 256K context window, and the standard, basic thing is 32K, 64K, or 128K of context. When we're talking about those lengths it's really meaningful, and that's where you really see the slow performance of Transformers. If we're talking just about time, training time is clearly quadratic. Inference time is also originally quadratic, but a lot of work has been done to improve that and make it linear time. It does come with a cost, though: the cost of saving the KV cache, which essentially means you're paying with memory. So again, these are the problems, time and memory, that keep Transformers from being broadly used in production everywhere, any time you need something fast or with low memory consumption, which essentially translates to money.
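To make the quadratic part concrete, here is a minimal sketch (not tied to any particular model) of full self-attention scores: for n tokens you materialize an n-by-n matrix, so doubling the context quadruples the work and memory for that matrix.

```python
# Minimal sketch of full attention's quadratic cost (illustrative, framework-free).
import numpy as np

def attention_weights(X):
    # X: (n, d) token embeddings -> (n, n) matrix: every token attends to every token
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # n * n entries
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

print(attention_weights(np.random.randn(8, 64)).shape)   # (8, 8)

# Memory for one fp16 score matrix at the context sizes mentioned above:
for n in (2_000, 32_000, 256_000):
    print(f"{n:>7} tokens -> ~{n * n * 2 / 1e9:.1f} GB per head, per layer")
```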
Sure. So talk to me about the KV cache. We're not talking about a cache that sits outside of generation; we're talking about the cache internally, within the structure.

Yeah, so the KV cache is part of the attention mechanism: the K is the key, the V is the value. The KV cache is just a way to save the sequence you've already processed. You save it in the cache, and in the next feed-forward... maybe I'll even go back a bit. How does it work? I know people hate hearing this, it's the most basic thing to say, but you have a sequence and you do a feed-forward for every token; you keep feeding tokens into the model until the generation stops. And you keep the keys and values of the attention, for all of the sequence you've already computed, in the KV cache, so the next time you do a feed-forward you don't have to calculate them again; you just take them from the cache. That is how you go from quadratic to linear at inference.
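A minimal single-head sketch of that reuse, assuming the usual query/key/value projections (hypothetical shapes, not any model's real code): each decode step only computes the projections for the new token and reads everything else from the cache.

```python
# Greedy decoding with a KV cache: per-step work grows linearly with context,
# because keys/values for earlier tokens are stored instead of recomputed.
import torch

def attend(q, K, V):
    scores = (q @ K.T) / K.shape[-1] ** 0.5      # (1, t): new token vs. cached tokens
    return torch.softmax(scores, dim=-1) @ V     # (1, d)

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

x = torch.randn(1, d)                            # embedding of the current token
for step in range(8):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache.append(k); V_cache.append(v)         # pay with memory ...
    ctx = attend(q, torch.cat(K_cache), torch.cat(V_cache))  # ... to skip recomputation
    x = ctx                                      # stand-in for the next token's embedding
```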
Yeah, so we're trying to get more performance out of the existing Transformer architecture. More speed.

More speed, yes. However, it does come with the price of the cache, which isn't free: if you have 80 gigabytes of memory, the cache comes out of those 80 gigabytes. If you're looking at a model like Mixtral 8x7B with a 128K or 256K context window, that cache is easily 32 to 40 gigabytes. I don't remember the exact numbers, but it's one of the things that stops you from being able to use one GPU to serve this kind of thing.
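As a rough back-of-the-envelope (my own estimate, not AI21's figures), the cache grows linearly with context length, so at long contexts it rivals the weights themselves. Assuming a Mixtral-8x7B-like configuration of 32 layers and 8 KV heads of dimension 128:

```python
# KV cache = 2 (keys and values) * layers * kv_heads * head_dim * context * bytes,
# per sequence. Numbers below are illustrative, not exact vendor figures.
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

for ctx in (32_000, 128_000, 256_000):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.0f} GB of fp16 cache")
```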
Understood. So what do we do to solve these problems? Is this where Mamba comes into the conversation? Are we talking mixture of experts? There are a few different things people have done to try to solve some of this.

Yeah, so we can talk about it in two ways. I'll start with the easier one, the one most people know, which is mixture of experts. This only addresses inference time, only the speed consideration. Here you think about the fact that you have a really, really big model with a lot of parameters, but in every layer it's not just the Transformer block: you have, usually, eight experts. There's a really nice intuition behind it: for every input you have some sort of router, and based on the type of input you can say, well, this is a medical input so it goes to a medical expert, this is a finance input so it goes to a finance expert. That's nice in theory, and the idea did originate in something like that, but when we're talking about neural networks, what you actually have is this type of router at the token level. It's not that you ask a question and all of a sudden the finance expert answers it; it's token by token, through the feed-forward in the network. So the network is built as the attention layer, then a router, then, say, eight experts, and each token passes through only two of them. That's true both for training and for inference.
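A minimal sketch of that token-level top-2 routing (illustrative only; the class name, the sizes, and the absence of load-balancing losses and capacity limits are my simplifications, not anyone's production code):

```python
# Each token's router logits pick its own top-2 experts; only those experts run
# for that token, so most expert parameters stay idle on any given forward pass.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # every token gets its own routing decision
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top_k of n_experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 64)).shape)            # torch.Size([16, 64])
```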
Okay. And if someone asks me how that works, my broad answer, and I wonder if you'll agree with this, is: we don't really know, but we know it does. We can see that it pares down the amount of compute and the number of parameters you have to go through each time, and somehow it works.

Yes. That question, how does it work, I don't really know but it works, kind of describes machine learning and deep learning in general. With classical machine learning, when things are small, you can actually understand something, but with neural networks explainability is a big problem; you can't really understand what's happening inside. You can guess, you can probe things, but with language models there isn't a lot of work that has been successful at that. What you can see, first of all, is the results. In every feed-forward you're using two of eight experts, those are the standard numbers, so you're literally using about a quarter of the parameters in the model, which is what translates into active parameters. So you can see that it performs better speed-wise, and the nice thing is that you can get a model with, say, 12 billion active parameters, so it's a fast model, but it's a very high quality model because it actually has 52 billion parameters inside it. It has the expressiveness, so it can absorb a lot of information during training, but at inference time each token only goes through a small part of the model, so it's very, very fast. You do still have to store the whole model, though; all of it has to go into memory. You don't solve the memory issue there; you only solve the speed part.
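The arithmetic behind total versus active parameters, with hypothetical numbers (the split between shared and expert weights below is illustrative, not Jamba's actual breakdown):

```python
# With n_experts experts per MoE layer and top-k routing, every token runs the
# shared weights plus only k of the n expert copies.
def moe_param_counts(shared_b, expert_b, n_experts=8, top_k=2):
    total = shared_b + n_experts * expert_b      # what you must hold in memory
    active = shared_b + top_k * expert_b         # what each token actually computes with
    return total, active

total, active = moe_param_counts(shared_b=4.0, expert_b=6.0)   # billions, made-up split
print(f"total ~{total:.0f}B parameters, ~{active:.0f}B active per token")
```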
You can also see, by the way, when you're training the model or doing inference, which of the experts are activated. Part of the training process is to make sure they are balanced, because degradation is something that happens in a lot of these setups. You really don't want the model to always use the same two or three experts, because then you end up with a smaller model. It would look bigger, but it would effectively be a smaller model, one where you still have to pay for the memory, so you get nothing out of it. So you can feed it problems and see how many of the experts get activated, and you want them to be balanced. That's part of the training process, and we also test it at inference time, to see that for different types of inputs you are using all of them in some way.
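A tiny sketch of that kind of check (my illustration of the monitoring idea, not AI21's tooling): count how often each expert appears in a token's top-2 and look for skew.

```python
# Count expert utilization from router logits; a heavy skew toward a few experts
# would signal the "always the same two or three experts" collapse described above.
import collections
import torch

def expert_usage(router_logits, top_k=2):
    top = router_logits.topk(top_k, dim=-1).indices   # (tokens, top_k)
    return collections.Counter(top.flatten().tolist())

logits = torch.randn(1000, 8)        # stand-in routing decisions for 1000 tokens
print(expert_usage(logits))          # roughly uniform here; real checks use real inputs
```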
So we're still talking about Transformer architectures at the moment, and there have been a number of things, we've covered a couple of them, to try to adapt and improve the efficiency, the performance, and the cost of running these models, all with their benefits and drawbacks. I remember when I first put a course together about Transformer architectures, and I distinctly remember saying: recurrent neural networks are a thing of the past, that's the way we used to do it, now we're doing Transformers. But I have a feeling you're about to tell me that recurrent neural networks are back. Is that right?
Yes. I was part of that wave too, saying, oh, recurrent neural networks, they were very difficult to work with. It's not easy to understand what's happening, they're really not efficient to train, and the explainability there was even worse than in other models. But that really seems to be what's happening now with Mamba. It's actually funny; you can look at it in two different ways. You can either see it as an evolution from RNNs to linear RNNs to Mamba, or as a state space evolution, from state space models to selective state space models, which again is Mamba. The point of all of these, which are the same principles stated in different ways, is that instead of looking at everything, all the history, all the sequence, all the context you have, at every step, you save it all in some sort of state: something you can think of as a kind of compression or representation, a way of taking everything you've seen so far and keeping it in a form that's meaningful for determining the next token. In that case, every time you do the feed-forward and need to predict the next token, instead of looking back at all the context, you look at a representation of all the context that has happened. It's called the hidden state, h, when we're talking about RNNs, and the state when we're talking about SSMs. This is something that really emerged recently; it really resembles an RNN, but they actually took it from the SSM, the state space model, and they really improved on the work of state space models in order to build Mamba.
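The core of that "fold the history into a state" idea, as a minimal linear recurrence (illustrative of RNN/SSM-style layers in general, not Mamba's actual formulation):

```python
# Each step updates a fixed-size state from the previous state and the new input,
# so per-token work is constant regardless of how long the context is.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 16, 8
A = 0.9 * np.eye(d_state)                  # how the old state carries over
B = rng.normal(size=(d_state, d_in))       # how the new token is written into the state
C = rng.normal(size=(1, d_state))          # how the state is read out

h = np.zeros(d_state)
for x in rng.normal(size=(10, d_in)):      # one step per token
    h = A @ h + B @ x                      # the whole history lives inside h
    y = C @ h                              # signal used to predict the next token
print(h.shape, y.shape)                    # (8,) (1,)
```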
Okay, so we're not back to RNNs; it just looks a bit like an RNN, it borrows from that, and we've got these state space models coming in again. So can you describe Mamba? We've reached the point where we're talking about Mamba. What's the performance of Mamba like, and what are the problems with Mamba?
Yeah, so before I talk about the problems of Mamba, let's talk about the good things about Mamba. By the way, if you ask anyone in the business, and you literally just said the same thing to me, anybody who's been ten years or so in machine learning, the first thing they say when they hear about Mamba is: that's just an RNN, a fancy RNN, that's really all it is. And yes, it's the same concept, and that's the amazing thing the Mamba creators did. I won't go into the math too much; I'll just say that they took state space models, which are very, very efficient because a lot can be calculated ahead of time, it's a bit like using a CNN, a convolutional neural network, to compute those things, so it was very efficient, and they introduced something called a selective state space model, where the representation is not uniform, not equal for every token. If you think about the phrase "I want to eat a hamburger" and you want to predict the next word, not all the words matter equally: "want" isn't really giving us anything, "to" and "a" are words that aren't as meaningful to store when we're determining the state. So they added the selective part, which you can think of as giving a different weight to every token according to its importance, and that really improves the performance. The problem is that now the matrices they need to calculate are no longer constant.
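A heavily simplified sketch of what "selective" means here, with input-dependent parameters (my own single-output toy, assuming a softplus gate and a diagonal decay; real Mamba uses a discretized multi-channel scan with hardware-aware kernels):

```python
# The state update is gated per token: unimportant tokens write little into the
# state, important ones write more. This is the "different weight per token" idea.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 16, 8
W_dt = rng.normal(size=d_in)               # token-dependent step size ("how much to write")
W_B = rng.normal(size=(d_state, d_in))     # token-dependent write direction
W_C = rng.normal(size=(d_state, d_in))     # token-dependent readout
A = -np.exp(rng.normal(size=d_state))      # fixed per-channel decay rates

def selective_scan(xs):                    # xs: (seq_len, d_in) -> (seq_len,)
    h, ys = np.zeros(d_state), []
    for x in xs:
        dt = np.log1p(np.exp(W_dt @ x))    # softplus gate, different for every token
        h = np.exp(A * dt) * h + dt * (W_B @ x)
        ys.append(float((W_C @ x) @ h))
    return np.array(ys)

print(selective_scan(rng.normal(size=(10, d_in))).shape)   # (10,)
```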
Sure, those sound a lot like attention weights.

At the level of the idea, it kind of is, and that's where everything really connects: all those principles of, okay, we need this to be fast, we want something where we can do fast inference and improve performance, but we're still lacking in quality compared with Transformers. That's what the creators of Mamba had to show, because plotting graphs showing that they're faster than Transformers isn't hard: there's no KV cache, and all the handling of the context is linear, with constant work per step. What they needed to show is that they're equivalent in quality. The selective part really helped them improve the quality, and they did have to do a lot of optimization, hardware optimization deep in the kernels, and they devised another algorithm to compute all of those things efficiently. But that was the main premise: look, we really managed to improve quality. If you read their papers you'll see experiments where they're as good as Transformers on several tasks, and much faster. That's really the promise, because usually you have to choose between improving quality and improving performance; you usually can't do both. So they really elevated the state space model, which in a sense is the same as elevating RNNs, and made it something that competes with Transformers. I will say that in their work they got up to a few billion parameters, up to about 7 billion I think, so it's nice in theory, but it still needed something more to show you can scale it to production.
Okay, so where do you go from that? How do you build on top of Mamba? Because I guess that's what Jamba is, right?

Yeah. When we wanted to release our new line of models, we thought about how to make it best for production, best for developers: how can we get a model which is very expressive and very high quality, but that you can also fit onto a single A100 GPU? That was one of our requirements from the beginning. When we first saw Mamba, which was published in December 2023, so it's really new, we started to experiment with it a little bit. There was a lot of talk, I remember, about maybe just scaling pure Mamba: just take the architecture and make it bigger, which is not an easy thing to do by itself, but still, just do a pure Mamba model. It turns out that even though it works really well on several tasks, comparable with Transformers, it is lacking in a lot of areas, and I think the place you see it most is tasks that require looking at specific tokens. There's a paper called "Repeat After Me" showing that Transformers are better than Mamba at copying tasks, where you actually have to copy parts from the input, or, even easier to think about, few-shot prompting.
And there's a very basic, well-known dataset, IMDB reviews, for sentiment analysis. With sentiment analysis you want a binary output, it's a classification task, positive or negative, those are the actual labels. If you give it to a Transformer, it will just do it. But if you give it to Mamba, and this was one of the experiments that really alerted us to this, it will say something like "kind of bad".

Right, so positive or negative, and it says "bad". So it sort of gets the idea of what you're trying to do, but it gets the actual output wrong. That's significant, I guess, because a lot of us are very used to in-context learning and everything that comes from it, so RAG and everything else comes into play, and sometimes we want the model to be specific about the actual information we've just given it. That's really important to us, so I guess that's a problem.

Yeah. And really, if you want something developers will actually use, then exactly as you said, output stability is important, post-processing is important. Something that semantically has the same meaning is nice, but it's not something you can actually build with.
That was when we started to really play with the idea of combining these things. One of the nice things about Mamba is that, because of its architecture, it's much more efficient to train than pure Transformers. So our team started to play around with interleaving different types of layers, and they essentially created what we now call Jamba blocks, which interleave layers of Mamba and Transformers, and of course they added the mixture of experts as well, but that's less interesting right now. So it's really a combination of Mamba and Transformer layers. On one hand you want as many Mamba layers as possible relative to Transformer layers, because you want it to be fast, but you do need some Transformer layers in order to get the same quality, to take Mamba and elevate it to the places it just can't reach by itself. We did a lot of experiments at small scale, and by the way there's a lot of detail in the white paper we released describing all of them, and in the end we came up with two different types of Jamba blocks: one had one Transformer layer and three Mamba layers, and the other had one Transformer layer and seven, so one-to-three and one-to-seven. All these numbers came from the fact that what we wanted was to take the model we ended up with, which had 52 billion total parameters, and be able to serve it on one A100 GPU with as much context as possible. In the choice between one-to-three and one-to-seven, one-to-seven is clearly much more efficient in terms of latency and memory, but it had the same performance, so we opted to go with it. This is how our Jamba block looks: one Transformer layer and seven Mamba layers, four of which have mixture of experts.
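As a schematic of that interleaving (my reading of the description above; the exact ordering of the layers and which ones carry the MoE are a guess, not AI21's published configuration):

```python
# One "Jamba block": 8 layers total, 1 attention + 7 Mamba, with a mixture-of-experts
# MLP on every other layer so that 4 of the 8 layers carry experts.
def jamba_block_layout(n_layers=8, moe_every=2):
    layers = []
    for i in range(n_layers):
        kind = "attention" if i == n_layers - 1 else "mamba"
        mlp = "moe" if i % moe_every == 1 else "dense"
        layers.append(f"{kind}+{mlp}")
    return layers

print(jamba_block_layout())
# ['mamba+dense', 'mamba+moe', 'mamba+dense', 'mamba+moe',
#  'mamba+dense', 'mamba+moe', 'mamba+dense', 'attention+moe']
```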
Right, so we're mixing the best of both worlds, and through all of that research you figured out where, hopefully, the sweet spot is. There's always more research to be done, but you found a place with a good balance and you trained a large model.

Yeah, there were a lot of experiments, exactly as you said, to find that sweet spot, which turned out to be one-to-seven, and then we tried to scale it. When you want to scale something like that, it's not just, okay, let's concatenate a lot of these layers together and put it into training; there's extra work that has to be done just to make it scale. We did it in two phases. The first Jamba was released in March; that model has 52 billion total parameters, which we now call Jamba Mini, with something like 12 billion active parameters. That was the first step: taking it from a few billion parameters to something which is production grade. It's what we now call a small model; think about it, a few years ago 7 billion would have been a huge model, and now 52 billion total with mixture of experts and around 12 billion active parameters sounds very small.
Yeah, I think one thing generative AI has done is redefine small, medium, and large as terms and what they actually mean. You've talked about the experimentation you've done and the different sizes of models, but how on earth have you benchmarked it? How do you know? I mean, presumably it's more than just a vibe check where you prompt it and go, yeah, that looks good. How do you quantify its performance?

So we chose several academic benchmarks, where we wanted to make sure we covered different tasks, both extractive and abstractive, because Mamba by itself really excelled at abstractive tasks, but on extractive tasks, where you actually need to copy things from the input, not so much. So we had a combination of several of these benchmarks. You can also look at the training loss to see that the model converges, and once we got to our final candidate, we have a human evaluation team in-house, so we used them to determine that we were going in the right direction. That was the first experiment, training the 52 billion parameter model that was released in March, and that was the big release, the announcement of this architecture. Then we took it up a notch, to a model which is almost 400 billion total parameters, which is Jamba 1.5 Large. That's what we released, I think, one or two months ago, depending on when this comes out.
So the Jamba 1.5 Mini is a fine-tuned version of the one we released in March, and the Large is the same type of architecture, just with lots more of these Jamba blocks inside it.

And something which I understand is a bit new for AI21: the weights are publicly available.

Yes, that's one of the key things. We released the base model for Jamba Mini in March to see how the community would react to it, and the responses were amazing, because I think people understand that, yes, everybody is focused on Transformers, and there are a lot of improvements, a lot of tricks, a lot of people you can ask for help, essentially a big community around Transformers, but at some point it becomes saturated. There's a limit to the number of tricks you can do, and somewhere, someone has to say, well, maybe we need a new architecture for different types of tasks, or different use cases, or when we really need long context, which takes Transformers too much time. So we released it with open weights in March to see what people would do, and there were a lot of downloads and a lot of talk around it; people were excited about it. That's why, when we launched the new Jamba 1.5 series, we said we really want developers to continue to engage with it, we want to create a community here, because it's not something that's just ours. We want people to adopt it, to take it to the next level, to build something around it, to take the research and push it forward, because we do believe there are a lot of places where this architecture can be improved.
Sure. And if people want to get their hands on it, I guess one of the easiest ways to do that is Amazon Bedrock, right?

Yeah, totally. If you want to get your hands dirty, fine-tune, and download the models, go to Hugging Face. If it's more "I don't care, I just want to use a model", then Amazon Bedrock is totally the way to go. You can just go there; you have Large, you have Mini, whatever suits your use case.

And I think that's proven to be quite a successful model; I think developers really chime with the idea that you can actually get your hands on it, rip it apart, and put it wherever you want. So I'm assuming that for things like Ollama, where we've got these quantized small models, we're not going to see it there anytime soon. Would that be right, because the architecture is quite different?
So you actually can quantize it. We created a new quantization technique, which is publicly available on Hugging Face, that essentially takes our model from 16 bits to 8 bits and back with essentially no information loss. It relies on the fact that something like 90 or 95 percent of the weights are actually in the MLP layers, so we found a way to do this quantization on the fly, and it works really well. You can also, by the way, in Hugging Face, change it to quantize to 4-bit. I'm not quite sure about the other platforms; I think we're in contact with them, but they can also do it themselves; it's not like a 4-bit version is already sitting there. But it can be squeezed, it can be squeezed, totally. I'm not sure how it performs, I must say, I haven't seen the 4-bit version myself, but I'm excited to see it.
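For intuition, here is a generic symmetric int8 round trip (this is not AI21's technique, which they describe as an on-the-fly quantization of the MLP/expert weights; it just shows the kind of 16-bit to 8-bit compression being discussed):

```python
# Halve the memory of a weight matrix by storing int8 values plus one fp scale,
# then reconstruct an approximation when the weights are needed.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float16)
q, s = quantize_int8(w.astype(np.float32))
w_hat = dequantize(q, s)
print("bytes:", w.nbytes, "->", q.nbytes)    # 2x smaller
print("max abs error:", float(np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max()))
```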
Well, look, I think this is really exciting. I mean, AI21 Labs is a small, focused team, I think it's probably fair to describe it like that, and the weights are publicly available, so people can go and hack on it and, I guess, surprise you and show you what they've done with it as well.

Yeah, and Matt, I for one am really excited to see whatever the community does. Whatever anybody builds with it, I'm like, yay.

Absolutely. Well, look, thank you so much for spending time with me and going through all of this. There's a lot to take in here, and I think it's really exciting to see work being done that looks elsewhere, beyond Transformers, and tries to find the next path forward. So thank you so much for spending time with us.

Thank you so much for having me.

A huge thanks to Yuval and everybody from AI21 Labs for helping to make this video. Please give this video a thumbs up and subscribe to the AWS Developers channel, and maybe click on one of these videos around us.