Inside AI: AI21 Labs Jamba (YouTube video transcript)
The Transformer architecture has been at the center of generative AI for text generation for the last several years, but researchers have always been looking at what comes next: how can we break through the barriers of Transformers and get even more intelligence, even more performance, at a compute cost that's achievable? Some researchers devised the Mamba architecture. Mamba architectures were super interesting and they performed pretty well, but they weren't quite there. AI21 Labs saw this, combined the Mamba architecture with Transformers and some mixture of experts as well, and came up with a model they called Jamba. I wanted to find out a lot more about Jamba and Mamba, and why not talk about mixture of experts too, so I spoke to Yuval Belur from AI21 Labs here at the AWS Generative AI Loft in San Francisco, and I started off by just asking the question:
What is Jamba?

So, Jamba is a novel architecture that interleaves layers of Transformer, Mamba, and mixture of experts in order to overcome the main problems of the Transformer architecture, which are speed and memory consumption.

Okay, I love this. In that description you've basically just given a whole list of technologies, and I guess most people have heard of some of them, like Transformer architectures. Maybe we can work backwards. What's wrong with the Transformer architecture? That's what we've been using for a while, and a lot of big models have been built from it. What do you see as the challenges there?
Yeah, so Transformers really transformed, pun not intended, the natural language processing industry, because they have such high quality. It really started around 2018 and really picked up, and the whole community, all the research labs, took this architecture and made small improvements here and there, and the quality is unmatched. The way it's built, in every layer, in every Transformer block, we have the attention block, which essentially has connections between every token and every other token in the sequence. That's something which is very, very expressive; it allows you to get really high quality outputs, but it comes with quadratic complexity: you have to keep that matrix both for memory and for inference.

So you're talking about context size here. As the context gets bigger and as the model gets bigger, there's quadratic growth in the cost, and then I guess compute, latency, and everything else. Is that what we're talking about?
Yeah, definitely. With shorter contexts, like with anything in complexity, with shorter inputs it doesn't really matter; the function can be whatever it wants and with a short context you don't notice. But think about where we are now. GPT-3 had a 2K context window. Right now we have models with a 1 million token context window, a 256K context window, and the standard, basic thing is 32K, 64K, or 128K of context. When we're talking about those lengths it's really meaningful, and that's where you really see the slow performance of Transformers. If we're talking just about time, training time is clearly quadratic. Inference time is also originally quadratic, but a lot of work has been done to improve that and make it linear time. It does come with a cost, though: the cost of saving the KV cache, which essentially means you're paying with memory. So again, these are the problems, time and memory, that keep Transformers from being broadly used in production everywhere, any time you need something fast or with low memory consumption, which essentially translates to money.
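To make the quadratic part concrete, here is a minimal sketch (not tied to any particular model) of full self-attention scores: for n tokens you materialize an n-by-n matrix, so doubling the context quadruples the work and memory for that matrix.

```python
# Minimal sketch of full attention's quadratic cost (illustrative, framework-free).
import numpy as np

def attention_weights(X):
    # X: (n, d) token embeddings -> (n, n) matrix: every token attends to every token
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # n * n entries
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

print(attention_weights(np.random.randn(8, 64)).shape)   # (8, 8)

# Memory for one fp16 score matrix at the context sizes mentioned above:
for n in (2_000, 32_000, 256_000):
    print(f"{n:>7} tokens -> ~{n * n * 2 / 1e9:.1f} GB per head, per layer")
```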
Sure. So talk to me about the KV cache. We're not talking about a cache that sits outside of generation; we're talking about the cache internally, within the structure.

Yeah, so the KV cache is part of the attention mechanism: the K is the key, the V is the value. The KV cache is just a way to save the sequence you've already processed. You save it in the cache, and in the next feed-forward... maybe I'll even go back a bit. How does it work? I know people hate hearing this, it's the most basic thing to say, but you have a sequence and you do a feed-forward for every token; you keep feeding tokens into the model until the generation stops. And you keep the keys and values of the attention, for all of the sequence you've already computed, in the KV cache, so the next time you do a feed-forward you don't have to calculate them again; you just take them from the cache. That is how you go from quadratic to linear at inference.
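A minimal single-head sketch of that reuse, assuming the usual query/key/value projections (hypothetical shapes, not any model's real code): each decode step only computes the projections for the new token and reads everything else from the cache.

```python
# Greedy decoding with a KV cache: per-step work grows linearly with context,
# because keys/values for earlier tokens are stored instead of recomputed.
import torch

def attend(q, K, V):
    scores = (q @ K.T) / K.shape[-1] ** 0.5      # (1, t): new token vs. cached tokens
    return torch.softmax(scores, dim=-1) @ V     # (1, d)

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

x = torch.randn(1, d)                            # embedding of the current token
for step in range(8):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache.append(k); V_cache.append(v)         # pay with memory ...
    ctx = attend(q, torch.cat(K_cache), torch.cat(V_cache))  # ... to skip recomputation
    x = ctx                                      # stand-in for the next token's embedding
```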
Yeah, so we're trying to get more performance out of the existing Transformer architecture. More speed.

More speed, yes. However, it does come with the price of the cache, which isn't free: if you have 80 gigabytes of memory, the cache comes out of those 80 gigabytes. If you're looking at a model like Mixtral 8x7B with a 128K or 256K context window, that cache is easily 32 to 40 gigabytes. I don't remember the exact numbers, but it's one of the things that stops you from being able to use one GPU to serve this kind of thing.
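As a rough back-of-the-envelope (my own estimate, not AI21's figures), the cache grows linearly with context length, so at long contexts it rivals the weights themselves. Assuming a Mixtral-8x7B-like configuration of 32 layers and 8 KV heads of dimension 128:

```python
# KV cache = 2 (keys and values) * layers * kv_heads * head_dim * context * bytes,
# per sequence. Numbers below are illustrative, not exact vendor figures.
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

for ctx in (32_000, 128_000, 256_000):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.0f} GB of fp16 cache")
```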
Understood. So what do we do to solve these problems? Is this where Mamba comes into the conversation? Are we talking mixture of experts? There are a few different things people have done to try to solve some of this.

Yeah, so we can talk about it in two ways. I'll start with the easier one, the one most people know, which is mixture of experts. This only addresses inference time, only the speed consideration. Here you think about the fact that you have a really, really big model with a lot of parameters, but in every layer it's not just the Transformer block: you have, usually, eight experts. There's a really nice intuition behind it: for every input you have some sort of router, and based on the type of input you can say, well, this is a medical input so it goes to a medical expert, this is a finance input so it goes to a finance expert. That's nice in theory, and the idea did originate in something like that, but when we're talking about neural networks, what you actually have is this type of router at the token level. It's not that you ask a question and all of a sudden the finance expert answers it; it's token by token, through the feed-forward in the network. So the network is built as the attention layer, then a router, then, say, eight experts, and each token passes through only two of them. That's true both for training and for inference.
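A minimal sketch of that token-level top-2 routing (illustrative only; the class name, the sizes, and the absence of load-balancing losses and capacity limits are my simplifications, not anyone's production code):

```python
# Each token's router logits pick its own top-2 experts; only those experts run
# for that token, so most expert parameters stay idle on any given forward pass.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # every token gets its own routing decision
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top_k of n_experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 64)).shape)            # torch.Size([16, 64])
```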
Okay. And if someone asks me how that works, my broad answer, and I wonder if you'll agree with this, is: we don't really know, but we know it does. We can see that it pares down the amount of compute and the number of parameters you have to go through each time, and somehow it works.

Yes. That question, how does it work, I don't really know but it works, kind of describes machine learning and deep learning in general. With classical machine learning, when things are small, you can actually understand something, but with neural networks explainability is a big problem; you can't really understand what's happening inside. You can guess, you can probe things, but with language models there isn't a lot of work that has been successful at that. What you can see, first of all, is the results. In every feed-forward you're using two of eight experts, those are the standard numbers, so you're literally using about a quarter of the parameters in the model, which is what translates into active parameters. So you can see that it performs better speed-wise, and the nice thing is that you can get a model with, say, 12 billion active parameters, so it's a fast model, but it's a very high quality model because it actually has 52 billion parameters inside it. It has the expressiveness, so it can absorb a lot of information during training, but at inference time each token only goes through a small part of the model, so it's very, very fast. You do still have to store the whole model, though; all of it has to go into memory. You don't solve the memory issue there; you only solve the speed part.
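The arithmetic behind total versus active parameters, with hypothetical numbers (the split between shared and expert weights below is illustrative, not Jamba's actual breakdown):

```python
# With n_experts experts per MoE layer and top-k routing, every token runs the
# shared weights plus only k of the n expert copies.
def moe_param_counts(shared_b, expert_b, n_experts=8, top_k=2):
    total = shared_b + n_experts * expert_b      # what you must hold in memory
    active = shared_b + top_k * expert_b         # what each token actually computes with
    return total, active

total, active = moe_param_counts(shared_b=4.0, expert_b=6.0)   # billions, made-up split
print(f"total ~{total:.0f}B parameters, ~{active:.0f}B active per token")
```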
You can also see, by the way, when you're training the model or doing inference, which of the experts are activated. Part of the training process is to make sure they are balanced, because degradation is something that happens in a lot of these setups. You really don't want the model to always use the same two or three experts, because then you end up with a smaller model. It would look bigger, but it would effectively be a smaller model, one where you still have to pay for the memory, so you get nothing out of it. So you can feed it problems and see how many of the experts get activated, and you want them to be balanced. That's part of the training process, and we also test it at inference time, to see that for different types of inputs you are using all of them in some way.
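A tiny sketch of that kind of check (my illustration of the monitoring idea, not AI21's tooling): count how often each expert appears in a token's top-2 and look for skew.

```python
# Count expert utilization from router logits; a heavy skew toward a few experts
# would signal the "always the same two or three experts" collapse described above.
import collections
import torch

def expert_usage(router_logits, top_k=2):
    top = router_logits.topk(top_k, dim=-1).indices   # (tokens, top_k)
    return collections.Counter(top.flatten().tolist())

logits = torch.randn(1000, 8)        # stand-in routing decisions for 1000 tokens
print(expert_usage(logits))          # roughly uniform here; real checks use real inputs
```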
So we're still talking about Transformer architectures at the moment, and there have been a number of things, we've covered a couple of them, to try to adapt and improve the efficiency, the performance, and the cost of running these models, all with their benefits and drawbacks. I remember when I first put a course together about Transformer architectures, and I distinctly remember saying: recurrent neural networks are a thing of the past, that's the way we used to do it, now we're doing Transformers. But I have a feeling you're about to tell me that recurrent neural networks are back. Is that right?
Yes. I was part of that wave too, saying, oh, recurrent neural networks, they were very difficult to work with. It's not easy to understand what's happening, they're really not efficient to train, and the explainability there was even worse than in other models. But that really seems to be what's happening now with Mamba. It's actually funny; you can look at it in two different ways. You can either see it as an evolution from RNNs to linear RNNs to Mamba, or as a state space evolution, from state space models to selective state space models, which again is Mamba. The point of all of these, which are the same principles stated in different ways, is that instead of looking at everything, all the history, all the sequence, all the context you have, at every step, you save it all in some sort of state: something you can think of as a kind of compression or representation, a way of taking everything you've seen so far and keeping it in a form that's meaningful for determining the next token. In that case, every time you do the feed-forward and need to predict the next token, instead of looking back at all the context, you look at a representation of all the context that has happened. It's called the hidden state, h, when we're talking about RNNs, and the state when we're talking about SSMs. This is something that really emerged recently; it really resembles an RNN, but they actually took it from the SSM, the state space model, and they really improved on the work of state space models in order to build Mamba.
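The core of that "fold the history into a state" idea, as a minimal linear recurrence (illustrative of RNN/SSM-style layers in general, not Mamba's actual formulation):

```python
# Each step updates a fixed-size state from the previous state and the new input,
# so per-token work is constant regardless of how long the context is.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 16, 8
A = 0.9 * np.eye(d_state)                  # how the old state carries over
B = rng.normal(size=(d_state, d_in))       # how the new token is written into the state
C = rng.normal(size=(1, d_state))          # how the state is read out

h = np.zeros(d_state)
for x in rng.normal(size=(10, d_in)):      # one step per token
    h = A @ h + B @ x                      # the whole history lives inside h
    y = C @ h                              # signal used to predict the next token
print(h.shape, y.shape)                    # (8,) (1,)
```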
Okay, so we're not back to RNNs; it just looks a bit like an RNN, it borrows from that, and we've got these state space models coming in again. So can you describe Mamba? We've reached the point where we're talking about Mamba. What's the performance of Mamba like, and what are the problems with Mamba?
Yeah, so before I talk about the problems of Mamba, let's talk about the good things about Mamba. By the way, if you ask anyone in the business, and you literally just said the same thing to me, anybody who's been ten years or so in machine learning, the first thing they say when they hear about Mamba is: that's just an RNN, a fancy RNN, that's really all it is. And yes, it's the same concept, and that's the amazing thing the Mamba creators did. I won't go into the math too much; I'll just say that they took state space models, which are very, very efficient because a lot can be calculated ahead of time, it's a bit like using a CNN, a convolutional neural network, to compute those things, so it was very efficient, and they introduced something called a selective state space model, where the representation is not uniform, not equal for every token. If you think about the phrase "I want to eat a hamburger" and you want to predict the next word, not all the words matter equally: "want" isn't really giving us anything, "to" and "a" are words that aren't as meaningful to store when we're determining the state. So they added the selective part, which you can think of as giving a different weight to every token according to its importance, and that really improves the performance. The problem is that now the matrices they need to calculate are no longer constant.
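A heavily simplified sketch of what "selective" means here, with input-dependent parameters (my own single-output toy, assuming a softplus gate and a diagonal decay; real Mamba uses a discretized multi-channel scan with hardware-aware kernels):

```python
# The state update is gated per token: unimportant tokens write little into the
# state, important ones write more. This is the "different weight per token" idea.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 16, 8
W_dt = rng.normal(size=d_in)               # token-dependent step size ("how much to write")
W_B = rng.normal(size=(d_state, d_in))     # token-dependent write direction
W_C = rng.normal(size=(d_state, d_in))     # token-dependent readout
A = -np.exp(rng.normal(size=d_state))      # fixed per-channel decay rates

def selective_scan(xs):                    # xs: (seq_len, d_in) -> (seq_len,)
    h, ys = np.zeros(d_state), []
    for x in xs:
        dt = np.log1p(np.exp(W_dt @ x))    # softplus gate, different for every token
        h = np.exp(A * dt) * h + dt * (W_B @ x)
        ys.append(float((W_C @ x) @ h))
    return np.array(ys)

print(selective_scan(rng.normal(size=(10, d_in))).shape)   # (10,)
```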
Sure, those sound a lot like attention weights.

At the level of the idea, it kind of is, and that's where everything really connects: all those principles of, okay, we need this to be fast, we want something where we can do fast inference and improve performance, but we're still lacking in quality compared with Transformers. That's what the creators of Mamba had to show, because plotting graphs showing that they're faster than Transformers isn't hard: there's no KV cache, and all the handling of the context is linear, with constant work per step. What they needed to show is that they're equivalent in quality. The selective part really helped them improve the quality, and they did have to do a lot of optimization, hardware optimization deep in the kernels, and they devised another algorithm to compute all of those things efficiently. But that was the main premise: look, we really managed to improve quality. If you read their papers you'll see experiments where they're as good as Transformers on several tasks, and much faster. That's really the promise, because usually you have to choose between improving quality and improving performance; you usually can't do both. So they really elevated the state space model, which in a sense is the same as elevating RNNs, and made it something that competes with Transformers. I will say that in their work they got up to a few billion parameters, up to about 7 billion I think, so it's nice in theory, but it still needed something more to show you can scale it to production.
Okay, so where do you go from that? How do you build on top of Mamba? Because I guess that's what Jamba is, right?

Yeah. When we wanted to release our new line of models, we thought about how to make it best for production, best for developers: how can we get a model which is very expressive and very high quality, but that you can also fit onto a single A100 GPU? That was one of our requirements from the beginning. When we first saw Mamba, which was published in December 2023, so it's really new, we started to experiment with it a little bit. There was a lot of talk, I remember, about maybe just scaling pure Mamba: just take the architecture and make it bigger, which is not an easy thing to do by itself, but still, just do a pure Mamba model. It turns out that even though it works really well on several tasks, comparable with Transformers, it is lacking in a lot of areas, and I think the place you see it most is tasks that require looking at specific tokens. There's a paper called "Repeat After Me" showing that Transformers are better than Mamba at copying tasks, where you actually have to copy parts from the input, or, even easier to think about, few-shot prompting.
And there's a very basic, well-known dataset, IMDB reviews, for sentiment analysis. With sentiment analysis you want a binary output, it's a classification task, positive or negative, those are the actual labels. If you give it to a Transformer, it will just do it. But if you give it to Mamba, and this was one of the experiments that really alerted us to this, it will say something like "kind of bad".

Right, so positive or negative, and it says "bad". So it sort of gets the idea of what you're trying to do, but it gets the actual output wrong. That's significant, I guess, because a lot of us are very used to in-context learning and everything that comes from it, so RAG and everything else comes into play, and sometimes we want the model to be specific about the actual information we've just given it. That's really important to us, so I guess that's a problem.

Yeah. And really, if you want something developers will actually use, then exactly as you said, output stability is important, post-processing is important. Something that semantically has the same meaning is nice, but it's not something you can actually build with.
That was when we started to really play with the idea of combining these things. One of the nice things about Mamba is that, because of its architecture, it's much more efficient to train than pure Transformers. So our team started to play around with interleaving different types of layers, and they essentially created what we now call Jamba blocks, which interleave layers of Mamba and Transformers, and of course they added the mixture of experts as well, but that's less interesting right now. So it's really a combination of Mamba and Transformer layers. On one hand you want as many Mamba layers as possible relative to Transformer layers, because you want it to be fast, but you do need some Transformer layers in order to get the same quality, to take Mamba and elevate it to the places it just can't reach by itself. We did a lot of experiments at small scale, and by the way there's a lot of detail in the white paper we released describing all of them, and in the end we came up with two different types of Jamba blocks: one had one Transformer layer and three Mamba layers, and the other had one Transformer layer and seven, so one-to-three and one-to-seven. All these numbers came from the fact that what we wanted was to take the model we ended up with, which had 52 billion total parameters, and be able to serve it on one A100 GPU with as much context as possible. In the choice between one-to-three and one-to-seven, one-to-seven is clearly much more efficient in terms of latency and memory, but it had the same performance, so we opted to go with it. This is how our Jamba block looks: one Transformer layer and seven Mamba layers, four of which have mixture of experts.
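As a schematic of that interleaving (my reading of the description above; the exact ordering of the layers and which ones carry the MoE are a guess, not AI21's published configuration):

```python
# One "Jamba block": 8 layers total, 1 attention + 7 Mamba, with a mixture-of-experts
# MLP on every other layer so that 4 of the 8 layers carry experts.
def jamba_block_layout(n_layers=8, moe_every=2):
    layers = []
    for i in range(n_layers):
        kind = "attention" if i == n_layers - 1 else "mamba"
        mlp = "moe" if i % moe_every == 1 else "dense"
        layers.append(f"{kind}+{mlp}")
    return layers

print(jamba_block_layout())
# ['mamba+dense', 'mamba+moe', 'mamba+dense', 'mamba+moe',
#  'mamba+dense', 'mamba+moe', 'mamba+dense', 'attention+moe']
```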
Right, so we're mixing the best of both worlds, and through all of that research you figured out where, hopefully, the sweet spot is. There's always more research to be done, but you found a place with a good balance and you trained a large model.

Yeah, there were a lot of experiments, exactly as you said, to find that sweet spot, which turned out to be one-to-seven, and then we tried to scale it. When you want to scale something like that, it's not just, okay, let's concatenate a lot of these layers together and put it into training; there's extra work that has to be done just to make it scale. We did it in two phases. The first Jamba was released in March; that model has 52 billion total parameters, which we now call Jamba Mini, with something like 12 billion active parameters. That was the first step: taking it from a few billion parameters to something which is production grade. It's what we now call a small model; think about it, a few years ago 7 billion would have been a huge model, and now 52 billion total with mixture of experts and around 12 billion active parameters sounds very small.
Yeah, I think one thing generative AI has done is redefine small, medium, and large as terms and what they actually mean. You've talked about the experimentation you've done and the different sizes of models, but how on earth have you benchmarked it? How do you know? I mean, presumably it's more than just a vibe check where you prompt it and go, yeah, that looks good. How do you quantify its performance?

So we chose several academic benchmarks, where we wanted to make sure we covered different tasks, both extractive and abstractive, because Mamba by itself really excelled at abstractive tasks, but on extractive tasks, where you actually need to copy things from the input, not so much. So we had a combination of several of these benchmarks. You can also look at the training loss to see that the model converges, and once we got to our final candidate, we have a human evaluation team in-house, so we used them to determine that we were going in the right direction. That was the first experiment, training the 52 billion parameter model that was released in March, and that was the big release, the announcement of this architecture. Then we took it up a notch, to a model which is almost 400 billion total parameters, which is Jamba 1.5 Large. That's what we released, I think, one or two months ago, depending on when this comes out.
So the Jamba 1.5 Mini is a fine-tuned version of the one we released in March, and the Large is the same type of architecture, just with lots more of these Jamba blocks inside it.

And something which I understand is a bit new for AI21: the weights are publicly available.

Yes, that's one of the key things. We released the base model for Jamba Mini in March to see how the community would react to it, and the responses were amazing, because I think people understand that, yes, everybody is focused on Transformers, and there are a lot of improvements, a lot of tricks, a lot of people you can ask for help, essentially a big community around Transformers, but at some point it becomes saturated. There's a limit to the number of tricks you can do, and somewhere, someone has to say, well, maybe we need a new architecture for different types of tasks, or different use cases, or when we really need long context, which takes Transformers too much time. So we released it with open weights in March to see what people would do, and there were a lot of downloads and a lot of talk around it; people were excited about it. That's why, when we launched the new Jamba 1.5 series, we said we really want developers to continue to engage with it, we want to create a community here, because it's not something that's just ours. We want people to adopt it, to take it to the next level, to build something around it, to take the research and push it forward, because we do believe there are a lot of places where this architecture can be improved.
Sure. And if people want to get their hands on it, I guess one of the easiest ways to do that is Amazon Bedrock, right?

Yeah, totally. If you want to get your hands dirty, fine-tune, and download the models, go to Hugging Face. If it's more "I don't care, I just want to use a model", then Amazon Bedrock is totally the way to go. You can just go there; you have Large, you have Mini, whatever suits your use case.

And I think that's proven to be quite a successful model; I think developers really chime with the idea that you can actually get your hands on it, rip it apart, and put it wherever you want. So I'm assuming that for things like Ollama, where we've got these quantized small models, we're not going to see it there anytime soon. Would that be right, because the architecture is quite different?
So you actually can quantize it. We created a new quantization technique, which is publicly available on Hugging Face, that essentially takes our model from 16 bits to 8 bits and back with essentially no information loss. It relies on the fact that something like 90 or 95 percent of the weights are actually in the MLP layers, so we found a way to do this quantization on the fly, and it works really well. You can also, by the way, in Hugging Face, change it to quantize to 4-bit. I'm not quite sure about the other platforms; I think we're in contact with them, but they can also do it themselves; it's not like a 4-bit version is already sitting there. But it can be squeezed, it can be squeezed, totally. I'm not sure how it performs, I must say, I haven't seen the 4-bit version myself, but I'm excited to see it.
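For intuition, here is a generic symmetric int8 round trip (this is not AI21's technique, which they describe as an on-the-fly quantization of the MLP/expert weights; it just shows the kind of 16-bit to 8-bit compression being discussed):

```python
# Halve the memory of a weight matrix by storing int8 values plus one fp scale,
# then reconstruct an approximation when the weights are needed.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float16)
q, s = quantize_int8(w.astype(np.float32))
w_hat = dequantize(q, s)
print("bytes:", w.nbytes, "->", q.nbytes)    # 2x smaller
print("max abs error:", float(np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max()))
```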
Well, look, I think this is really exciting. I mean, AI21 Labs is a small, focused team, I think it's probably fair to describe it like that, and the weights are publicly available, so people can go and hack on it and, I guess, surprise you and show you what they've done with it as well.

Yeah, and Matt, I for one am really excited to see whatever the community does. Whatever anybody builds with it, I'm like, yay.

Absolutely. Well, look, thank you so much for spending time with me and going through all of this. There's a lot to take in here, and I think it's really exciting to see work being done that looks elsewhere, beyond Transformers, and tries to find the next path forward. So thank you so much for spending time with us.

Thank you so much for having me.

A huge thanks to Yuval and everybody from AI21 Labs for helping to make this video. Please give this video a thumbs up and subscribe to the AWS Developers channel, and maybe click on one of these videos around us.