0:02 The Transformer architecture has been at the center of generative AI for text generation for the last several years, but researchers have of course always been looking at what comes next: how can we break through the barriers of Transformers and get even more intelligence, even more performance, at a cost of compute that's achievable? Some researchers came across, or devised, the Mamba architecture. Mamba architectures were super interesting; they performed pretty well, but they weren't quite there. AI21 Labs saw this, combined the Mamba architecture with Transformers and some mixture of experts as well, and came up with a model they called Jamba. I wanted to find out a lot more about Jamba and Mamba, and why don't we talk about some mixture of experts as well.
0:53 I spoke to Yuval Belfer from AI21 Labs here at the AWS Generative AI Loft in San Francisco, and I started off by just asking the question:
1:03 What is Jamba? So Jamba is a novel architecture that interleaves layers of Transformer, Mamba, and mixture of experts in order to overcome the main problems of the Transformer architecture, which are speed and memory consumption.
1:18 Okay, I love this. So in the description of what it is, you've basically just given this whole big list of technologies. Some of them, I guess, most people have heard of, like Transformer architectures. Maybe we can work backwards: what's wrong with the Transformer architecture? That's what we've been using for a while, and a lot of big models are made from it. What do you see as the challenges there?
1:37 Yeah, so Transformers really transformed - pun not intended - the natural language processing industry, because they have such high quality. I think it was around 2018 that it really started and really picked up, and all the community, all the research labs, took this architecture and made small improvements here and there, and really the quality is unmatched. In the way it's built, in every layer - in every Transformer block - we have the attention block, which essentially has the connections between every token and every token in the sequence. That is something which is very, very expressive; it allows you to get really high quality outputs, but it comes with quadratic complexity: you have to keep this matrix, both in memory and at inference time.
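To make the quadratic point concrete, here is a minimal NumPy sketch (illustrative only, not AI21's code) of single-head self-attention: the score matrix has one entry for every pair of tokens, so doubling the context length quadruples its size and the work needed to fill it.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over n token embeddings (no masking, no batching).

    x: (n, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    The `scores` matrix is (n, n): every token attends to every token,
    which is where the quadratic memory and compute come from.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) -- quadratic in context length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v                               # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)           # (1024, 64); `scores` alone was 1024 x 1024
```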
2:30 So you're talking about context size here? So as the context gets bigger, as the model gets bigger, there's quadratic growth in the overall size of the model, and then I guess compute cost and latency and everything else. Is that what we're talking about?
2:42 Yeah, definitely. So with shorter context - like everything in complexity - with shorter inputs, shorter context, it doesn't really matter; it can be whatever function you want. When it's a short context, it doesn't really matter. But right now we're just at the beginning. Think about GPT-3: a 2K context window. Right now we have a 1 million context window, a 256K context window, and the standard, basic thing is a 32, 64, or 128K context window. When we're talking about contexts of this length, it's really meaningful, and there you really see the slow performance of Transformers. If we're talking just about time: training time is clearly quadratic, and inference time is also originally quadratic. A lot of work has been done to really improve that and make it linear time, but it does come with a cost: the cost of saving a KV cache, which essentially means you're paying with memory. So again, these are the problems - time and memory - that keep Transformers from being broadly used in production everywhere, any time you need something which is fast or has low memory consumption, which essentially translates to money.
4:02 Sure. So talk to me about the KV cache. We're not talking about the cache that sits outside of generation; we're talking about the cache internally, within the structure?
4:11 Yeah, so the KV cache is part of the attention mechanism: the K is the key, the V is the value. The KV cache is just the way to save the sequence that you already had - you're saving it in the cache, and then in the next feed-forward... maybe I'll even go back a bit. So how does it work? I know people hate hearing about it; it's the most basic thing to say, but you have a sequence and you have a feed-forward for every token: you're feeding every token into the model until the generation stops. And essentially you keep it in the KV cache, meaning that you're keeping all the activations - sorry, not the activations, all the key and value projections of the attention - in the cache, for all the sequence you've already computed. So the next time you do this feed-forward, you don't have to calculate it again; you can just take it from the cache. This is how you go from quadratic to linear at the inference step.
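Here is a toy sketch of the KV-cache idea just described, assuming a single head and no batching: the key/value projections of tokens you have already processed are kept around, so each new decoding step only has to project the newest token and attend over the stored keys. Names and shapes are illustrative.

```python
import numpy as np

class KVCache:
    """Toy single-head decoder step with a KV cache."""

    def __init__(self, d):
        rng = np.random.default_rng(0)
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        self.keys, self.values = [], []      # grows by one entry per generated token

    def step(self, x_new):
        """x_new: (d,) embedding of the newest token only."""
        q = x_new @ self.Wq
        # compute K and V just for the new token and append; older ones are reused, not recomputed
        self.keys.append(x_new @ self.Wk)
        self.values.append(x_new @ self.Wv)
        K = np.stack(self.keys)              # (t, d) -- this growing tensor is the memory cost
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])            # (t,) -- linear work per step
        w = np.exp(scores - scores.max()); w /= w.sum()  # softmax over cached keys
        return w @ V

cache = KVCache(d=64)
for x_new in np.random.default_rng(1).standard_normal((10, 64)):
    y = cache.step(x_new)                    # one new token per step; the cache grows by one pair
print(y.shape)                               # (64,)
```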
5:13 Yeah. So this is trying to get more performance out of the existing Transformer architecture - more speed? More speed, yes.
5:21 However, it does come with the price of the cache, which isn't free extra memory: if you have 80 gigabytes of memory, it comes out of those 80 gigabytes.
5:33 And if you're looking at models like a Mixtral or a 70B with a 128 or 256K context window, that's easily 32 or 40 gigabytes - I don't remember the exact numbers - but it's really one of the things that stops you from being able to use one GPU to serve this kind of thing.
5:58 Understood. So what do we do to solve these problems? Where does Mamba come into the conversation? Are we talking about mixture of experts? There are a few different things that people have done to try to solve some of these problems.
6:12 Yeah, so we can talk about it in two ways. I think I'll start with the easier one, the one that most people know, which is mixture of experts. This is something that comes to solve only the inference time, only the speed consideration. Here, think about the fact that you have a really, really big model - you have a lot of parameters - but every layer isn't just the Transformer block: in every layer you have, usually, eight experts. It has a really nice intuition: think about the fact that for every input you have some sort of a router, and then based on the type of the input you can say, well, this is a medical input, so it goes to a medical expert; this is a finance input, so it goes to a finance expert. Which is nice in theory, and there are models like this - it originated in something like this - but when we talk about neural networks, what you have inside is this type of router, but it's token level. It's not that you ask it a question and all of a sudden the finance expert answers it; it's token by token, token level, through the feed-forward in the network. That's essentially what happens: the network is built with the attention layer, then a router, then, say, eight experts, and it passes through only two of them. That's true both for training and for inference.
7:45 Okay, and if someone asks me how that works, my broad answer - and I wonder if you'll agree with this - is: we don't really know, but we know it does. We can see that it pares down the amount of compute and the number of parameters you have to go through each time, and somehow it works.
8:02 Yes. So the question of "how does it work? I don't really know, but it works" kind of describes machine learning and deep learning. With classical machine learning, maybe, when it's small, you can actually understand something, but with neural networks explainability is a big problem: you can't really understand what's happening inside. You can guess, you can probe things, but with language models that's not something a lot of work has been successful at. What you can see, first of all, is the results. What happens is that in every feed-forward you're using two of eight experts - those are the standard numbers - so you're literally using about a quarter of the parameters in the model, which translates into active parameters. So you can see that it has better performance speed-wise, and the nice thing here is that you can get a model which has 12 billion active parameters - so you get a fast model - but it's a very high quality model, because it actually has 52 billion parameters inside of it, so it has the expressiveness. It's really good in the sense that it can absorb a lot of information during training, but at inference time it only goes through a small part of the model, so it's very, very fast. You do have to store all of the model - all of it has to go into memory - so you don't solve the memory issue there; you only solve the speed part.
9:38 You can also see, by the way, when you're training the model or doing inference, which of the experts are activated. Part of the training process is to really make sure that they're balanced, because degradation is something that happens in a lot of these things. For sure - so you really don't want the model to always use the same two or three experts, because then you'd end up with a smaller model: it would look bigger, but it would actually be a smaller model. It's a smaller model where you still have to pay for the memory - essentially you don't get anything from it. So you can feed in prompts and see how many of the experts are activated, and you really want them to be balanced. It's part of the training process, and we do test it at inference time, to see that for different types of inputs you're using all of them in some way.
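A minimal sketch of the token-level routing described above: a learned router scores the experts for each token and only the top two are run, so only a fraction of the total parameters is active for any given token. This is generic mixture-of-experts code in NumPy, not AI21's implementation.

```python
import numpy as np

def moe_layer(x, router_W, experts, top_k=2):
    """Token-level mixture of experts.

    x: (n, d) token activations; router_W: (d, n_experts);
    experts: list of (W1, W2) MLP weight pairs, one per expert.
    Each token is routed to its top_k experts only.
    """
    logits = x @ router_W                            # (n, n_experts) router scores per token
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]         # the top_k experts for this token
        gate = np.exp(logits[i][top]); gate /= gate.sum()
        for g, e in zip(gate, top):
            W1, W2 = experts[e]
            out[i] += g * (np.maximum(token @ W1, 0.0) @ W2)   # weighted sum of expert MLPs
    return out

# eight experts, two active per token -- the "standard numbers" mentioned above
d, n_experts = 64, 8
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, 4 * d)) / np.sqrt(d),
            rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)) for _ in range(n_experts)]
router_W = rng.standard_normal((d, n_experts)) / np.sqrt(d)
print(moe_layer(rng.standard_normal((16, d)), router_W, experts).shape)   # (16, 64)
```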
10:26 So we're still talking about Transformer architectures at the moment, and I guess there have been a number of things - we talked about a couple of them there - to try to adapt and improve the efficiency, the performance, and the cost of running the models, all with their benefits and drawbacks. I remember when I was first talking about Transformer architectures - I put a course together about it, and we talked about it - and I distinctly remember saying recurrent neural networks are a thing of the past: that's the way we used to do it, now we're doing Transformers. But I've got a feeling you're about to tell me that recurrent neural networks are back. Is that right?
11:03 Yes. So I was one of that wave, saying, oh, recurrent neural networks - it was very difficult to work with them. It's not easy to understand what's happening, it's really not efficient to train them, and the explainability there was even worse than in other models. But it really seems that what's happening now with Mamba - it's actually funny, you can look at it in two different ways: you can either look at it as an evolution from RNNs to linear RNNs to Mamba, or you can look at it as a state space evolution, from state space models to selective state space models, which again is Mamba. And the point of all of these - which are the same things, the same principles, just stated in different ways - is that instead of looking at everything, all the history, or all the sequence, all the context that you have, at every step, you save it in some sort of state: something you can think of as a quantization or a representation - really, how you take everything you had so far and keep it in a way that will be meaningful for determining the next token. In that case, every time you do the feed-forward and you need to predict the next token, instead of looking at all the context from the back, you look at some sort of representation of all the context that happened. It's either the previous state or the history: it's called the hidden state when we're talking about RNNs, and the state when we're talking about SSMs. This is something that really emerged - it really reminds you of RNNs, but they actually took it from SSMs, the state space models, and they really improved on the work on state space models in order to build Mamba.
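A toy recurrence showing the state idea in code: instead of attending over the whole history, each step folds the new token into a fixed-size state, and the output is read from that state alone, so per-token cost and memory stay constant regardless of context length. Real SSM and Mamba parameterizations are considerably more involved; this is only a sketch.

```python
import numpy as np

def ssm_scan(xs, A, B, C):
    """Linear state-space recurrence over a sequence.

    xs: (n, d_in) inputs; A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    The only thing carried between steps is the fixed-size state `h`,
    so memory does not grow with the length of the context.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x          # fold the new token into the state
        ys.append(C @ h)           # read the output from the state alone
    return np.stack(ys)

d_in, d_state, d_out, n = 16, 32, 16, 1000
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.9                          # slowly decays older information
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_out, d_state)) * 0.1
print(ssm_scan(rng.standard_normal((n, d_in)), A, B, C).shape)   # (1000, 16)
```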
12:57 Yeah, okay - so we're not back to RNNs, it just looks a bit like RNNs; it sort of borrows from that, and we've got these state space models coming in again. So can you describe Mamba, then? We've reached the point where we're talking about Mamba, so what's the performance of Mamba like, and what are the problems of Mamba?
13:17 Yeah. So before I talk about the problems of Mamba, let's talk about the good things about Mamba - which, by the way, if you ask anyone who's in the business - and I'm sure you'd say the same thing; you literally just said the same thing to me - anybody who's ten years or so into the machine learning business, the first thing when they hear about Mamba, everybody gets excited and says: it's just an RNN, a fancy RNN, that's really all it is. And yes, it's the same concept, and that's the amazing thing the Mamba creators did. I won't go into it too much; I'll just say that they took state space models, which are very, very efficient - there's a lot there that you can calculate in advance, so it's kind of like using a CNN, a convolutional neural network, to calculate those things, so it was very efficient - and they introduced something called a selective state space model, where essentially the representation is not uniform; it's not equal for every token. If you think about the phrase "I want to eat a hamburger" and you want to predict the next word, not all the words carry the same weight: "want" isn't really giving us anything; "to", "a" - those are words that are not as meaningful to store when we're determining the state. So they did the selective part, which you can think of as giving a different weight to every token, by importance, and that's something that really improves the performance. The problem is that now the matrices they need to calculate are no longer constant.
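A sketch of the selective twist, under the simplifying assumption that selectivity is a single input-dependent gate: each token decides how strongly it gets written into the state, so uninformative words barely touch it. Mamba's actual input-dependent parameters (step size, B and C matrices) are richer than this, and that input dependence is exactly why the matrices are no longer constant.

```python
import numpy as np

def selective_scan(xs, A, B, C, w_gate):
    """Selective state update: an input-dependent gate per token.

    w_gate: (d_in,) maps each token to a scalar in (0, 1) controlling how strongly
    that token is written into the state -- a crude stand-in for Mamba's
    input-dependent parameters.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        gate = 1.0 / (1.0 + np.exp(-(w_gate @ x)))   # "importance" of this token
        h = A @ h + gate * (B @ x)                   # unimportant tokens barely change the state
        ys.append(C @ h)
    return np.stack(ys)
```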
15:04 Sure - they sound a lot like attention weights, then, those things.
15:11 I think that, in the idea, it kind of is. And that's where everything really connects: all those principles of, okay, we really need it to be fast, we really want something where we can do fast inference and increase performance, but we're still lacking in terms of quality compared with Transformers. That's what the creators of Mamba had to show, because plotting graphs showing that they're faster than Transformers isn't hard - it's simply something that has less in it: there's no KV cache, there's no KV there, and all the handling of the context is linear, up to constant time. What they needed to show is that they're equivalent in quality. So the selective part really helped them improve the quality, and they did have to do a lot of optimization - hardware optimization, something deep in the core - and another algorithm to calculate all those things. But that was the main premise: look, we really managed to improve quality. If you read their papers, you'll see that they show experiments where they're as good as Transformers on several tasks, and much faster. And that's really the premise, because usually you either improve quality or you improve performance; it's something you usually can't do at the same time. So they really showed that they elevated the state space model - essentially elevated RNNs, it's really kind of the same thing in the end - they elevated it, made it better, and made something that competes with Transformers. I will say that in their work they got up to a few billion parameters - I think 7 billion is where they took it - so it's nice in theory, but it still needed more to show that you can scale it to production.
17:17 Okay, so where do you go from there? How do you build on top of Mamba? Because I guess that's what Jamba is, right?
17:27 Yeah. So when we wanted to release our new line of models, we thought about how to make it best for production, best for developers: how can we take a model which is very, very expressive, which is very high quality, but that you can also fit onto a single A100 GPU? That was one of the requirements from the beginning. When we first saw Mamba - it was published in December 2023, so it's really new - we started to experiment with it a little bit, and there was a lot of talk, I remember, about maybe just scaling pure Mamba: just take this architecture and make it bigger - which is not an easy thing to do by itself - but still, do a pure Mamba model. It turns out that even though on several tasks it does work really well, or comparably with Transformers, it is lacking in a lot of elements, and I think the place where you can see it the most is tasks which require looking at specific tokens. There's a paper called "Repeat After Me" about Transformers being better than Mamba at copying tasks, where you actually have to copy parts from the input. Or, even easier to think about, few-shot. There's a very basic and well-known dataset, IMDB reviews, for sentiment analysis, where you want a binary output - it's a classification task, positive or negative; those are the actual labels. If you give it to a Transformer, it will just do it. But if you give it to Mamba - and this is one of the experiments that really alerted us to this fact - it will say something like "bad".
19:11 All right - so positive or negative, and it says "bad". So it sort of gets the idea of what you're trying to do, but it gets the wrong actual output, which obviously could be significant, I guess, because a lot of us are very used to in-context learning and all the things that come from that - so RAG and everything else comes into play - and sometimes we want the model to be specific about the actual information we've just given it. That's really important to us, so I guess that's a problem.
19:39 Yeah - and really, if you want something that developers will actually use, then exactly like you said, output stability is important, post-processing is important. Something that semantically has the same meaning - that's nice, but it's not something you can actually build with. That was the time when we started to really play with the idea of combining those things. One of the nice things about Mamba is that, because of this architecture, it's much more efficient to train than pure Transformers. So our team started to play around with interleaving different types of layers, and essentially they created what we now call Jamba blocks, which are interleaved layers of Mamba and Transformers - and of course they added the mixture of experts as well over there, but that's less interesting right now. So really it's a combination of Mamba and Transformer layers.
20:41 On one hand, you really want as many Mamba layers versus Transformer layers as possible, because you want it to be fast, but you do need some Transformer layers in order to get the same quality, or to take Mamba and elevate it to the places it just cannot reach by itself. So we did a lot of experiments at small scale - there's a lot about it in the white paper, which I really recommend, describing all of them - and in the end we came up with two different types of Jamba blocks: one of them had one Transformer layer and three Mamba layers, and one had one Transformer layer and seven. So one-to-three and one-to-seven. And really, all these numbers came from the fact that what we wanted was to be able to take a model - the model we ended up with had 52 billion total parameters - and serve this type of model on one A100 GPU with as much context as possible. So in the choice between one-to-three and one-to-seven, clearly one-to-seven is much more efficient in terms of latency and memory, but it has the same performance, so we opted to go with that. And this is how our Jamba block looks: there's one Transformer layer and seven Mamba layers, four of which have mixture of experts, if I've got that right.
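A schematic of the layer ordering just described: eight layers per block, one attention layer plus seven Mamba layers, with mixture of experts on four of them. The exact positions below are illustrative placeholders; the real implementation and the precise layout are in the Jamba white paper and the released code on Hugging Face.

```python
def build_jamba_block():
    """Return a schematic (mixer, ffn) description of one Jamba block, as described above."""
    layers = [("attention", "mlp")]                # one Transformer (attention) layer
    for i in range(7):                             # seven Mamba layers -- the 1:7 ratio
        ffn = "moe" if i % 2 == 0 else "mlp"       # four of them use mixture of experts
        layers.append(("mamba", ffn))
    return layers

for idx, (mixer, ffn) in enumerate(build_jamba_block()):
    print(f"layer {idx}: {mixer:9s} + {ffn}")
```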
22:15 So we're mixing the best of both worlds, but through all of that research you kind of figured out where, hopefully, the sweet spot is. There's always more research to be done, but you found a place where there's a good balance, and you trained a large model.
22:30 Yeah. So there were a lot of, exactly like you said, experiments to find this sweet spot, which showed us the one-to-seven, and then we tried to scale it. When you want to scale something like that, it's not just "okay, let's concatenate all those layers together one after another and put it into training" - there is some extra work that has to be done just to make it scale. And we did it in two phases. The first Jamba was released in March. That model has 52 billion parameters - we now call it Jamba Mini - 52 billion total parameters with something like 12 billion active parameters. That was the first step: take it from a few billion parameters to something which is production grade, something we can actually use - this is what we now call a small model. Think about it: a few years ago, 7 billion would have been a huge model, and now 52 billion with mixture of experts, with 12 billion active parameters, sounds very, very small.
23:43 Yeah, I think one thing that generative AI has done is redefine small, medium, and large as terms, and what they actually mean. You've talked about the experimentation you've done there, and you've got the different sizes of models. How on earth have you benchmarked it? How do you know? I mean, it's more than just a vibe check where you're prompting it and going "yeah, that looks good", presumably. How do you quantify its performance?
24:05 So we chose several academic benchmarks, where we wanted to make sure that we have benchmarks across different tasks, and also things that are both extractive and abstractive - because Mamba by itself really excelled at abstractive tasks, but at extractive tasks, where you actually need to copy things from the input, not so much. So we had a combination of several of these benchmarks. You can also look at the training loss, to see that the model actually converges. And once we got to our final candidate, we did that - we have a human evaluation team in-house, so we used them to really determine and see that we were going in the right direction. That was the first experiment; then came training the 52-billion-parameter model that was released in March, which was the big release, the announcement of this architecture. And then we took it up a notch, to a model which is almost 400 billion total parameters, which is Jamba 1.5 Large. That's what we released, I think, one or two months ago, depending on when this comes out.
25:23 So the Jamba 1.5 Mini is a fine-tuned version of the one we released in March, and Large is the same type of architecture, just bigger - it has lots and lots more of these Jamba blocks inside of it.
25:38 Sure. And so, something which is a bit new, I understand, for AI21: the weights are publicly available?
25:47 Yes, so that's one of the key things. We released the base model for Jamba Mini in March to really see how the community would react to it, and the responses were amazing, because I think people understand that with Transformers - yes, everybody's focused on Transformers, and there are a lot of improvements, a lot of tricks, a lot of people you can ask; there's essentially a big community around Transformers - but at some point it becomes saturated. There's a limit to the number of tricks you can do, and at some point someone has to say: well, maybe we need a new architecture for different types of tasks, or different types of use cases, or for when we really need long context - something that takes Transformers too much time. So we released it with open weights in March to really see what people would do, and you saw that there were a lot of downloads, a lot of talk around it; people were excited about it. That's why, when we launched the new Jamba 1.5 series, we said: we really want developers to continue to engage with this, we really want to create a community here, because it's not something that is just ours. We really want people to adopt it, we want people to take it and run with it, to take it to the next level, to build something around it, to really take the research and push it forward, because we do believe there are a lot of places where this architecture can be improved.
27:22 Sure. And so, if people want to get their hands on it, then I guess one of the easiest ways to do that is Amazon Bedrock, right?
27:30 Yeah, totally. If you want to get your hands dirty and try to fine-tune, or download the models, go to Hugging Face. If it's more "well, I don't care, I just want to use the model", then Amazon Bedrock is totally the way to go. You can just go there - you have Large, you have Mini, whatever you prefer, whatever fits your use case.
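Here is a minimal sketch of calling Jamba through the Bedrock Converse API with boto3. The model ID shown is an assumption for illustration; check the Bedrock model catalog for the exact identifier available in your account and region.

```python
import boto3

# Assumes AWS credentials are configured and Jamba model access is enabled in Bedrock.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="ai21.jamba-1-5-mini-v1:0",    # assumed ID -- verify it in the Bedrock console
    messages=[{"role": "user", "content": [{"text": "In two sentences, what is a Jamba block?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.4},
)
print(response["output"]["message"]["content"][0]["text"])
```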
27:53 No, and I think that's really proven to be quite a successful model; I think developers really chime with that - the idea that you can actually get your hands on it, you can rip it apart, you can go and put it wherever you want. So I'm assuming that for things like Ollama, where we've got these quantized small models, we're not going to see it there anytime soon - would that be right, because the architecture is quite different?
28:19 Right - so you can actually quantize it. We created a new quantization technique, which is publicly available on Hugging Face, which essentially takes our model from 16 bits to 8 bits and back with no real information loss. Oh, wow. Yeah - it kind of relies on the fact that something like 90 or 95% of the weights are actually in the MLP layers, so we found a way to really do this quantization on the fly, and it works really well.
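A toy round-trip showing the general 16-bit-to-8-bit idea for a single weight matrix: store int8 values plus a per-row scale, and dequantize on the fly at compute time. This illustrates the technique in general terms; it is not the specific quantization scheme AI21 released.

```python
import numpy as np

def quantize_int8(W):
    """Per-row symmetric int8 quantization of a weight matrix."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0      # one scale per output row
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    """Reconstruct an approximate float matrix on the fly at compute time."""
    return W_q.astype(np.float32) * scale

# stand-in for one MLP weight matrix
W = (np.random.default_rng(0).standard_normal((4096, 4096)) * 0.02).astype(np.float32)
W_q, scale = quantize_int8(W)
print("max abs error:", float(np.abs(W - dequantize(W_q, scale)).max()))   # small relative to the weights
print("bytes:", W.nbytes, "float32 ->", W_q.nbytes, "int8")
```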
28:53 You can also, by the way, change it in Hugging Face to quantize it to 4-bit, so you can do that. I'm not quite sure, by the way, about the other platforms - I think we're in contact with them, but they can do it themselves; it's not like the 4-bit version is already there. It can be squeezed. It can be squeezed, totally. I'm not sure how it performs, I must say - I haven't seen the 4-bit version myself - but I'm excited to see it.
29:21 Well look, I think this is really exciting. I mean, AI21 Labs is a small, focused team - I think it's probably fair to describe it like that - and the weights are publicly available, so people can go and hack on it, and surprise you, I guess, and show you what they've done with it as well.
29:37 Yeah - and Matt, I, for one, am really excited to see whatever the community will do. Whatever anybody is doing, I'm like: yay!
29:43 Yes, absolutely. Well look, thank you so much for spending time with me and going through all of this. There's a lot to take in here, and I think it's really exciting to see things being done that are looking elsewhere, other than Transformers, and trying to find the next path forward. So thank you so much for spending time with us. Thank you so much for having me.
30:03 A huge thanks to Yuval and everybody from AI21 Labs for helping to make this video. Please give this video a thumbs up and subscribe to the AWS Developers channel as well, and maybe click on one of these videos around us.