0:00 have
0:04 you reflected a lot on how to select
0:07 talent or has that mostly been like
0:09 intuitive to you Ilya just shows up and
0:11 you're like this is a clever guy let's
0:13 work together or have you thought
0:15 a lot about that can we are we recording
0:18 should we roll this yeah let's
0:20 roll this okay we're good yeah yeah
0:32 so I remember when I first got to Carnegie
0:34 Mellon from England in England at a
0:36 research unit it would get to be 6:00
0:39 and you'd all go for a drink in the pub
0:41 um at Carnegie Mellon I remember after I'd
0:43 been there a few weeks it was Saturday
0:46 night I didn't have any friends yet and
0:47 I didn't know what to do so I decided
0:48 I'd go into the lab and do some
0:50 programming because I had a Lisp machine
0:52 and you couldn't program it from home so
0:53 I went into the lab at about 9:00 on a
0:57 Saturday night and it was swarming all
0:59 the students were there and they were
1:01 all there because what they were working
1:03 on was the future they all believed that
1:05 what they did next was going to change
1:07 the course of computer science and it
1:09 was just so different from England and
1:12 so that was very refreshing take me back
1:16 to the very beginning Geoff at Cambridge
1:18 uh trying to understand the brain uh
1:21 what was that like it was very
1:24 disappointing so I did physiology and in
1:25 the summer term they were going to teach
1:27 us how the brain worked and all they
1:30 taught us was how neurons conduct action
1:32 potentials which is very interesting but
1:34 it doesn't tell you how the brain works
1:36 so that was extremely disappointing I
1:38 switched to philosophy then I thought
1:40 maybe they'd tell us how the mind worked
1:42 um that was very disappointing I
1:43 eventually ended up going to Edinburgh
1:46 to do AI and that was more interesting
1:48 at least you could simulate things so
1:50 you could test out theories and do you
1:53 remember what intrigued you about AI was
1:56 it a paper was it any particular person
1:59 that exposed you to those ideas I guess
2:01 it was a book I read by Donald Hebb that
2:05 influenced me a lot um he was very
2:07 interested in how you learn the
2:09 connection strengths in neural nets I
2:11 also read a book by John von Neumann
2:15 early on um who was very interested in
2:16 how the brain computes and how it's
2:19 different from normal computers and did
2:22 you get that conviction that these ideas
2:25 would work out at that point or what
2:27 was your intuition back in the
2:31 Edinburgh days it seemed to me there has
2:33 to be a way that the brain
2:36 learns and it's clearly not by having
2:39 all sorts of things programmed into it
2:40 and then using logical rules of
2:42 inference that just seemed to me crazy
2:46 from the outset um so we had to figure
2:49 out how the brain learned to modify
2:50 connections in a neural net so that it
2:53 could do complicated things and von
2:55 Neumann believed that Turing believed
2:57 that so von Neumann and Turing were both
2:58 pretty good at logic but they didn't
3:01 believe in this logical approach and
3:03 what was your split between studying the
3:05 ideas from
3:08 neuroscience and just doing what seemed
3:11 to be good algorithms for AI how
3:13 much inspiration did you take early on
3:15 so I never did that much study of
3:17 neuroscience I was always inspired by
3:19 what I'd learned about how the brain
3:21 works that there's a bunch of neurons
3:23 they perform relatively simple
3:26 operations they're nonlinear um but they
3:29 collect inputs they weight them and then
3:31 they give an output that depends on that
3:33 weighted input and the question is how
3:34 do you change those weights to make the
3:36 whole thing do something good it seems
3:38 like a fairly simple question what
3:41 collaborations do you remember from
3:43 that time the main collaboration I had
3:45 at Carnegie Mellon was with someone who
3:47 wasn't at Carnegie Mellon I was
3:48 interacting a lot with Terry Sejnowski
3:51 who was in Baltimore at Johns Hopkins
3:53 and about once a month either he would
3:54 drive to Pittsburgh or I'd drive to
3:57 Baltimore it's 250 miles away and we
3:58 would spend a weekend together working
4:00 on Boltzmann machines that was a
4:01 wonderful collaboration we were both
4:03 convinced it was how the brain worked
4:05 that was the most exciting research I've
4:07 ever done and a lot of technical results
4:09 came out that were very interesting but
4:11 I think it's not how the brain works um
4:13 I also had a very good collaboration
4:17 with um Peter Brown who was a very good
4:19 statistician and he worked on speech
4:22 recognition at IBM and then he came as a
4:24 more mature student to Carnegie Mellon just
4:27 to get a PhD um but he already knew a
4:30 lot he taught me a lot about speech
4:31 and he in fact taught me about hidden
4:33 Markov models I think I learned more from
4:35 him than he learned from me that's the
4:38 kind of student you want and when he taught
4:41 me about hidden Markov models I was
4:43 doing backprop with hidden layers
4:44 only they weren't called hidden layers
4:47 then and I decided that the name they use in
4:49 hidden Markov models is a great name for
4:50 variables that you don't know what
4:54 they're up to um and so that's where the
4:57 name hidden in neural nets came from me and
4:59 Peter decided that was a great name for the
5:03 hidden layers in neural nets um but
5:05 I learned a lot from Peter about speech
5:08 take us back to when Ilya showed up at
5:11 your office I was in my office I
5:14 probably on a Sunday um and I was
5:16 programming I think and there was a
5:17 knock on the door not just any knock but
5:19 it was
5:21 sort of an urgent knock so I
5:23 went and answered the door and there was
5:25 this young student there and he said he
5:27 was cooking fries over the summer but
5:29 he'd rather be working in my lab and so
5:30 I said well why don't you make an
5:32 appointment and we'll talk and so Ilya
5:35 said how about now and that sort of was
5:38 Ilya's character so we talked for a bit
5:40 and I gave him a paper to read which was
5:42 the Nature paper on back
5:45 propagation and we made another meeting
5:47 for a week later and he came back and he
5:49 said I didn't understand it and I was
5:50 very disappointed I thought he seemed
5:52 like a bright guy but it's only the
5:54 chain rule it's not that hard to
5:56 understand and he said oh no no I
5:58 understood that I just don't understand
6:00 why you don't give the gradient to a
6:02 sensible function
6:04 optimizer which took us quite a few
6:07 years to think about um and it kept on
6:09 like that his
6:11 raw intuitions about things were always
6:14 very good what do you think had enabled
6:17 those intuitions for Ilya I
6:19 don't know I think he always thought for
6:21 himself he was always interested in AI
6:24 from a young age um he's obviously good
6:27 at math so but it's very hard to know
6:29 and what was that collaboration between
6:32 the two of you like what part
6:34 would you play and what part would Ilya
6:37 play it was a lot of fun um I remember
6:41 one occasion when we were trying to do a
6:43 complicated thing with producing maps of
6:46 data where I had a kind of mixture model
6:47 so you could take the same bunch of
6:50 similarities and make two maps so that
6:52 in one map bank could be close to greed
6:54 and in another map bank could be close
6:57 to river um cuz in one map you can't
6:59 have it close to both right cuz river
7:01 and greed are a long way apart so we'd have a
7:05 mixture of maps and we were doing it in
7:06 MATLAB and this involved a lot of
7:08 reorganization of the code to do the
7:10 right matrix multiplies and Ilya got fed
7:12 up with that so he came one day and said
7:15 um I'm going to write an interface for
7:17 MATLAB so I program in this different
7:19 language and then I have something that
7:21 just converts it into MATLAB and I said
7:24 no Ilya um that'll take you a month to
7:26 do we've got to get on with this project
7:28 don't get diverted by that and Ilya said
7:34 it's okay I did it this morning and that's quite
7:37 incredible and throughout those
7:40 years the biggest shift wasn't
7:42 necessarily just the algorithms but
7:45 also the scale how did you sort
7:49 of view that scale uh over the
7:51 years Ilya got that intuition very early
7:55 so Ilya was always preaching that um you
7:56 just make it bigger and it'll work
7:58 better and I always thought that was a
7:59 bit of a cop-out you're going to have to
8:02 have new ideas too it turns out Ilya was
8:04 basically right new ideas help things
8:06 like transformers helped a lot but it
8:09 was really the scale of the data and the
8:11 scale of the computation and back then
8:13 we had no idea computers would get like
8:15 a billion times faster we thought maybe
8:17 they'd get 100 times faster we were
8:19 trying to do things by coming up with
8:21 clever ideas that would have just solved
8:22 themselves if we had had bigger scale of
8:25 the data and computation in about
8:28 2011 Ilya and another graduate student
8:30 called James Martens and I
8:32 had a paper using character-level
8:35 prediction so we took Wikipedia and we
8:39 tried to predict the next HTML character
8:41 and that worked remarkably well and we
8:43 were always amazed at how well it worked
8:47 and that was using a fancy optimizer on
8:50 GPUs and we could never quite believe
8:52 that it understood anything but it
8:53 looked as though it
8:55 understood and that just seemed
8:58 incredible can you take us through how
9:01 these models are trained to predict the
9:06 next word and why is it the wrong way of
9:08 thinking about them okay I don't
9:12 actually believe it is the wrong way so
9:13 in fact I think I made the first
9:15 neural net language model that used
9:18 embeddings and backpropagation so it's
9:19 very simple data just
9:23 triples and it was turning each symbol
9:25 into an embedding then having the
9:27 embeddings interact to predict the
9:29 embedding of the next symbol and from
9:31 that predict the next symbol and then it
9:32 was backpropagating through that whole
9:35 process to learn these triples and I
9:38 showed it could generalize um about 10
9:40 years later Yoshua Bengio used a very
9:41 similar network and showed it worked with
9:44 real text and about 10 years after that
9:46 linguists started believing in embeddings
9:49 it was a slow process the reason I think
9:52 it's not just predicting the next symbol
9:54 is if you ask well what does it take to
9:56 predict the next symbol particularly if
9:59 you ask me a question and then the first
10:03 word of the answer is the next symbol um
10:06 you have to understand the question so I
10:08 think by predicting the next
10:11 symbol it's very unlike old-fashioned
10:13 autocomplete with old-fashioned autocomplete
10:16 you'd store sort of triples of words and
10:18 then if you saw a pair of words you'd see
10:20 how often different words came third and
10:22 that way you can predict the next symbol
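The old-fashioned autocomplete described here, a table of word triples consulted by the previous two words, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `build_trigram_table` and `predict_next` and the toy corpus are made up for the example and are not from the interview.

```python
from collections import Counter, defaultdict


def build_trigram_table(words):
    """Map each pair of consecutive words to counts of the word that came third."""
    table = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        table[(first, second)][third] += 1
    return table


def predict_next(table, first, second):
    """Return the most frequent third word for a pair, or None if the pair was never stored."""
    followers = table.get((first, second))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (hypothetical); a real system would count over a large text collection.
corpus = "the cat sat on the mat and the cat sat on the sofa and the cat ran away".split()
table = build_trigram_table(corpus)

print(predict_next(table, "the", "cat"))   # "the cat" was followed by "sat" twice and "ran" once
print(predict_next(table, "cat", "flew"))  # unseen pair: the lookup table has nothing to offer
```

The second call shows the limitation being contrasted with neural language models: a stored-triples table cannot say anything about a pair of words it has never seen.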
10:23 and that's what most people think
10:26 autocomplete is like it's no longer at all
10:28 like that um to predict the next symbol
10:30 you have to understand what's been said
10:31 so I think you're forcing it to
10:33 understand by making it predict the next
10:36 symbol and I think it's understanding in
10:38 much the same way we are so a lot of
10:40 people will tell you these things aren't
10:42 like us um they're just predicting the
10:44 next symbol they're not reasoning like
10:47 us but actually in order to predict the
10:48 next symbol it's going to have to
10:50 do some reasoning and we've seen now
10:52 that if you make big ones without
10:53 putting in any special stuff to do
10:55 reasoning they can already do some
10:57 reasoning and I think as you make them
10:58 bigger they're going to be able to do
11:00 more and more reasoning do you think I'm
11:01 doing anything else than predicting the
11:04 next symbol right now I think that's how
11:06 you're learning I think you're
11:08 predicting the next video frame um
11:11 you're predicting the next sound um but
11:13 I think that's a pretty plausible theory
11:16 of how the brain's learning what enables
11:19 these models to learn such a wide
11:21 variety of of fields what these big
11:23 language models are doing is they're
11:25 looking for common structure and by
11:27 finding common structure they can encode
11:29 things using the common structure and
11:31 that's more efficient so let me give you
11:33 an example if you ask
11:36 GPT-4 why is a compost heap like an atom
11:39 bomb most people can't answer that most
11:41 people haven't thought they think atom
11:42 bombs and compost heaps are very
11:44 different things but GPT-4 will tell you
11:46 well the energy scales are very
11:48 different and the time scales are very
11:51 different but the thing that's the same
11:52 is that when the compost heap gets
11:55 hotter it generates heat faster and when
11:57 the atom bomb produces more neutrons
12:00 it produces more neutrons faster
12:02 and so it gets the idea of a chain
12:04 reaction and I believe it's understood
12:06 they're both forms of chain reaction
12:08 it's using that understanding to
12:09 compress all that information into its
12:13 weights and if it's doing that then it's
12:15 going to be doing that for hundreds of
12:16 things where we haven't seen the
12:18 analogies yet but it has and that's
12:20 where you get creativity from from
12:21 seeing these analogies between
12:23 apparently very different things and so
12:25 I think GPT-4 is going to end up when it
12:27 gets bigger being very creative I think
12:29 this idea that it's just
12:31 regurgitating what it's learned just
12:33 pastiching together text it's learned
12:35 already that's completely wrong it's
12:37 going to be even more creative than
12:40 people I think you'd argue that it won't
12:43 just repeat the human knowledge we've
12:46 developed so far but could also progress
12:48 beyond that I think that's something we
12:51 haven't quite seen yet we've started
12:53 seeing some examples of it but to a
12:56 large extent we're sort of still at the
12:58 current level of science what do
13:00 you think will enable it to go beyond
13:01 that well we've seen that in more
13:04 limited contexts like if you take
13:08 AlphaGo in that famous competition with Lee Sedol
13:11 um there was move 37 where AlphaGo made
13:13 a move that all the experts said must
13:15 have been a mistake but actually later
13:18 they realized it was a brilliant move um
13:20 so that was creative within that limited
13:22 domain um I think we'll see a lot more
13:25 of that as these things get bigger the
13:28 difference with AlphaGo as well was that
13:31 it was using reinforcement learning that
13:33 subsequently sort of enabled it to
13:35 go beyond the current state so it
13:37 started with imitation learning watching
13:39 how humans play the game and then it
13:42 would through self-play develop way
13:43 beyond that do you think that's the
13:46 missing component of the I think that
13:48 may well be a missing component yes that
13:51 the self-play in AlphaGo
13:54 and AlphaZero is a large part of
13:56 why it could make these creative moves
13:58 but I don't think it's entirely
13:59 necessary
14:01 so there's a little experiment I did a
14:03 long time ago where you're training a
14:06 neural net to recognize handwritten digits
14:09 I love that example the MNIST example and
14:11 you give it training data where half the
14:12 answers are
14:15 wrong um and the question is how well
14:17 will it
14:20 learn and you make half the answers
14:23 wrong once and keep them like that so it
14:25 can't average away the wrongness by just
14:27 seeing the same example but with the
14:28 right answer sometimes and the wrong
14:29 answer sometimes when it sees that
14:32 example for half of the examples when
14:33 it sees the example the answer is always
14:37 wrong and so the training data has 50%
14:40 error but if you train it up with back
14:44 propagation it gets down to 5% error or
14:49 less in other words from badly labeled data
14:51 it can get much better results it can
14:54 see that the training data is wrong and
14:55 that's how smart students can be smarter
14:57 than their advisor and their advisor
14:59 tells them all this stuff
15:01 and for half of what their advisor tells
15:03 them they think no rubbish and they
15:05 listen to the other half and then they
15:06 end up smarter than the advisor so these
15:09 big neural nets can actually do they can
15:11 do much better than their training data
15:13 and most people don't realize that
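The label-noise experiment described above can be imitated on a small scale. The following is a minimal sketch, not the original experiment: it assumes synthetic ten-class Gaussian clusters instead of MNIST and a plain softmax classifier trained by gradient descent instead of a multi-layer net, and every function name and constant is chosen for illustration. About half the training labels are replaced once with a wrong class and then kept fixed, as in the description.

```python
import numpy as np


def make_clusters(rng, n_per_class, n_classes=10, spread=6.0):
    """One Gaussian blob per class, centered on a scaled one-hot vector (synthetic stand-in for digits)."""
    centers = spread * np.eye(n_classes)
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = centers[y] + rng.standard_normal((len(y), n_classes))
    return X, y


def corrupt_labels(rng, y, n_classes=10, frac=0.5):
    """Replace about `frac` of the labels with a wrong class, once, and keep them that way."""
    noisy = y.copy()
    flip = rng.random(len(y)) < frac
    offsets = rng.integers(1, n_classes, size=int(flip.sum()))  # offset is never 0, so always wrong
    noisy[flip] = (y[flip] + offsets) % n_classes
    return noisy


def train_softmax(X, y, n_classes=10, lr=0.1, epochs=200):
    """Full-batch gradient descent on softmax cross-entropy."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    targets = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - targets) / len(X)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def run_experiment(seed=0):
    rng = np.random.default_rng(seed)
    X_train, y_clean = make_clusters(rng, n_per_class=500)
    y_noisy = corrupt_labels(rng, y_clean)
    X_test, y_test = make_clusters(rng, n_per_class=200)  # test labels are clean
    W, b = train_softmax(X_train, y_noisy)
    label_error = float(np.mean(y_noisy != y_clean))
    test_error = float(np.mean((X_test @ W + b).argmax(axis=1) != y_test))
    return label_error, test_error


label_error, test_error = run_experiment()
print(f"wrong training labels: {label_error:.1%}, test error: {test_error:.1%}")
```

The reason this can work is that the wrong labels are spread over nine other classes, so for any region of input space the correct class is still the single most common label, and the classifier learns that majority.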
15:16 how do you expect these models to add
15:19 reasoning into them so I mean one
15:20 approach is you add sort of the
15:23 heuristics on top of them which a lot
15:25 of the research is doing now where you
15:26 have sort of chain of thought you just
15:29 feed back its reasoning um into
15:32 itself and another way would be in the
15:34 model itself as you scale it
15:38 up what's your intuition around that so
15:40 my intuition is that as we scale up
15:42 these models they get better at reasoning
15:44 and if you ask how people work roughly
15:47 speaking we have these
15:50 intuitions and we can do reasoning and
15:52 we use the reasoning to correct our
15:54 intuitions of course we use the
15:55 intuitions during the reasoning to do
15:57 the reasoning but if the conclusion of
15:58 the reasoning conflicts with our
16:00 intuitions we realize the intuitions need to
16:03 be changed that's much like in AlphaGo
16:06 or AlphaZero where you have an
16:09 evaluation function um that just looks
16:10 at a board and says how good is that for
16:13 me but then you do the Monte Carlo
16:17 rollout and now you get a more accurate idea
16:18 and you can revise your evaluation
16:20 function so you can train it by getting
16:22 it to agree with the results of
16:23 reasoning and I think these large
16:26 language models have to start doing that
16:28 they have to start training their raw
16:30 intuitions about what should come next
16:32 by doing reasoning and realizing that's
16:35 not right and so that way they can get
16:37 more training data than just mimicking
16:40 what people did and that's exactly why
16:43 AlphaGo could do this creative move 37 it
16:44 had much more training data because it
16:47 was using reasoning to check out what
16:49 the right next move should have been
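The idea of revising a quick evaluation function so that it agrees with the result of a Monte Carlo rollout can be shown on a toy game. This sketch is an assumption-laden illustration and not how AlphaGo works: the game is a random walk on positions 0 to 6, the "evaluation function" is a lookup table rather than a neural net, and there is no tree search. The names `rollout` and `train_value_table` are hypothetical.

```python
import random

N = 6  # positions 0..N on a line; reaching N is a win, reaching 0 is a loss


def rollout(state, rng):
    """Play random moves until the game ends; return 1.0 for a win, 0.0 for a loss."""
    while 0 < state < N:
        state += rng.choice((-1, 1))
    return 1.0 if state == N else 0.0


def train_value_table(n_rollouts=2000, seed=0):
    rng = random.Random(seed)
    # Initial "intuition": every position looks like an even chance.
    value = {s: 0.5 for s in range(1, N)}
    for s in value:
        for n in range(1, n_rollouts + 1):
            outcome = rollout(s, rng)
            # Move the quick evaluation toward what the slower rollout found.
            # With step size 1/n this is a running average of rollout outcomes.
            value[s] += (outcome - value[s]) / n
    return value


value = train_value_table()
for s in sorted(value):
    print(s, round(value[s], 2), "exact:", round(s / N, 2))
```

For this particular walk the true chance of winning from position s is s/N, so the printout shows the table drifting from the uniform 0.5 guess toward the values the rollouts reveal. The rollouts act as extra training data that did not come from imitating anyone.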
16:52 what do you think about multimodality so
16:54 we spoke about these analogies and often
16:56 the analogies are way beyond what we
16:59 could see it's discovering analogies that
17:01 are far beyond humans and at maybe
17:03 abstraction levels that we'll never be
17:06 able to understand now when we
17:09 introduce images to that and video
17:11 and sound how do you think that will
17:14 change the models and uh how do you
17:16 think it will change the analogies that
17:19 it will be able to make um I think it'll
17:21 change it a lot I think it'll make it
17:23 much better at understanding spatial
17:26 things for example from language alone
17:27 it's quite hard to understand some
17:30 spatial things although remarkably GPT-4
17:32 can do that even before it was
17:35 multimodal um but when you make it
17:38 multimodal if you have it both doing
17:40 vision and reaching out and grabbing
17:42 things it'll understand objects much
17:44 better if it can pick them up and turn
17:47 them over and so on so although you can
17:50 learn an awful lot from language it's
17:53 easier to learn if you're multimodal and in
17:55 fact you then need less language and
17:57 there's an awful lot of YouTube video
17:59 for predicting the next frame or
18:01 something like that so I think these
18:03 multimodal models are clearly going to
18:06 take over um you can get more data that
18:08 way they need less language so there's
18:10 really a philosophical point that you
18:12 could learn a very good model from
18:14 language alone but it's much easier to
18:16 learn it from a multimodal system and
18:18 how do you think it will impact the
18:21 model's reasoning I think it'll make it
18:22 much better at reasoning about space for
18:24 example reasoning about what happens if
18:26 you pick objects up if you actually try
18:27 picking objects up you're going to get
18:29 all sorts of training data that's going
18:32 to help do you think the human brain
18:35 evolved to work well with language
18:37 or do you think language evolved to work
18:40 well with the human brain I think the
18:41 question of whether language evolved to
18:43 work with the brain or the brain evolved
18:44 to work with language I think that's a
18:48 very good question I think both happened
18:50 I used to think we would do a lot of
18:52 cognition without needing language at
18:57 all um now I've changed my mind a bit so
18:59 let me give you three different views of
19:01 language um and how it relates to
19:03 cognition there's the old-fashioned
19:05 symbolic view which is cognition
19:10 consists of having strings of symbols in
19:12 some kind of cleaned up logical language
19:14 where there's no ambiguity and applying
19:15 rules of inference and that's what
19:17 cognition is it's just these symbolic
19:19 manipulations on things that are like
19:22 strings of language symbols um so that's
19:24 one extreme view an opposite extreme
19:27 view is no no once you get inside the
19:30 head it's all vectors so symbols come in
19:32 you convert those symbols into big
19:34 vectors and all the stuff inside's done
19:36 with big vectors and then if you want to
19:38 produce output you produce symbols again
19:40 so there was a point in machine
19:42 translation in about
19:44 2014 when people were using
19:46 recurrent neural nets and words would
19:48 keep coming in and they'd have a hidden
19:50 state and they'd keep accumulating
19:52 information in this hidden state so when
19:55 they got to the end of a sentence they'd
19:56 have a big hidden vector that captures
19:59 the meaning of that sentence that could
20:00 then be used for producing the sentence
20:02 in another language that was called a
20:04 thought vector and that's a sort of
20:05 second view of language you convert the
20:08 language into a big vector that's
20:10 nothing like language and that's what
20:12 cognition is all about but then there's
20:15 a third view which is what I believe now
20:20 which is that you take these
20:23 symbols and you convert the symbols into
20:25 embeddings and you use multiple layers
20:26 of that so you get these very rich
20:28 embeddings but the embeddings are still tied
20:30 to the symbols in the sense that you've
20:31 got a big vector for this symbol and a
20:34 big vector for that symbol and these
20:36 vectors interact to produce the vector
20:39 for the symbol for the next word and
20:40 that's what understanding is
20:42 understanding is knowing how to convert
20:44 the symbols into these vectors and
20:45 knowing how the elements of the vector
20:47 should interact to predict the vector
20:49 for the next symbol that's what
20:50 understanding is both in these big
20:52 language models and in our
20:55 brains and that's an example which is
20:57 sort of in between you're staying with
21:00 the symbols but you're interpreting them
21:02 as these big vectors and that's where
21:04 all the work is and all the knowledge is
21:06 in what vectors you use and how the
21:08 elements of those vectors interact not
21:09 in symbolic
21:13 rules um but it's not saying that you
21:14 get away from the symbols altogether
21:16 it's saying you turn the symbols into
21:18 big vectors but you stay with that
21:20 surface structure of the symbols and
21:22 that's how these models are working and
21:24 that seems to me to be a more plausible
21:26 model of human thought too you were one
21:30 of the first folks to get the idea of using
21:34 GPUs and I know Jensen loves you for
21:36 that uh back in 2009 you mentioned that
21:38 you told Jensen that this could be a
21:41 quite good idea um for training
21:43 neural nets take us back to
21:46 that early intuition of using GPUs
21:48 for training neural nets so actually
21:50 I think in about
21:53 2006 I had a former graduate student
21:55 called Rick Szeliski who's a very good
21:58 computer vision guy and I talked to him
22:00 at a meeting and he said you know you
22:02 ought to think about using graphics
22:03 processing cards because they're very
22:05 good at matrix multiplies and what
22:07 you're doing is basically all matrix
22:09 multiplies so I thought about that for a
22:11 bit and then we learned about these
22:16 Tesla systems that had um four GPUs in
22:21 and initially we just got um gaming GPUs
22:22 and discovered they made things go 30
22:24 times faster and then we bought one of
22:27 these Tesla systems with four GPUs and we
22:30 did speech on that and it worked very
22:34 well then in 2009 I gave a talk at NIPS
22:36 and I told a thousand machine learning
22:37 researchers you should all go and buy
22:39 Nvidia GPUs they're the future you need
22:42 them for doing machine learning and I
22:45 actually um then sent mail to Nvidia
22:46 saying I told a thousand machine
22:48 learning researchers to buy your boards
22:49 could you give me a free one and they
22:51 said no actually they didn't say no they
22:54 just didn't reply um but when I told
22:55 Jensen this story later on he gave me a
22:57 free
23:00 one that's uh that's very good I
23:02 think what's interesting as well
23:05 is sort of how GPUs have evolved
23:07 alongside the field so where
23:10 do you think we should go next in
23:13 the compute so my last couple
23:15 of years at Google I was thinking about
23:17 ways of trying to make analog
23:19 computation so that instead of using
23:21 like a megawatt we could use like 30
23:23 Watts like the brain and we could run
23:26 these big language models in analog
23:29 hardware and I never made it
23:32 work but I started really
23:36 appreciating digital computation so if
23:38 you're going to use that low-power
23:39 analog
23:41 computation every piece of hardware is
23:43 going to be a bit different and the idea
23:45 is the learning is going to make use of
23:47 the specific properties of that hardware
23:49 and that's what happens with people all
23:52 our brains are different um so we can't
23:54 then take the weights in your brain and
23:56 put them in my brain the hardware is
23:58 different the precise properties of the
23:59 individual neurons are different the
24:01 learning has learned to
24:04 make use of all that and so we're mortal
24:05 in the sense that the weights in my
24:07 brain are no good for any other brain
24:10 when I die those weights are useless um
24:12 we can get information from one to
24:13 another rather
24:16 inefficiently I produce sentences and
24:18 you figure out how to change your weights
24:20 so you would have said the same thing
24:22 that's called distillation but that's a
24:24 very inefficient way of communicating
24:27 knowledge and with digital systems
24:29 they're immortal because once you've got
24:31 some weights you can throw away the
24:32 computer just store the weights on a
24:34 tape somewhere and now build another
24:36 computer put those same weights in and
24:39 if it's digital it can compute exactly
24:41 the same thing as the other system did
24:45 so digital systems can share weights and
24:48 that's incredibly much more efficient if
24:50 you've got a whole bunch of digital
24:51 systems and they each go and do a tiny
24:52 bit of
24:54 learning and they start with the same
24:56 weights they do a tiny bit of learning
24:58 and then they share their weights again
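The weight-sharing loop described here (identical copies start from the same weights, each learns a little on its own data, then they pool what they learned) can be sketched as follows. This is a simplified illustration under assumptions not stated in the interview: the model is a small linear regression, pooling is done by averaging the copies' weights after every step, and all names and constants are invented for the example.

```python
import numpy as np


def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ≈ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.1 * rng.standard_normal(400)

lr = 0.05
n_copies = 4

# Four identical digital copies, each seeing only its own quarter of the data.
shared = np.zeros(5)
for _ in range(200):
    local = []
    for X_part, y_part in zip(np.split(X, n_copies), np.split(y, n_copies)):
        # Every copy starts this round from the same shared weights.
        local.append(shared - lr * gradient(shared, X_part, y_part))
    # Sharing step: average the copies' weights so each one knows what the others learned.
    shared = np.mean(local, axis=0)

# For comparison, a single learner that saw all of the data itself.
single = np.zeros(5)
for _ in range(200):
    single -= lr * gradient(single, X, y)

print(np.allclose(shared, single))
```

With equal-sized shards and one step per round, averaging the copies' weights is algebraically the same as one learner taking a step on all the data, which is why the comparison at the end is meaningful. It only works because the copies are exact replicas that can use each other's weights directly.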
24:59 um they all know what all the others
25:03 learned we can't do that and so they're
25:04 far superior to us in being able to
25:07 share knowledge a lot of the ideas that
25:10 have been deployed in the field are very
25:13 old-school ideas uh it's the ideas that
25:15 have been around in neuroscience
25:17 forever what do you think is sort of
25:19 left to apply to the systems that
25:23 we develop so one big thing that we
25:26 still have to catch up with neuroscience
25:31 on is the time scales for changes so in
25:34 nearly all the neural nets there's a
25:35 fast time scale for changing activities
25:38 so input comes in the activities the
25:40 embedding vectors all change and then
25:41 there's a slow time scale which is
25:43 changing the weights and that's
25:45 long-term learning and you just have
25:48 those two time scales in the brain
25:49 there's many time scales at which
25:53 weights change so for example if I say
25:56 an unexpected word like cucumber and now
25:58 5 minutes later you put headphones on
26:00 there's a lot of noise and there's very
26:03 faint words you'll be much better at
26:05 recognizing the word cucumber because I
26:08 said it 5 minutes ago so where is that
26:10 knowledge in the brain and that
26:12 knowledge is obviously in temporary
26:14 changes to synapses it's not that neurons are
26:16 going cucumber cucumber cucumber you
26:18 don't have enough neurons for that it's
26:21 in temporary changes to the weights and
26:22 you can do a lot of things with
26:24 temporary weight changes fast what I
26:26 call fast weights we don't do that in
26:28 these neural models and the reason we
26:31 don't do it is because if you have
26:33 temporary changes to the weights that
26:37 depend on the input data then you can't
26:38 process a whole bunch of different cases
26:41 at the same time at present we take a
26:43 whole bunch of different strings we
26:45 stack them together and we
26:47 process them all in parallel because
26:48 then we can do matrix-matrix multiplies
26:51 which is much more efficient and just
26:53 that efficiency is stopping us using
26:56 fast weights but the brain clearly uses
26:59 fast weights for temporary memory
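The batching problem described above can be made concrete with a small sketch. It assumes one common formulation of fast weights, a decaying sum of outer products of recent activity vectors; the interview does not specify a rule, and the decay and rate constants, the sizes, and the function name `fast_weight_step` are arbitrary choices for illustration.

```python
import numpy as np


def fast_weight_step(A, h, decay=0.9, rate=0.5):
    """Decay the temporary weights and add a Hebbian outer product of the current activity."""
    return decay * A + rate * np.outer(h, h)


rng = np.random.default_rng(0)
d, steps, batch = 4, 3, 2
slow_W = rng.standard_normal((d, d))             # slow weights: one matrix shared by every sequence
inputs = rng.standard_normal((batch, steps, d))  # two different input sequences

# Fast weights depend on what each sequence has seen, so every sequence needs its own matrix.
fast_A = np.zeros((batch, d, d))
for b in range(batch):
    for t in range(steps):
        fast_A[b] = fast_weight_step(fast_A[b], inputs[b, t])

query = rng.standard_normal((batch, d))
slow_out = query @ slow_W.T                        # one matrix multiply covers the whole batch
fast_out = np.einsum("bij,bj->bi", fast_A, query)  # a separate matrix-vector product per case

# Check that the batched form matches looping over cases one at a time.
looped = np.stack([fast_A[b] @ query[b] for b in range(batch)])
print(np.allclose(fast_out, looped), np.allclose(fast_A[0], fast_A[1]))
```

The slow weights let the whole batch go through a single matrix multiply, while the fast weights turn it into one small multiply per sequence. That loss of shared structure is the efficiency cost being referred to.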
27:00 there's all sorts of things you can do
27:02 that way that we don't do at present I
27:03 think that's one of the biggest things
27:04 we have to learn I was very hopeful that
27:08 things like Graphcore um if they went
27:11 sequential and did just online learning
27:13 then they could use fast weights
27:16 um but that hasn't worked out yet I
27:18 think it'll work out eventually when
27:19 people are using conductances for
27:23 weights how has knowing how these models
27:26 work and knowing how the brain works
27:29 impacted the way you think I think
27:33 there's been one big impact which is at
27:35 a fairly abstract level which is that
27:37 for many
27:40 years people were very scornful about
27:42 the idea of having a big random neural
27:44 net and just giving a lot of training
27:46 data and it would learn to do
27:47 complicated things if you talk to
27:50 statisticians or linguists or most
27:53 people in AI they say that's just a pipe
27:54 dream there's no way you're going to
27:56 learn to do really complicated things
27:57 without some kind of innate knowledge
27:59 without a lot of architectural
28:00 restrictions it turns out that's
28:03 completely wrong you can take a big
28:04 random neural network and you can learn
28:08 a whole bunch of stuff just from data um
28:10 so the idea that stochastic gradient
28:13 descent to repeatedly adjust
28:16 the weights using a gradient that will
28:17 learn things and will learn big
28:21 complicated things that's been validated
28:23 by these big models and that's a very
28:25 important thing to know about the brain
28:27 it doesn't have to have all this innate
28:28 structure now obviously it's got a lot
28:32 of innate structure but it certainly
28:33 doesn't need innate structure for things
28:35 that are easily
28:37 learned and so the sort of idea coming
28:39 from Chomsky that you won't
28:41 learn anything complicated like language
28:43 unless it's all kind of wired in already
28:46 and just matures that idea is now
28:49 clearly nonsense I'm sure Chomsky would
28:51 appreciate you calling his ideas
28:54 nonsense well I think actually I think a
28:56 lot of Chomsky's political ideas are very
28:59 sensible and I'm always struck by how
29:00 come someone with such sensible ideas
29:02 about the Middle East could be so wrong about
29:05 Linguistics what do you think would make
29:09 these models simulate consciousness of
29:12 humans more effectively but imagine
29:14 you had the AI assistant that you've
29:16 spoken to in your entire life and
29:19 instead of that being you know like chat
29:21 today that sort of deletes the memory of
29:23 the conversation and you start fresh all
29:26 of the time okay it had
29:28 self-reflection at some point you
29:32 pass away and you tell that to the
29:35 assistant do you think I mean not me
29:38 somebody else tells that to the assistant yeah you
29:39 would it would be difficult for you to
29:42 tell that to the assistant um do you
29:44 think that assistant would feel at
29:46 that point yes I think they can have
29:49 feelings too so I think just as we have
29:51 this inner theater model for perception
29:53 we have an inner theater model for feelings
29:55 they're things that I can experience but
29:59 other people can't um
30:02 I think that model is equally wrong so I
30:04 think suppose I say I feel like punching
30:07 Gary on the nose which I often do let's
30:10 try and Abstract that away from the idea
30:12 of an inner theater what I'm really
30:16 saying to you is um if it weren't for
30:17 the inhibition coming from my frontal
30:20 lobes I would perform an action so when
30:22 we talk about feelings we're really talking
30:25 about um actions we would perform if it
30:29 weren't for um constraints and that
30:31 really that's really what feelings are
30:32 the actions we would do if it weren't for
30:36 constraints um so I think you can give
30:37 the same kind of explanation for
30:39 feelings and there's no reason why these
30:42 things can't have feelings in fact in
30:46 1973 I saw a robot having an emotion so
30:49 in Edinburgh they had a robot with two
30:51 grippers like this that could assemble a
30:54 toy car if you put the pieces separately
30:58 on a piece of green felt um but if you
31:01 put them in a pile its vision wasn't
31:02 good enough to figure out what was going
31:05 on so it put its gripper whack and it
31:06 knocked them so they were scattered and
31:08 then it could put them together if you
31:10 saw that in a person you'd say it was
31:11 cross with the situation because it
31:13 didn't understand it so it destroyed
31:16 it that's
31:19 profound you uh we spoke previously you
31:22 described sort of humans and
31:24 the LLMs as analogy machines what do you
31:27 think have been the most powerful
31:30 analogies that you found throughout your
31:36 life oh throughout my life um I
31:40 guess probably a sort of weak analogy
31:45 that's influenced me a lot is um the
31:48 analogy between religious belief and
31:50 belief in symbol
31:52 processing so when I was very young I
31:54 was confronted I came from an atheist
31:56 family and went to school and was
31:58 confronted with religious belief and it
32:00 just seemed nonsense to me it still
32:03 seems nonsense to me um and when I saw
32:04 symbol processing as an explanation of how
32:08 people worked um I thought it was just
32:10 the same
32:12 nonsense I don't think it's quite so
32:15 much nonsense now because I think
32:17 actually we do do symbol processing it's
32:19 just we do it by giving these big
32:21 embedding vectors to the symbols but we
32:24 are actually symbol processing um but
32:25 not at all in the way people thought
32:27 where you match symbols and the only
32:29 thing a symbol has is it's identical to
32:31 another symbol or it's not identical
32:33 that's the only property a symbol has we
32:35 don't do that at all we use the context
32:37 to give embedding vectors to symbols and
32:39 then use the interactions between the
32:41 components of these embedding vectors to
32:44 do thinking but there's a very good
32:46 researcher at Google called Fernando
32:50 Pereira who said yes we do have symbolic
32:52 reasoning and the only symbolic we have
32:54 is natural language natural language is
32:55 a symbolic language and we reason with
32:58 it and I believe that now you've done
33:00 some of the most meaningful uh research
33:03 in the history of of computer science
33:05 can you walk us through like how do you
33:08 select the right problems to to work on
33:11 well first let me correct you me and my
33:12 students have done a lot of the most
33:15 meaningful things and it's mainly been a
33:17 very good collaboration with students
33:19 and my ability to select very good
33:21 students and that came from the fact
33:23 that there were very few people doing neural
33:25 nets in the 70s and 80s and 90s and
33:28 2000s and so the few people doing neural
33:30 nets got to pick the very best students
33:33 so that was a piece of luck but my way
33:35 of selecting problems is
33:37 basically well you know when scientists
33:40 talk about how they work they have
33:41 theories about how they work which
33:42 probably don't have much to do with the
33:45 truth but my theory is that
33:48 I look for something where everybody's
33:50 agreed about something and it feels
33:52 wrong just there's a slight intuition
33:54 there's something wrong about it and
33:56 then I work on that and see if I can
33:58 elaborate why it is I think it's wrong
34:00 and maybe I can make a little demo with
34:04 a small computer program that shows that
34:06 it doesn't work the way you might expect
34:09 so let me take one example um most
34:11 people think that if you add noise to a
34:14 neural net it's going to work worse um if
34:16 for example each time you put a training
34:19 example through
34:22 you make half of the neurons be silent
34:26 it'll work worse actually we know it'll
34:28 generalize better if you do that
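Silencing half the neurons on each training example is the technique known as dropout. A minimal sketch of just the masking step follows, assuming the common "inverted dropout" convention of rescaling the surviving units (that convention, the function name dropout, and the 1000-unit layer are assumptions for illustration, not details given in the interview).

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Silence each unit with probability p_drop during training."""
    if not training:
        return list(activations)  # at test time every unit is used
    keep = 1.0 - p_drop
    # Scale survivors by 1/keep so the expected activation is unchanged
    # ("inverted dropout"; a common convention, assumed here).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000             # hypothetical hidden-layer outputs
noisy = dropout(hidden, rng=rng)  # a fresh random mask on every call
silent = sum(1 for a in noisy if a == 0.0)
```

Each call draws a new mask, so roughly half the units are silent on any given training example, while passing training=False leaves the activations untouched.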
34:32 and you can demonstrate that um in a
34:34 simple example that's what's nice about
34:36 computer simulation you can show you
34:38 know this idea you had that adding noise
34:39 is going to make it worse and sort of
34:41 dropping out half the neurons will make
34:42 it work worse which it will in the
34:45 short term but if you train it like
34:47 that in the end it'll work better you
34:48 can demonstrate that with a small
34:49 computer program and then you can think
34:53 hard about why that is and how it stops
34:56 big elaborate co-adaptations um but
34:58 I think that's my method of
35:00 working find something that sounds
35:03 suspicious and work on it and see if you
35:05 can give a simple demonstration of why
35:07 it's wrong what sounds suspicious to you
35:10 now well that we don't use fast weights
35:12 sounds suspicious that we only have
35:14 these two time scales that's just wrong
35:17 that's not at all like the brain um and
35:18 in the long run I think we're going to
35:20 have to have many more time scales so
35:23 that's an example there and
35:25 if you had your group of students
35:26 today and they came to you and they said
35:27 so the Hamming question that we talked
35:29 about previously you know what's the
35:31 most important problem in your
35:33 field what would you suggest that they
35:36 take on and work on next we spoke
35:38 about reasoning time scales what would
35:40 be sort of the highest priority Problem
35:43 that that you'd give them for me right
35:45 now it's the same question I've had for
35:48 the last like 30 years or so which is
35:51 does the brain do back propagation I
35:52 believe the brain is getting gradients
35:54 if you don't get gradients your learning
35:56 is just much worse than if you do get
35:58 gradients but how is the brain getting
36:01 gradients and is it
36:03 somehow implementing some approximate
36:04 version of back propagation or is it
36:06 some completely different technique
36:09 that's a big open question and if I kept
36:11 on doing research that's what I would be
36:13 doing research on and when you look back
36:16 at your career now you've been right
36:18 about so many things but what were you
36:20 wrong about that you wish you sort of
36:23 spent less time pursuing a certain
36:25 direction okay those are two separate
36:26 questions one is what were you wrong
36:28 about and two do you wish you'd
36:31 spent less time on it I think I was
36:33 wrong about Boltzmann machines and I'm glad
36:35 I spent a long time on it they're a much
36:37 more beautiful theory of how you get
36:39 gradients than back propagation back
36:40 propagation is just ordinary and
36:42 sensible and it's just the chain rule Boltzmann
36:44 machines is clever and it's a very
36:47 interesting way to get gradients and I
36:49 would love for that to be how the brain
36:52 works but I think it isn't did you spend
36:54 much time imagining what would happen
36:57 post the systems developing as well
36:59 did you have an idea that okay if we
37:00 could make these systems work really
37:02 well we could you know democratize
37:04 education we could make knowledge way
37:07 more accessible um we could solve some
37:10 tough problems in medicine or was
37:13 it more to you about understanding the
37:17 brain yes I sort of feel scientists
37:18 ought to be doing things that are going
37:22 to help Society but actually that's not
37:23 how you do your best research you do
37:25 your best research when it's driven by
37:28 curiosity you just have to understand
37:32 something um much more recently I've
37:33 realized these things could do a lot of
37:35 harm as well as a lot of good and I've
37:37 become much more concerned about the
37:39 effects they're going to have on society
37:41 but that's not what was motivating me I
37:42 just wanted to understand how on Earth
37:45 can the brain learn to do things that's
37:47 what I want to know and I sort of failed
37:49 as a side effect of that failure we got
37:51 some nice engineering
37:54 but yeah it was a good failure
37:56 for the world if you take the lens of
37:59 the things that could go really right
38:01 what do you think are the most promising
38:05 applications I think Health Care is
38:09 clearly uh a big one um with Health Care
38:12 there's almost no end to how much Health
38:14 Care Society can absorb if you take
38:18 someone old they could use five doctors
38:21 full-time um so when AI gets better than
38:25 people at doing things um you'd like it
38:27 to get better in areas where you could
38:30 do with a lot more of that stuff and we
38:32 could do with a lot more doctors if
38:33 everybody had three doctors of their own
38:35 that would be great and we're going to
38:38 get to that point um so that's one
38:41 reason why Healthcare is good there's
38:44 also just a new engineering developing
38:46 new materials for example for better
38:49 solar panels or for superconductivity
38:52 or for just understanding how the Body
38:55 Works um there's going to be huge
38:57 impacts there those are all going to be
39:00 good things what I worry about is bad
39:02 actors using them for bad things we've
39:05 facilitated people like Putin or Xi or Trump
39:10 using AI for Killer Robots or for
39:12 manipulating public opinion or for Mass
39:14 surveillance and those are all very
39:17 worrying things are you ever concerned
39:20 that slowing down the field could also
39:23 slow down the positives oh absolutely
39:26 and I think there's not much chance that
39:29 the field will slow down partly because
39:31 it's International and if one country
39:32 slows down the other countries aren't
39:35 going to slow down so there's a race
39:37 clearly between China and the US and
39:39 neither is going to slow down so yeah I
39:41 don't I mean there was this petition
39:42 saying we should slow down for six
39:44 months I didn't sign it just because I
39:46 thought it was never going to happen I
39:47 maybe should have signed it because even
39:49 though it was never going to happen it
39:51 made a political point it's often good
39:53 to ask for things you know you can't get
39:55 just to make a point um but I didn't
39:57 think we're going to slow down and how
39:59 do you think that it will impact the AI
40:03 research process uh having uh these
40:04 assistants so I think it'll make it a
40:06 lot more efficient AI research will get a
40:08 lot more efficient when you've got these
40:11 assistants that help you program um but
40:12 also help you think through things and
40:14 probably help you a lot with equations
40:17 too have you reflected much on the
40:19 process of selecting Talent has that
40:22 been mostly intuitive to you like when
40:24 Ilia shows up at the door you feel this
40:27 is a smart guy let's work together so for
40:30 selecting Talent um sometimes you just
40:32 know so after talking to Ilia for not
40:35 very long he seemed very smart and then
40:36 talking to him a bit more he clearly was
40:38 very smart and had very good intuitions
40:41 as well as being good at math so that
40:43 was a no-brainer there's another case
40:47 where I was at a NIPS conference um we
40:50 had a poster and someone came up and
40:52 he started asking questions about the
40:54 poster and every question he asked was a
40:56 sort of deep insight into what we'd done
40:59 wrong um and after 5 minutes I offered
41:01 him a postdoc position that guy was David
41:04 MacKay who was just brilliant and it's
41:07 very sad he died but it was very
41:10 obvious you'd want him um other times
41:12 it's not so obvious and one thing I did
41:15 learn was that people are different
41:17 there's not just one type of good
41:21 student um so there's some students who
41:23 aren't that creative but are technically
41:26 extremely strong and will make anything
41:28 work there's other students who aren't
41:31 technically strong but are very creative
41:32 of course you want the ones who are both
41:34 but you don't always get that but I
41:36 think actually in the lab you need a
41:38 variety of different kinds of graduate
41:41 student but I still go with my gut
41:43 intuition that sometimes you talk to
41:45 somebody and they
41:48 just get it and those are the ones you
41:51 want what do you think is the reason for
41:54 some folks having better intuition do
41:56 they just have better training data than
42:00 others or how can you develop your
42:03 intuition I think it's partly they don't
42:06 stand for nonsense so here's a way to
42:08 get bad intuitions believe everything
42:12 you're told that's fatal you have to be
42:14 able to I think here's what some people
42:15 do they have a whole framework for
42:17 understanding reality and when someone
42:20 tells them something they try and sort
42:22 of figure out how that fits into their
42:24 framework and if it doesn't they just
42:28 reject it and that's a very good
42:30 strategy um people who try and
42:33 incorporate whatever they're told end up
42:35 with a framework that's sort of very
42:38 fuzzy and sort of can believe everything
42:41 and that's useless so I think actually
42:44 having a strong view of the world and
42:46 trying to manipulate incoming facts to
42:48 fit in with your view obviously it can
42:51 lead you into deep religious belief and
42:53 fatal flaws and so on like my belief in
42:56 Boltzmann machines um but I think that's
42:58 the way to go if you've got good intuitions
43:00 you can trust you should trust them if
43:03 you've got bad intuitions it doesn't matter
43:05 what you do so you might as well trust
43:09 them a very good point
43:12 when you look at the types of
43:15 research that's being done
43:17 today do you think we're putting all of
43:19 our eggs in one basket and we should
43:22 diversify our ideas a bit more in in the
43:24 field or do you think this is the most
43:26 promising Direction so let's go all in
43:28 on it
43:30 I think having big models and training
43:33 them on multimodal data even if it's
43:35 only to predict the next word is such a
43:37 promising approach that we should go
43:39 pretty much all in on it obviously
43:40 there's lots and lots of people doing it
43:42 now and there's lots of people doing
43:45 apparently crazy things and that's good
43:47 um but I think it's fine for like most
43:48 of the people to be following this path
43:50 because it's working very well do you
43:54 think that the learning algorithms
43:56 matter that much or is it just scale
43:59 are there basically millions of ways
44:01 that we could get to human
44:03 level intelligence or are there
44:05 sort of a select few that we need to
44:08 discover yes so this issue of whether
44:10 particular learning algorithms are very
44:12 important or whether there's a great
44:13 variety of learning algorithms that'll
44:16 do the job I don't know the answer it
44:19 seems to me though that back propagation
44:20 there's a sense in which it's the
44:23 correct thing to do getting the gradient
44:24 so that you change a parameter to make
44:27 it work better that seems like the right
44:30 thing to do and it's been amazingly
44:32 successful there may well be other
44:34 learning algorithms that are alternative
44:36 ways of getting that same gradient or
44:37 that are getting the gradient of
44:40 something else and that also work
44:43 um I think that's all open and a very
44:45 interesting issue now about whether
44:47 there's other things you can try and
44:50 maximize that will give you good systems
44:51 and maybe the brain's doing that because it's
44:55 easier but backprop is in a sense the
44:57 right thing to do and we know that doing
44:59 it works really
45:02 well and one last question when you
45:04 look back at your sort of decades of
45:05 research what are you most
45:07 proud of is it the students is it the
45:09 research what makes you most proud
45:11 when you look back at your life's
45:14 work the learning algorithm for
45:16 Boltzmann machines so the learning
45:17 algorithm for Boltzmann machines is
45:21 beautifully elegant it's maybe hopeless
45:25 in practice um but it's the thing I
45:27 enjoyed most developing that with Terry
45:31 and it's what I'm proudest of um even if it's
45:39 wrong what questions do you spend most
45:41 of your time thinking about now is it
45:43 the um what what should I watch on Netflix