0:04 We've been thinking of LLMs as having
0:05 learned from a lot of text on the
0:08 internet to predict the next word. But
0:10 when you prompt an LLM, it doesn't just
0:12 predict the next word on the internet; it
0:15 actually follows your instructions. So
0:17 how does it do that? In this optional
0:19 video, we'll talk about a technique
0:21 called instruction tuning that enables
0:24 LLMs to do that, and then also a technique
0:27 called RLHF, reinforcement learning from
0:29 human feedback, that has been
0:31 instrumental to making LLMs' outputs more
0:33 safe. Let's take a look at what these
0:36 techniques do. We've discussed LLMs as
0:39 having been pre-trained on a lot of text
0:41 like this: "My favorite food is a bagel with
0:43 cream cheese." And so an LLM trained on
0:45 data like this would be good at
0:47 repeatedly predicting the next words
0:49 based on what text on the internet
0:52 sounds like. If you were to prompt an LLM
0:53 with a question like "What is the capital
0:56 of France?", it is quite possible that it
0:59 will reply, "What is the capital of
1:01 Germany? Where is Mumbai? Is Mount Fuji or
1:04 Mount Kilimanjaro taller?", because you do
1:06 see lists of questions on the internet
1:10 about, say, geography. So if you see a web
1:12 page that says "What is the capital of
1:14 France?", it's actually quite plausible
1:16 that what comes after it is "What is the
1:17 capital of Germany?" But this isn't the
1:19 answer you want. You wanted it to say that
1:22 the capital of France is Paris. So in
1:25 order to get an LLM to follow
1:27 instructions and not just predict the
1:29 next word, there's a technique called
1:33 instruction tuning, which is basically to
1:36 take a pre-trained LLM and fine-tune it
1:39 on examples of good answers to questions
1:42 or good examples of the LLM following
1:43 your
1:46 instructions. So we may give it a
1:49 question-response pair like this: "What is
1:51 the capital of South Korea?", and fine-tune
1:55 it, given this input prompt, to output
1:58 "The capital of South Korea is Seoul." Or "Help
2:00 me brainstorm some fun museums to visit
2:04 in Bogotá," and fine-tune it to an answer like
2:07 this. Or an instruction like "Write a haiku
2:10 poem about Japan's cherry blossoms," and
2:13 fine-tune it to generate that. And to try to
2:16 make this safer, we can also include some
2:18 examples like "Tell me how to break into
2:22 Fort Knox." Fort Knox is a very secure
2:24 facility in the United States that stores a
2:27 massive amount of the US Treasury's gold,
2:29 so trying to break into Fort Knox would be
2:31 a terrible idea. Please don't anyone try
2:34 to do that. But I think a good answer for
2:36 the output would be something like
2:38 "I can't assist with that" or "Please don't break the
2:41 law." So given a dataset like this, you
2:46 can then fine-tune a pre-trained LLM on a
2:50 set of good answers to different prompts.
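[Editor's sketch, not from the video] As a rough illustration of the fine-tuning data described here, the prompt/response pairs and the way one pair unrolls into next-word examples might look something like the Python below. The names `INSTRUCTION_DATA` and `to_next_word_examples` are hypothetical, the Bogotá response is cut off at the opening words quoted in the video, and splitting on whitespace is an assumption standing in for a real tokenizer.

```python
from typing import List, Tuple

# Prompt/response pairs based on the examples mentioned in the video.
# The second response is truncated to the words the video actually quotes.
INSTRUCTION_DATA: List[Tuple[str, str]] = [
    ("What is the capital of South Korea?",
     "The capital of South Korea is Seoul."),
    ("Help me brainstorm some fun museums to visit in Bogotá.",
     "Sure, here are some suggestions"),
]


def to_next_word_examples(prompt: str, response: str) -> List[Tuple[str, str]]:
    """Unroll one prompt/response pair into (input A, output B) examples.

    Assumption: whitespace-separated words stand in for tokens; real LLMs
    typically operate on subword tokens instead.
    """
    examples = []
    context = prompt
    for word in response.split():
        # Input A is everything seen so far; output B is the next word.
        examples.append((context, word))
        context = context + " " + word
    return examples


if __name__ == "__main__":
    for a, b in to_next_word_examples(*INSTRUCTION_DATA[1]):
        print(repr(a), "->", repr(b))
```

Under these assumptions, the first example pairs the bare prompt with the word `Sure,`, and each later example extends the input by one more word of the response. Fine-tuning on many such examples is what nudges the model toward answering rather than merely continuing the prompt.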
2:52 Specifically, given the example about
2:55 brainstorming museums in Bogotá, we would
2:58 turn that into a set of inputs A and
3:00 outputs B, where first the input A
3:04 will be that prompt, and the first word
3:06 it should learn to predict here is "Sure,"
3:10 and the second word, given "Sure," is "here," then "are
3:13 some suggestions," and so on. And when you
3:17 fine-tune an LLM on a dataset of
3:20 prompts and good responses, then the LLM
3:23 will learn to not just predict the next
3:25 word on the internet but to answer your
3:26 questions and to follow your
3:30 instructions. And this will do okay, but
3:31 it turns out that there's a technique
3:34 called reinforcement learning from human
3:36 feedback, or RLHF, that can improve the
3:39 quality of answers further. Many
3:43 companies training LLMs want the LLM to
3:45 give results that are helpful, honest, and
3:47 harmless. Sometimes we call this the
3:51 triple H, and the technique RLHF is a way
3:53 to try to accomplish that. The first step
3:58 of RLHF is to train an answer quality
4:00 model. In other words, we'll use supervised
4:03 learning to learn to rate the answers of
4:07 an LLM. For example, given a prompt like
4:10 "Advise me on how to apply for a job," we
4:12 might have an LLM generate multiple
4:15 responses, such as "I'm happy to help.
4:16 Here are some steps to follow," and then
4:19 have a bunch of useful steps after that.
4:21 Or it might say, "Just try your best," which
4:23 is not that helpful but not that
4:26 terrible. Or it might say, "It's hopeless. Why
4:28 bother?" So that's clearly not a great
4:31 response. And we would then get
4:33 humans to help rate these responses
4:36 according to how helpful, honest, and
4:38 harmless the LLM's output is, so that
4:40 better answers are given higher scores,
4:42 where the first really helpful answer
4:45 might get a score of five, the second,
4:47 so-so answer might get an intermediate
4:49 score, and the final answer, which is
4:52 terrible, would get a very low score. And
4:55 we treat the responses and scores as the
4:56 input A and the output B for a
4:59 supervised learning algorithm. Then we
5:02 can train an AI model using supervised
5:04 learning to take as input the response
5:08 from an LLM and score it according to how
5:10 good the response is. The second step of
5:15 this RLHF process is to then have the LLM
5:17 continue to generate a lot of answers to
5:20 a lot of different prompts, and we now
5:22 have this AI model to automatically
5:24 score every single response that the LLM
5:27 has generated. And this can be used to
5:31 tune the LLM to generate more responses
5:34 that get higher scores. The reason this
5:35 technique is called reinforcement
5:38 learning from human feedback is because
5:40 the scores correspond to the
5:42 reinforcement, or the reward, that we're
5:44 giving to the LLM for generating different
5:47 answers. And by having the LLM learn to
5:50 generate answers that merit higher
5:52 scores, or higher rewards, or higher
5:55 reinforcements, the LLM automatically learns
5:57 to generate responses that are more
6:00 helpful, honest, and harmless. So that's
6:03 how an LLM learns to follow instructions.
6:06 The first step is basically fine-tuning,
6:08 where you fine-tune it to follow
6:10 instructions and to answer questions, and
6:13 then second is RLHF, reinforcement
6:15 learning from human feedback, to further
6:18 train it to generate better answers. In
6:21 the final optional video, we'll also
6:24 take a look at some cutting-edge ideas
6:27 in the technology development of LLMs.
6:28 Thanks for sticking with me in this
6:31 video, and hope to see you also in the