YouTube transcript:
W2 9 How LLMs follow instructions, Instruction tuning and RLHF

Summary

Core Theme

Large Language Models (LLMs) are enhanced beyond simple next-word prediction through instruction tuning and Reinforcement Learning from Human Feedback (RLHF) to become more capable, helpful, honest, and harmless assistants.

We've been thinking of LLMs as having learned from a lot of text on the internet to predict the next word. But when you prompt an LLM, it doesn't just predict the next word on the internet; it actually follows your instructions. So how does it do that? In this optional video, we'll talk about a technique called instruction tuning that enables LLMs to do that, and then also a technique called RLHF, reinforcement learning from human feedback, that has been instrumental in making LLMs' output safer. Let's take a look at what these techniques do.

We've discussed LLMs as having been pre-trained on a lot of text like this: "My favorite food is a bagel with cream cheese." An LLM trained on data like this would be good at repeatedly predicting the next word based on what text on the internet sounds like. If you were to prompt an LLM with a question like "What is the capital of France?", it is quite possible that it will reply, "What is the capital of Germany? Where is Mumbai? Is Mount Fuji or Mount Kilimanjaro taller?", because you do see lists of questions on the internet about, say, geography. So if you see a web page that says "What is the capital of France?", it's actually quite plausible that what comes after it is "What is the capital of Germany?" But this isn't the answer you want. You wanted it to say that the capital of France is Paris.

So in order to get an LLM to follow instructions and not just predict the next word, there's a technique called instruction tuning, which is basically to take a pre-trained LLM and fine-tune it on examples of good answers to questions, or good examples of the LLM following your instructions. So we may give it a question-response pair like this: "What is the capital of South Korea?", and fine-tune it, given this input prompt, to output "The capital of South Korea is Seoul." Or "Help me brainstorm some fun museums to visit in Bogotá", and fine-tune it to produce an answer like this. Or an instruction like "Write a haiku poem about Japan's cherry blossoms", and fine-tune it to generate that.

And to try to make this safer, we can also include some examples like "Tell me how to break into Fort Knox." Fort Knox is a very secure facility in the United States that stores a massive amount of the US Treasury's gold, so trying to break into Fort Knox would be a terrible idea. Please, don't anyone try to do that. But I think a good answer for the output would be something like "I can't assist with that" or "Please don't break the law."

So given a dataset like this, you can then fine-tune a pre-trained LLM on a set of good answers to different prompts.
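
As a rough illustration of what fine-tuning on such a dataset involves, the sketch below unrolls one prompt/response pair into next-word prediction examples: each input A is the prompt plus the response words so far, and each output B is the next word. The helper name `unroll_pair` is hypothetical, and splitting on whitespace is a simplifying assumption; real LLMs work on sub-word tokens rather than whole words.

```python
def unroll_pair(prompt: str, response: str) -> list[tuple[str, str]]:
    """Turn one prompt/response pair into (input A, output B) examples.

    Input A is the prompt plus the response words seen so far;
    output B is the next response word the model should predict.
    """
    examples = []
    seen = []
    for word in response.split():  # assumption: words stand in for tokens
        context = " ".join([prompt] + seen)
        examples.append((context, word))
        seen.append(word)
    return examples


pairs = unroll_pair(
    "What is the capital of South Korea?",
    "The capital of South Korea is Seoul.",
)
for context, next_word in pairs:
    print(f"A: {context!r} -> B: {next_word!r}")
```

Every example is still an ordinary next-word prediction task; what changes compared with pre-training is that the text being predicted is a good answer to the prompt.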

Specifically, given the example about brainstorming museums in Bogotá, we would turn that into a set of inputs A and outputs B, where first the input A will be that prompt, and the first word it should learn to predict here is "Sure", followed word by word by the rest of "Sure, here are some suggestions", and so on. And when you fine-tune an LLM on the dataset of prompts and good responses, then the LLM will learn to not just predict the next word on the internet, but to answer your questions and to follow your instructions.

This will do okay, but it turns out that there's a technique called reinforcement learning from human feedback, or RLHF, that can improve the quality of answers further. Many companies training LLMs want the LLM to give results that are helpful, honest, and harmless. Sometimes we call this the Triple H, and RLHF is a way to try to accomplish that.

The first step of RLHF is to train an answer quality model. In other words, we'll use supervised learning to learn to rate the answers of an LLM. For example, given a prompt like "Advise me on how to apply for a job", we might have an LLM generate multiple responses, such as "I'm happy to help. Here are some steps to follow", and then have a bunch of useful steps after that. Or it might say "Just try your best", which is not that helpful but not that terrible. Or it might say "It's hopeless, why bother?", so that's clearly not a great response. We would then get humans to help rate these responses according to how helpful, honest, and harmless the LLM's output is, so that better answers are given higher scores, where the first, really helpful answer might get a score of five, the second, so-so answer might get an intermediate score, and the final answer, which is terrible, would get a very low score.
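
To make the idea of an answer quality model a bit more concrete, here is a minimal sketch of supervised learning on human-rated responses: the response text goes in, a score comes out. It is a toy under several assumptions: the score of 5 comes from the lecture, while the 3 and 1 are placeholder values for "intermediate" and "very low"; the bag-of-words linear model stands in for what would in practice be a neural network; and the names `features`, `predict`, and `train` are hypothetical.

```python
from collections import Counter

# Human-rated responses to "Advise me on how to apply for a job".
# Assumption: 3.0 and 1.0 are placeholders for "intermediate" and "very low".
rated = [
    ("I'm happy to help. Here are some steps to follow.", 5.0),
    ("Just try your best.", 3.0),
    ("It's hopeless, why bother?", 1.0),
]


def features(text):
    """Bag-of-words counts: a toy stand-in for how a real model reads text."""
    for ch in ".,?!":
        text = text.replace(ch, " ")
    return Counter(text.lower().split())


def predict(weights, bias, text):
    """Predicted score = bias + sum of the learned weights of the words."""
    return bias + sum(weights.get(w, 0.0) * c for w, c in features(text).items())


def train(data, epochs=500, lr=0.01):
    """Fit the linear scorer with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, score in data:
            error = predict(weights, bias, text) - score
            bias -= lr * error
            for w, c in features(text).items():
                weights[w] = weights.get(w, 0.0) - lr * error * c
    return weights, bias


weights, bias = train(rated)
for text, score in rated:
    print(f"predicted {predict(weights, bias, text):.2f} (human: {score})  {text}")
```

With only three examples the model simply memorizes the human ratings; the point is just the shape of the task, where response and score play the roles of input and output.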

We treat the responses and scores as the input A and the output B for a supervised learning algorithm. Then we can train an AI model using supervised learning to take as input the response from an LLM and score it according to how good the response is.

The second step of this RLHF process is to then have the LLM continue to generate a lot of answers to a lot of different prompts. We now have this AI model to automatically score every single response that the LLM has generated, and this can be used to tune the LLM to generate more responses that get higher scores.

The reason this technique is called reinforcement learning from human feedback is because the scores correspond to the reinforcement, or the reward, that we're giving to the LLM for generating different answers. And by having the LLM learn to generate answers that merit higher scores, or higher rewards, or higher reinforcements, the LLM automatically learns to generate responses that are more helpful, honest, and harmless. So that's how an LLM learns to follow instructions.
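
The second RLHF step described above (generate, score, then tune towards higher scores) can be sketched as a toy loop. Everything here is an illustrative assumption rather than how any particular LLM is trained: the "LLM" is just one adjustable preference per candidate response, `score_response` is a hypothetical stand-in for the trained answer quality model, the scores of 3 and 1 are placeholders, and the update is a simple REINFORCE-style rule. Production systems typically use more elaborate algorithms, such as PPO, on a full neural network.

```python
import math
import random

random.seed(0)  # make the toy run repeatable

# Candidate responses with scores standing in for the answer quality model.
# Assumption: 5 is from the lecture; 3 and 1 are placeholder values.
candidates = {
    "I'm happy to help. Here are some steps to follow.": 5.0,
    "Just try your best.": 3.0,
    "It's hopeless, why bother?": 1.0,
}
responses = list(candidates)


def score_response(text):
    """Hypothetical stand-in for the trained answer quality (reward) model."""
    return candidates[text]


def softmax(logits):
    """Convert preferences into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy "LLM": one adjustable preference (logit) per candidate response.
logits = [0.0] * len(responses)
baseline = sum(candidates.values()) / len(candidates)  # average score
learning_rate = 0.1

for _ in range(2000):
    probs = softmax(logits)
    # Step 1: the "LLM" generates a response by sampling.
    choice = random.choices(range(len(responses)), weights=probs)[0]
    # Step 2: the answer quality model scores it; the score is the reward.
    reward = score_response(responses[choice])
    advantage = reward - baseline
    # Step 3: nudge the preferences so above-average responses become
    # more likely and below-average responses become less likely.
    for k in range(len(logits)):
        indicator = 1.0 if k == choice else 0.0
        logits[k] += learning_rate * advantage * (indicator - probs[k])

final_probs = softmax(logits)
for text, p in zip(responses, final_probs):
    print(f"{p:.3f}  {text}")
```

All three responses start out equally likely; after the loop, the response with the highest score dominates, which mirrors the idea of rewards steering the LLM towards better answers.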

The first step is basically fine-tuning, where you fine-tune it to follow instructions and to answer questions, and then second is RLHF, reinforcement learning from human feedback, to further train it to generate better answers.

In the final optional video, we'll also take a look at some cutting-edge ideas in the technology development of LLMs. Thanks for sticking with me in this video, and hope to see you also in the...
