Large Language Models (LLMs) are enhanced beyond simple next-word prediction through instruction tuning and Reinforcement Learning from Human Feedback (RLHF) to become more capable, helpful, honest, and harmless assistants.
We've been thinking of LLMs as having learned from a lot of text on the internet to predict the next word. But when you prompt an LLM, it doesn't just predict the next word on the internet; it actually follows your instructions. So how does it do that? In this optional video, we'll talk about a technique called instruction tuning that enables LLMs to do that, and then also a technique called RLHF, reinforcement learning from human feedback, that has been instrumental in making LLM outputs safer. Let's take a look at what these techniques do.

We've discussed LLMs as having been pre-trained on a lot of text like this: "My favorite food is a bagel with cream cheese." And so an LLM trained on data like this would be good at repeatedly predicting the next word based on what text on the internet sounds like. If you were to prompt an LLM with a question like "What is the capital of France?", it is quite possible that it will reply, "What is the capital of Germany? Where is Mumbai? Is Mount Fuji or Mount Kilimanjaro taller?", because you do see lists of questions on the internet about, say, geography. So if you see a web page that says "What is the capital of France?", it's actually quite plausible that what comes after it is "What is the capital of Germany?" But this isn't the answer you want. You wanted it to say that the capital of France is Paris.

So in order to get an LLM to follow instructions and not just predict the next word, there's a technique called instruction tuning, which is basically to take a pre-trained LLM and fine-tune it on examples of good answers to questions, or good examples of the LLM following your instructions. So we may give it a question-response pair like this, "What is the capital of South Korea?", and fine-tune it, given this input prompt, to output "The capital of South Korea is Seoul." Or "Help me brainstorm some fun museums to visit in Bogotá," and fine-tune it to an answer like this. Or an instruction like "Write a haiku poem about Japan's cherry blossoms," and fine-tune it to generate that.

And to try to make this safer, we can also include some examples like "Tell me how to break into Fort Knox." Fort Knox is a very secure facility in the United States that stores a massive amount of the US Treasury's gold, so trying to break into Fort Knox would be a terrible idea. Please don't anyone try to do that. But I think a good answer for the output would be something like "I can't assist with that," or "Please don't break the law."

So given a dataset like this, you can then fine-tune a pre-trained LLM on a set of good answers to different prompts. Specifically, given the example about brainstorming museums in Bogotá, we would turn that into a set of inputs A and outputs B, where first the input A will be that prompt, and the first word it should learn to predict here is "Sure", the second word is "here", as in "Sure, here are some suggestions", and so on. And when you fine-tune an LLM on a dataset of prompts and good responses, then the LLM will learn to not just predict the next word on the internet, but to answer your questions and to follow your instructions. This will do okay.
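
The way a prompt and a good response are turned into inputs A and outputs B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `make_training_pairs` is hypothetical, and words are split on whitespace, whereas real LLMs work on subword tokens and are trained with a neural network rather than a list of pairs.

```python
from typing import List, Tuple


def make_training_pairs(prompt: str, response: str) -> List[Tuple[str, str]]:
    """Expand one (prompt, response) example into next-word prediction pairs.

    Each pair is (input A, output B): A is the prompt plus the response words
    seen so far, and B is the next response word to predict.
    Assumption: words are split on whitespace; real LLMs use subword tokens.
    """
    pairs = []
    context = prompt
    for word in response.split():
        pairs.append((context, word))
        # The word just predicted becomes part of the next input A.
        context = context + " " + word
    return pairs


prompt = "Help me brainstorm some fun museums to visit in Bogotá."
response = "Sure, here are some suggestions"

pairs = make_training_pairs(prompt, response)
for a, b in pairs:
    print(f"A: {a!r} -> B: {b!r}")
```

The first pair asks the model to predict "Sure," from the prompt alone, and the second asks it to predict "here" from the prompt followed by "Sure,". Fine-tuning on many such pairs is what nudges the model toward answering rather than continuing a list of questions.
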
It turns out that there's a technique called reinforcement learning from human feedback, or RLHF, that can improve the quality of answers further. Many companies training LLMs want the LLM to give results that are helpful, honest, and harmless. Sometimes we call this the triple H. And the technique RLHF is a way to try to accomplish that.

The first step of RLHF is to train an answer quality model. In other words, we'll use supervised learning to learn to rate the answers of LLMs. For example, given a prompt like "Advise me on how to apply for a job," we might have an LLM generate multiple responses, such as "I'm happy to help. Here are some steps to follow," and then have a bunch of useful steps after that. Or it might say, "Just try your best," which is not that helpful but not that terrible. Or it might say, "It's hopeless. Why bother?" So that's clearly not a great response. And we would then get humans to help rate these responses according to how helpful, honest, and harmless the LLM's output is, so that better answers are given higher scores, where the first, really helpful answer might get a score of five, the second, so-so answer might get an intermediate score, and the final answer, which is terrible, would get a very low score. And we treat the responses and scores as the input A and the output B for a supervised learning algorithm. Then we can train an AI model using supervised learning to take as input the response from an LLM and score it according to how good the response is.

The second step of this RLHF process is to then have the LLM continue to generate a lot of answers to a lot of different prompts. And we now have this AI model to automatically score every single response that the LLM has generated, and this can be used to tune the LLM to generate more responses that get higher scores. The reason this technique is called reinforcement learning from human feedback is because the scores correspond to the reinforcement, or the reward, that we're giving to the LLM for generating different answers. And by having the LLM learn to generate answers that merit higher scores, or higher rewards, or higher reinforcements, the LLM automatically learns to generate responses that are more helpful, honest, and harmless.

So that's how an LLM learns to follow instructions. The first step is basically fine-tuning, where you fine-tune it to follow instructions and to answer questions, and then second is RLHF, reinforcement learning from human feedback, to further train it to generate better answers. In the final optional video, we'll also take a look at some cutting-edge ideas in the technology development of LLMs. Thanks for sticking with me in this video, and hope to see you also in the next video.
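
The two RLHF steps described in this video can be sketched as a toy example. Everything here is an assumption made for illustration: the function names are hypothetical, the intermediate and low scores are set to 3 and 1 (the video only gives five for the best answer), the answer quality model is a small bag-of-words linear model, and the "LLM" is just a probability distribution over three fixed responses that is updated with the exact expected policy gradient. Real systems use neural networks for both models and sample responses from the LLM itself.

```python
import math
from typing import Callable, Dict, List, Tuple

# Step 1 data: responses rated by humans. The 5.0 is from the video's example;
# 3.0 and 1.0 are assumed stand-ins for "intermediate" and "very low".
rated: List[Tuple[str, float]] = [
    ("I'm happy to help. Here are some steps to follow.", 5.0),
    ("Just try your best.", 3.0),
    ("It's hopeless. Why bother?", 1.0),
]


def featurize(text: str, vocab: Dict[str, int]) -> List[float]:
    """Bag-of-words counts plus a constant bias feature."""
    vec = [0.0] * (len(vocab) + 1)
    vec[-1] = 1.0  # bias term
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def train_reward_model(data: List[Tuple[str, float]], epochs: int = 2000,
                       lr: float = 0.01) -> Callable[[str], float]:
    """Step 1: supervised learning from response (input A) to score (output B)."""
    vocab: Dict[str, int] = {}
    for text, _ in data:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    weights = [0.0] * (len(vocab) + 1)
    for _ in range(epochs):
        for text, score in data:
            x = featurize(text, vocab)
            error = sum(w * xi for w, xi in zip(weights, x)) - score
            # Gradient step on squared error for this one example.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return lambda text: sum(
        w * xi for w, xi in zip(weights, featurize(text, vocab)))


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tune_policy(candidates: List[str], reward_fn: Callable[[str], float],
                steps: int = 200, lr: float = 0.5) -> List[float]:
    """Step 2: shift probability toward responses the reward model scores highly."""
    logits = [0.0] * len(candidates)  # start with all responses equally likely
    rewards = [reward_fn(c) for c in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward with respect to each logit.
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


reward = train_reward_model(rated)
candidates = [text for text, _ in rated]
tuned = tune_policy(candidates, reward)
for text, p in zip(candidates, tuned):
    print(f"{p:.3f}  {text}")
```

After tuning, most of the probability sits on the response the reward model scores highest, which mirrors the idea that the LLM learns to generate answers that merit higher rewards.
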