0:04 whereas RAG gives you one way to give
0:06 additional information to a large language
0:08 model there's another technique called
0:11 fine-tuning which is another way to give
0:13 it more information in particular if you
0:16 have context that is bigger than can fit
0:18 into the input length that is the input
0:20 context window length of the LLM then
0:22 fine-tuning gives you another way to get
0:25 an LLM to absorb this information and
0:27 fine-tuning also turns out to be useful
0:29 for getting the LLM to output text in a
0:31 certain given style but the actual
0:35 implementation is a bit harder than RAG
0:37 let's take a look let's say you have an
0:40 LLM trained the way that we had described
0:42 previously with sentences found on the
0:44 internet like my favorite food is a
0:46 bagel with cream cheese then it may have
0:48 learned from hundreds of billions of
0:50 words or maybe more than a trillion
0:53 words to predict the next word like this
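As a rough illustration of the next-word prediction setup just described, here is a minimal Python sketch that turns the example sentence into (context, next word) training pairs. The function name `next_word_pairs` is made up for this illustration, and splitting on whitespace is a simplifying assumption: real LLMs work on tokens, which are often pieces of words, rather than on whole words.

```python
def next_word_pairs(sentence: str) -> list[tuple[str, str]]:
    """Build (context, next_word) pairs from one sentence.

    Hypothetical helper for illustration only. It splits on whitespace,
    whereas real LLM training pipelines use a tokenizer.
    """
    words = sentence.split()
    # For every position after the first word, the words seen so far
    # are the input and the word at that position is the prediction target.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("My favorite food is a bagel with cream cheese")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

Each printed line is one training example: the text so far on the left and the word the model should learn to predict on the right.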
0:55 an LLM like this will have learned to
0:57 generate text that sounds like what's on
0:59 the internet and this process of
1:01 training a large language model on a lot
1:04 of data is often called pre-training now
1:07 let's say I want to modify the LLM to
1:09 have a relentlessly positive and
1:12 optimistic attitude about everything
1:14 there's a technique called fine-tuning
1:18 that we can use to cause the LLM to do a
1:21 little bit more learning to change its
1:24 outputs to be in this example much more
1:26 positive and optimistic to fine-tune the
1:29 LLM we would come up with a set of
1:32 sentences a set of texts that take on a
1:34 positive optimistic attitude such as
1:36 what a wonderful chocolate cake or the
1:40 novel was thrilling given text like this
1:43 you can then create an additional data
1:45 set using what a wonderful chocolate
1:48 cake you would have given "what" the next word
1:50 it will try to predict is "a" given "what a" the next
1:52 word is "wonderful" given "what a wonderful" the next word is
1:55 "chocolate" and so on and it turns out
1:58 that if you take an LLM that has been
2:00 pre-trained on hundreds of billions of
2:03 words and fine-tune it on just an
2:05 additional say 10,000 words or more
2:07 could be 100,000 words if you have more
2:09 data or even a million words if you have even
2:12 more data fine-tuning on this relatively
2:15 modest-sized data set can shift the output
2:18 of your LLM to take on this positive
2:21 optimistic attitude now maybe shifting
2:23 an LLM to have a relentlessly positive
2:25 attitude isn't that helpful an
2:29 application but fine-tuning is used in
2:32 many real applications one class of
2:34 applications where fine-tuning is useful
2:37 is when the task isn't easy to define in
2:41 a prompt for example if you want to use an
2:46 LLM to summarize customer service calls a
2:49 generic LLM may look at a call like this
2:50 and summarize it to say the customer
2:53 tells the agent about a problem with a
2:55 monitor but if you run a customer call
2:58 center you might want it to generate
3:00 specifics about what the conversation
3:04 was about it was about the MK 4127 KX
3:06 reported broken by customer
3:10 542 and so on and if you create a data
3:14 set with maybe just hundreds of examples
3:17 of human expert written summaries and
3:19 have a large language model that's
3:21 learned from hundreds of billions of
3:23 words on the internet so it's learned a
3:26 lot of general knowledge on the internet
3:28 but if you additionally fine-tune it on
3:31 maybe just hundreds of carefully
3:34 handwritten summaries of this specific
3:36 style then that would shift the LLM's
3:39 ability to write summaries in the style
3:41 that you want and the specific style of
3:44 summary is actually not that easy to
3:46 define in a text prompt maybe you could
3:48 do it but fine-tuning would just be a
3:51 very precise way to tell the LLM what
3:54 summaries you want another example of
3:56 when a task isn't easy to define in a
3:58 prompt is if you want to mimic a
4:02 specific writing or speaking style so
4:04 Tommy Nelson who's been working with me
4:06 on this course actually tried kind of
4:09 just for fun to get an LLM to sound like
4:12 me but it turns out that the way most
4:15 individuals sound is not that easy to
4:17 describe in a prompt I mean how would
4:21 you give someone clear instructions to
4:24 sound like me so if you were to prompt a
4:27 general-purpose LLM and ask it to sound like
4:30 me you get text like this which I don't
4:32 think sounds that much like me but if
4:34 you were to take a lot of transcripts of the
4:37 way I actually talk and have an LLM be
4:40 fine-tuned to train it to really sound
4:43 exactly like me by learning on my actual
4:46 words then asking it to write something
4:48 that sounds like me results in text like
4:50 this which I don't know this sounds more
4:52 like how I would talk but because
4:55 mimicking a specific writing or speaking
4:58 style is very difficult to do via
4:59 prompting because it's just difficult to
5:02 describe a specific person's style by
5:05 writing text instructions fine-tuning
5:08 turns out to be a more effective way to
5:12 get an LLM to speak in a certain style
5:14 and if you're building an artificial
5:17 character maybe a cartoon character
5:19 fine-tuning could also be a way to get an LLM
5:22 to speak in a certain style other than
5:24 tasks that aren't easy to define in the prompt
5:27 a second broad class of applications of
5:30 fine-tuning is to help the LLM gain a
5:33 domain of knowledge for example if you
5:35 want an LLM to be able to read and
5:39 process medical notes this is what a
5:41 medical note written about a patient by
5:43 a doctor might look like and this is
5:47 really not normal English Pt is patient
5:49 c/o complaining of SOB so shortness of
5:53 breath DOE dyspnea on exertion PE this
5:55 is the results of the physical
5:57 examination and so on treatment is to
5:59 follow up with the primary care
6:02 physician stat chest x-ray continuing
6:05 treatment as needed on oxygen but this
6:08 is really not normal English and if you
6:11 were to take an LLM trained on normal
6:13 English it wouldn't be very good at
6:16 processing text like this so if you were
6:19 to fine-tune an LLM on a collection of medical
6:22 records then the LLM could get much
6:24 better at absorbing this body of
6:26 knowledge about what medical notes sound
6:28 like and you could then use that to
6:31 build other applications on top of it to
6:34 better understand medical records or
6:37 legal documents here's a piece of legalese
6:40 kind of written by lawyers for
6:43 lawyers that's really difficult for
6:45 non-lawyers to read licensor grants licensee per
6:47 section 2(a)(3) a non-exclusive right and
6:51 so on and so on within 15 days hereof I
6:53 don't know about you but I do not use the
6:56 word hereof in my ordinary day-to-day
6:58 speech but this is what legal documents
7:01 sound like and if you want your LLM to
7:03 gain a body of knowledge about how to
7:07 read and understand legal documents then
7:10 taking an LLM and fine-tuning it on legal
7:12 documents would help it to gain that
7:14 body of knowledge and similarly
7:16 financial documents too fine-tuning an
7:20 LLM on a large set of financial documents
7:23 would help it to better gain that body
7:26 of knowledge about finance and make it
7:28 better at applications involving
7:31 processing documents that look like this
7:34 finally another reason to fine-tune an LLM is
7:39 to get a smaller model to perform a task
7:41 that may previously have required a
7:43 larger model we'll discuss later this
7:45 week some of the pros and cons of
7:48 choosing a larger versus a smaller model
7:51 but for some applications that need a
7:54 lot of knowledge or need complex
7:57 reasoning you might use a relatively
8:00 large model say with over 100 billion
8:01 parameters but if you were to use a
8:04 model like that such a model may have
8:07 relatively high latency meaning after
8:08 you prompt it you might need to wait a
8:12 while to get back a response and if you
8:14 were deploying this on your own
8:16 computers it could be quite costly and
8:19 even though we said in the earlier video
8:20 that these models aren't that expensive
8:23 maybe you want it to be even cheaper and
8:25 that's because a 100 billion parameter
8:28 model may take specialized computers
8:31 such as a GPU server or other really
8:33 fast computers to run you'd probably have
8:36 a hard time running such a large model
8:39 on a normal laptop or PC and certainly
8:43 not on a smartphone today but if you can
8:45 get your application to work on a much
8:48 smaller model say 1 billion parameters
8:52 then that's the range of model size that
8:54 would run much more easily on a
8:57 laptop or a PC or on a mobile phone
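To see roughly why the two model sizes land on such different hardware, here is a back-of-the-envelope Python sketch of how much memory the model weights alone would take. It assumes 2 bytes per parameter (16-bit weights), which is a common choice but is not stated in this video, and it ignores the extra memory needed while the model is running. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Estimate memory for model weights only, in gigabytes (1 GB = 1e9 bytes).

    Assumption for illustration: 2 bytes per parameter (16-bit weights).
    Real deployments also need memory for activations and other overhead.
    """
    return num_parameters * bytes_per_parameter / 1e9


large_model_gb = weight_memory_gb(100e9)  # 100 billion parameters
small_model_gb = weight_memory_gb(1e9)    # 1 billion parameters

print(f"100B parameters: about {large_model_gb:.0f} GB of weights")
print(f"1B parameters: about {small_model_gb:.0f} GB of weights")
```

Under these assumptions the 100 billion parameter model needs about 200 GB for its weights alone, far more than a typical laptop or phone has, while the 1 billion parameter model needs about 2 GB.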
9:00 for example if what you want is to
9:03 classify restaurant reviews as positive
9:05 or negative sentiment this is a simple
9:07 enough task that you probably don't need
9:10 a 100 or 200 billion parameter model to
9:12 run but maybe a 1 billion parameter
9:14 model would be just fine maybe even
9:16 smaller
9:19 frankly but these smaller models aren't
9:21 as smart they aren't as good
9:24 as the really large models which is why if
9:26 you were to take a small model and then
9:29 fine-tune it on a data set like the
9:32 one shown here not just three examples
9:34 but maybe a few hundred or maybe a
9:36 thousand examples if you have that much
9:39 data then you can get a small model say
9:42 a billion parameters to do really well
9:46 on a task like this so to summarize
9:48 fine-tuning gives you another technique
9:50 in addition to RAG to help improve the
9:53 capabilities of an LLM you might use it
9:56 for tasks that are hard to specify in a
9:58 prompt such as if you wanted it to output
10:01 text in a certain style or if you want the LLM to
10:03 gain a body of knowledge such as about
10:06 medical notes or if you want to get a
10:09 smaller and cheaper-to-run LLM to do a
10:11 task that might otherwise have required
10:12 a larger
10:15 LLM it turns out that RAG and fine-tuning
10:19 are both relatively cheap to implement
10:23 RAG is just modifications of your prompt
10:25 and with fine-tuning you might be able to get
10:27 started with tens of dollars or maybe
10:29 low hundreds of dollars
10:31 depending on how much data you want to
10:34 fine-tune on there's another technique
10:37 pre-training your own model that turns
10:40 out to be very expensive and today
10:42 almost no one other than reasonably
10:45 large companies usually tech companies
10:47 are attempting this but for completeness
10:49 let's take a look at the next video at