This content provides a foundational overview of key terms and concepts in Artificial Intelligence (AI), particularly focusing on Large Language Models (LLMs), to equip engineers with the knowledge needed for effective communication and deeper learning in the AI space.
Hi everyone, this is GKCS.
In today's video we will see some of the commonly used terms in the AI space.
If you are an engineer who is building applications, then you will find these terms useful when communicating with people within your team or outside.
And I think if you know these terms, then it is also easier to learn the deeper subjects around AI.
So by the end of this video, you'll have a list of terms
whose definitions you understand quite well.
And I'll also be linking
some references in the description so that you can dig into them further.
Let's start.
The first term that you should know about
is large language model.
Also known as an LLM.
The definition of this is a neural network that is trained to predict the next token of an input sequence.
For example,
if I pass in the query all that glitters
to a large language model, then
it's going to come up with the response 'is', then 'not', then 'gold', at which point the complete response 'all that glitters is not gold' is returned to the user.
What do we mean by training?
What do we mean by neural network?
As we go through this video, you will understand these terms better one by one. Okay.
The second term that we're looking at is tokenization.
This has to do with processing the input of a large language model.
For example, if all that glitters
is passed into a large language model,
the first thing it's going to do is break this into discrete tokens.
That is the process of tokenization.
The first token will be 'all'. Then there's a space character, then 'that', after which you have 'glitt', and finally 'ers'.
You might think, well, why shouldn't you just break this on space characters and get the job done?
But humans do not talk like that.
We are, after all, trying to process natural language.
So 'ers' is a common suffix.
Shimmers, murmurs, flickers.
These are terms which have the suffix 'ers', which means that the action of glittering is being performed by that object.
Another example of this is 'ing'.
So eating, dancing, singing all have the suffix 'ing', and a large language model can look at this token of 'ing' and know that the preceding action is being performed, as long as you have the suffix.
Okay, remember, the core problem for the large language model is to truly understand human language so that it can speak it really well.
Tokenization is an essential part of that, and its end result is that the input text is broken into tokens.
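To make this concrete, here is a minimal Python sketch of greedy longest-match subword tokenization. The tiny vocabulary is made up purely for illustration; real LLMs learn their vocabulary with schemes like Byte Pair Encoding, so treat this as a sketch of the idea, not the actual algorithm.

```python
# A minimal sketch of subword tokenization using greedy longest-match.
# The toy vocabulary below is invented for illustration; real models learn
# their subword vocabulary (e.g. with BPE) from data.
VOCAB = {"all", "that", "glitt", "ers", "eat", "danc", "sing", "ing", " "}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                tokens.append(piece)
                i = end
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("all that glitters"))
# ['all', ' ', 'that', ' ', 'glitt', 'ers']
```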
Which brings us to our third term
vectors.
Tokens tell you what you should focus on.
What is the smallest term that you can derive meaning from?
But what meaning has to be derived
is represented by vectors.
If the large language model can map words into a two dimensional or an n dimensional space, such that all the words which are close in meaning are placed close to each other, then the benefit will be that the meaning of these words is turned into a coordinate in this n dimensional space.
This coordinate is called a vector.
Okay.
The coordinate, the mapping of a word into an n dimensional space such that nearby, similar-meaning words are all clustered together and opposite-meaning words are somewhere far away, comes through the process of vectorization.
The end result of this is that large language models know the inherent meaning of all the words that are in the English vocabulary, and they also know how to break any input text into small tokens.
Words which are similar to each other are placed close to each other.
Once they know the meaning, they can construct sentences effectively.
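Here is a small sketch of that idea, with tiny made-up 3-dimensional vectors and cosine similarity as the notion of "closeness". Real models use hundreds or thousands of learned dimensions; these numbers are invented.

```python
# A minimal sketch of word vectors and similarity. The 3-dimensional
# embeddings below are made up; real models learn them during training.
import math

EMBEDDINGS = {
    "apple":   [0.9, 0.1, 0.2],
    "banana":  [0.8, 0.2, 0.1],
    "revenue": [0.1, 0.9, 0.7],
    "profit":  [0.2, 0.8, 0.8],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Words close in meaning end up close in the vector space.
print(cosine_similarity(EMBEDDINGS["apple"], EMBEDDINGS["banana"]))   # high
print(cosine_similarity(EMBEDDINGS["apple"], EMBEDDINGS["revenue"]))  # lower
```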
Okay, so now you have large language models
which can tokenize input text, convert them into vectors.
But there is one major challenge which actually changed the entire industry here, which made large language models very popular.
And that is attention.
We just said that
all the input tokens for a large language model are converted into vectors.
The vectors encapsulate the meaning of those words.
But what about the word apple
when you say it is a tasty apple,
you mean the fruit, the edible apple?
When you say apples revenue,
you probably mean the company.
And if you say the apple of my eye,
you are probably talking about a young person who you have affection for.
So Apple has different meanings,
and the only way to understand the meaning is not by looking at the word itself,
because that spelling is the exact same, but by looking at nearby words
which add context to the meaning of apple.
The moment I said tasty, you know that it's some sort of food that we are going to talk about.
That's how humans derive meaning, and large language models can derive meaning this way too.
Now, the way they do this is to look at nearby words in a sentence and generate those vectors, so nearby contextual vectors are picked up.
For ambiguous terms you end up with ambiguous vectors, but you can derive the exact meaning by adding the nearby contextual vector to it.
So take the vector of 'apple' and take the vector of 'revenue'.
When you add these two vectors, when you perform some sort of an operation, it's not a direct addition but the attention operation, you effectively take the vector of 'apple' and push it in the direction of the company Apple.
So Google, Meta and Microsoft are all here, and the attention operation with the vector of 'revenue' is going to send it there.
If instead you perform the attention mechanism with the vector of 'tasty', then it's going to push the vector of 'apple' towards banana, chiku and guava.
Okay, so you can tokenize input text.
You can derive the inherent meaning of all of those tokens.
And for ambiguous tokens, for tokens which are difficult to understand, you have a mechanism to add context by looking at nearby words.
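To give a feel for the operation, here is a toy scaled dot-product attention over a two-token sequence in numpy. The numbers are invented, and real models learn separate query, key and value projections with many attention heads; this only shows how context gets mixed into an ambiguous vector.

```python
# A minimal sketch of scaled dot-product attention over a tiny sequence.
# The vectors are made up; real models learn query/key/value projections.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # how much each token attends to the others
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                        # context-mixed vectors

# Two-token sequence: "tasty apple". Each row is a token's vector.
tokens = np.array([[0.9, 0.1, 0.0],   # "tasty"
                   [0.5, 0.5, 0.2]])  # "apple" (ambiguous on its own)

contextual = attention(tokens, tokens, tokens)
print(contextual[1])  # the "apple" vector, now pulled towards the "tasty" (fruit) direction
```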
And this
is another breakthrough that large language models have made.
This was in 2017; the paper came out then.
But in 2022 this became really, really famous, with ChatGPT being released.
The quality of responses of a large language model far exceeds anything else that we had seen earlier.
Okay, because it is able to derive contextual meaning,
it's able to construct sentences in a way that humans speak.
Okay, so now we know how LLMs
can process input.
But how do you train them
to predict the next token?
Okay, here's where there was
a major breakthrough in 2017.
Basically, the concept of self-supervised learning became very popular.
Self-supervised learning means that
instead of telling the model exactly what it needs to do,
the structure of the input data is such that the model knows what it should do.
Okay.
For example, you're watching this video right now.
I'm going to make a part of this video blank.
So: five, four, three, two, and then a blank.
What do you think is being hidden right now?
What number is coming to your mind?
Let's see if that is right.
Yes, most of you guessed 'one', because we went in the sequence five, four, three, two, one.
Okay.
But when it comes to a video, you can also do something else.
Let me make another part of the video blank right now.
Where do you think the other eye is looking?
Let's check.
Most of you got it right.
Both eyes are looking upwards.
So what's happening is a section of the input can be predicted even if you make that section blank, which means that there is inherent structure in your input, which your mind is able to fill in with the expected token or expected output.
Now, the standard way to train such a model would be called supervised
learning, where you would have a human being say that
if the input text is all that glitters, then the model should
predict is not gold.
If the input text is 'Et tu', then the output should be 'Brutus' instead.
Self-supervised learning has made getting training data much cheaper here.
If you have 'Et tu, Brutus', then the model is going to be fed this text and it's going to make three predictions: one, what comes after 'Et'; two, what comes after 'Et tu'; and three, what comes after 'Et tu, Brutus'.
Okay, no humans are involved.
You had some text in the world, maybe you scraped this off the internet, and now you're asking the model: look, I have three questions for you, tell me what the right answers are.
So the model looks at these three puzzles, they are all running in parallel, and it tries to make predictions.
For the first one the model might say something, but you train the model that 'tu' is the expected response.
So if it makes a mistake, then you penalize the model, which increases the loss, and so the neural network weights are updated.
In the second task you have 'Et tu'.
If the model makes the prediction of 'Brutus', then you tell the model that this is great, the weights don't need to be updated.
But if it says 'Caesar', then the model has to be penalized, and so the internal weights are updated.
In the third case, after 'Et tu, Brutus', if you predict a stop token, that's it, then you will get it wrong.
If it is a comma, you're right.
And if it's something close to that, then maybe you're also right.
Okay.
What you're doing is you are looking at text,
which already exists in the world, and you're creating multiple
challenges for yourself without human intervention.
This is what makes the model self-supervised.
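Here is a small sketch of how raw text turns into those training "puzzles" with no human labelling. The whitespace split stands in for real tokenization, which we covered above.

```python
# A minimal sketch of self-supervised next-token prediction: raw text becomes
# (input, target) training pairs with no human labelling. Whitespace splitting
# is a simplification of real tokenization.
def next_token_pairs(text: str) -> list[tuple[list[str], str]]:
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        # Everything up to position i is the input; token i is the target.
        pairs.append((tokens[:i], tokens[i]))
    return pairs

for context, target in next_token_pairs("all that glitters is not gold"):
    print(context, "->", target)
# ['all'] -> that
# ['all', 'that'] -> glitters
# ...and so on; each pair is a training puzzle created from the text itself.
```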
It might seem like a small thing, but this architectural decision
or this benefit of the large language model makes it really, really scalable.
In fact, most AI models now are moving to self-supervised learning.
Even image models, like we discussed, are looking at removing some patches of the image and trying to predict those patches.
The benefit of this is you understand the underlying structure
and the inherent meaning of those patches.
In the case of text, it's going to be terms.
In the case of images, they are a bunch of pixels.
And in the case of video you might understand how an object even moves.
Okay.
So that explains what self-supervised learning is.
Next is the transformer
okay.
And most people confuse transformer with large language model,
which is completely understandable actually.
But that's not the case.
A large language
model is something which predicts the next token given an input sequence.
A transformer does the exact same thing, but it's a specific
algorithm or a specific method by which you predict the next token.
A transformer basically is input tokens being run through an attention block, which is then forwarded to a feedforward neural network, and then you have a bunch of outputs.
Okay, you can think of these as output vectors.
These vectors are then passed in
to another layer of attention.
The first layer of attention, like we said, disambiguates terms.
The second layer might find more complex relationships.
It might find sarcasm, it might find implications.
For example, 'a crane was hunting a crab'.
In the first layer you understood it is not the metal crane, it's the bird crane.
But in the second one you might infer that the crab is fearful, you might understand the crane is hungry.
So this is the second layer.
And then you have another feedforward neural network
and so on.
Till finally you are confident enough to generate an output.
Okay, so you have these stacked.
Sometimes they're stacked to 12 layers, sometimes more; I think recent GPT architectures are in the hundreds.
The main idea behind this is getting all of the meaning from your input tokens and then manipulating them again and again to finally predict what the next word should be.
This attention block is O(n²) in the number of input tokens.
Okay.
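To make the stacking idea concrete, here is a tiny numpy sketch of attention followed by a feedforward layer, repeated 12 times. All sizes and weights are random and made up; a real transformer learns the weights and adds residual connections, layer norm and multiple heads.

```python
# A tiny sketch of stacked transformer blocks: attention mixes context between
# tokens, a feedforward network transforms each token vector, repeated layer
# after layer. Weights here are random stand-ins, not learned.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                              # embedding dimension (made up)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x, w1, w2):
    attended = softmax(x @ x.T / np.sqrt(d)) @ x   # attention: every token looks at the others
    return np.maximum(attended @ w1, 0) @ w2       # feedforward: transform each token vector

tokens = rng.normal(size=(5, d))                   # 5 input token vectors
for _ in range(12):                                # 12 stacked layers, as mentioned above
    w1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
    w2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
    tokens = transformer_block(tokens, w1, w2)

print(tokens.shape)   # still (5, 8): same number of tokens, refined again and again
```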
You could replace this transformer in a large language model with something else, like a state space model.
A new architecture could come in, in which case the transformer and the state space models are gotten rid of; it could be a diffusion model that constructs essays or text.
Okay, so the large language model is actually the product.
You can think of it as a car.
And this is the engine.
A car, many people say is just the engine.
But no, there are some other fancy things around it.
The internal algorithm can be different.
Term number seven is fine tuning.
We said that a large language model is something that is trained to predict the next token of an input sequence.
The question is what type of next token are we talking about?
If you are talking about a medical large language model, something which helps
doctors explain the diagnosis of a patient,
then you're probably going to be thinking of medical terms.
If you have a model which is trained on financial operations, then the same model, for the same query, is going to think in financial terms.
So the next token that the model comes up with is not always going to be general.
You're first going to train your base model in a self-supervised fashion.
Then you're going to take that model and make it go through a series of questions and answers.
This process is called fine tuning, and it goes something like: 'Who is the president of the USA?' 'Donald Trump.'
But the model could also say, 'I would like to know that too.'
Here's where things are going wrong, okay.
The model should not be responding like this.
Give us a direct answer or confess that you do not know.
Or it could just say 'no'.
But then this is also very, very bad, because the models are trained to be helpful.
Okay, so what's happening is that other plausible responses, which are not wrong but are not desirable, are penalized in the fine tuning process.
You have these questions and answers, and the fine tuning process forces the model to take a question and give answers as expected.
So when it comes to a medical diagnosis, the model is going to train itself.
The internal weights will be updated in such a way
that it will learn to speak in medical jargon or medical terms.
And so this step, where a base model
is trained to answer in a specific way, is called fine tuning.
The same base model can be run through different sets of questions and answers to come up with multiple fine tuned models.
So the base model of Llama can be fine tuned by a company to answer its customers' specific queries.
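To make this concrete, here is a small sketch of turning question and answer pairs into fine tuning examples. The pairs and the prompt template are made up for illustration; they are not any particular model's actual training format.

```python
# A minimal sketch of preparing a supervised fine tuning dataset: question and
# answer pairs become (prompt, expected completion) examples that the base
# model is then trained on. Pairs and template are illustrative only.
qa_pairs = [
    ("Who is the president of the USA?", "Donald Trump."),
    ("Where is my parcel?", "Let me check your order status and get back to you."),
]

def to_training_example(question, answer):
    return {
        "prompt": f"Question: {question}\nAnswer: ",
        "completion": answer,   # during fine tuning, deviations from this are penalized
    }

dataset = [to_training_example(q, a) for q, a in qa_pairs]
for example in dataset:
    print(example)
# The same base model can be fine tuned on different datasets like this
# to produce multiple specialized models.
```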
Few-shot prompting.
So the main idea behind few-shot prompting is that before you send a query to a model, before you send a plain vanilla query to a large language model and ask it to come up with a response, you augment the query.
You add more information by saying: look, if the query is 'Where is my parcel?', then let me tell you that there are some examples that I want you to go through.
This is happening during inference time, during response time, in production, right?
Live, your system, your server, sends the original query and sends examples to the model so that it takes these into context and then gives an appropriate response.
The quality of the response goes up.
This is called few-shot prompting.
It's basically example prompting: examples in the prompt.
That's it.
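Here is a small sketch of what the server might do. The examples and the call_llm function are hypothetical placeholders, not any specific API.

```python
# A minimal sketch of few-shot prompting: the server prepends a couple of
# worked examples to the user's query before calling the model.
EXAMPLES = [
    ("Where is my parcel?", "Your parcel is out for delivery and should arrive today."),
    ("I want to cancel my order.", "I have started the cancellation; you'll get a refund in 3-5 days."),
]

def build_few_shot_prompt(user_query):
    shots = "\n\n".join(f"Customer: {q}\nAgent: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nCustomer: {user_query}\nAgent:"

print(build_few_shot_prompt("My package says delivered but I never received it."))
# response = call_llm(prompt)  # hypothetical call: the examples steer the format and tone
```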
That brings us to point number nine, which is very interesting and has completely exploded, which is retrieval augmented generation.
In fact, the AI space is moving so quickly that people are saying RAG, or retrieval augmented generation, is already dead.
So the basic idea, again, is that you have a large language model and you pass in the input from the server.
So a customer connects to you here, they hit your API.
The server says: you know what, this is the customer query, let me forward that to the language model.
Along with that, let's give some examples.
So that's few-shot prompting.
And along with that, since there are some company policies that I want you, the large language model, to know of, I'll give you those documents.
So in real time the server goes and fetches the most relevant documents.
Maybe your policy document, maybe your terms and conditions when placing an order, and maybe many more things.
Right?
You send these documents along with examples of how the model should respond.
The examples give it a good idea of the format of the response, the documents give it the company specific context, and then there is the direct user input query.
Okay, with all of this, the large language model tends
to give very high quality responses.
Now the question is where are you getting these documents from?
How does the server know which documents are related to which query?
There are many ways to do this.
If you talk to Neo4j, which is a graph database company, they will tell you that you should store things in a graph DB.
If you talk to Neon, then they will tell you that you should store things in a vector DB, and some people will say just keep everything in memory, just keep everything in cache.
How you fetch the documents doesn't matter so much.
Usually it's a vector DB, by the way, because it is easier to find relevant documents: you just do a similarity search.
Once you have the documents, you pass them to the large language model.
The large language model converts them internally into vectors and then gives you a response.
Okay, but at a high level you just want to add more and more context.
You retrieve the context, augment the query,
and then generate a response.
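Here is a small sketch of that retrieve-augment-generate loop. The documents are invented, retrieval here is a toy word-overlap score standing in for the vector similarity search discussed next, and the final LLM call is a hypothetical placeholder.

```python
# A minimal sketch of retrieval augmented generation. Retrieval is a toy
# word-overlap score so the example runs on its own; real systems retrieve
# by vector similarity (see the next term).
COMPANY_DOCS = [
    "Refund policy: refunds are issued within 5 business days of approval.",
    "Shipping policy: parcels are delivered within 3 to 7 days of ordering.",
    "Terms: orders can be cancelled free of charge before they are shipped.",
]

def words(text):
    return set(text.lower().replace(".", "").replace(":", "").replace(",", "").split())

def retrieve(query, docs, top_k=2):
    q = words(query)
    return sorted(docs, key=lambda d: -len(q & words(d)))[:top_k]

def build_rag_prompt(query):
    context = "\n".join(retrieve(query, COMPANY_DOCS))      # retrieve
    return (f"Company context:\n{context}\n\n"              # augment
            f"Customer: {query}\nAgent:")                   # the LLM then generates

print(build_rag_prompt("I am upset with your payment system. I expect a refund."))
# response = call_llm(build_rag_prompt(...))  # hypothetical model call
```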
The 10th term is the vector database.
We just mentioned that a vector database is something which is used to find relevant documents for an incoming query.
Let's see how that happens.
You have the request.
I am upset with your
payment system.
I expect a refund.
There are a lot of terms in this query.
A human being can read this and easily understand what the user is feeling.
They are feeling upset.
I mean, they've already mentioned it, but they are looking for a refund; if you give them a refund, maybe the upset feeling will go away.
What do you do?
Which documents do you search for?
You could search for all documents where the word upset exists,
but maybe you do not have it in your company policy.
Maybe nowhere is it mentioned that a user is upset,
but you have a document which mentions
if the user is giving you a low rating,
or if a user drops off.
How do you make the decision that 'upset', as a word, is close to 'low rating' or 'drop off'?
We spoke about vectors.
Vectors can encapsulate semantic meaning, which means documents which store
similar words are going to be similar
or close in distance.
Remember, vectors are basically coordinates, right?
So the distance between 'upset' and documents having 'low rating' is going to be low.
You will fetch the documents which mention low rating or drop offs
and use them to add context to your large language model.
When you have an incoming query from the user, you're going to find which document is closest to the query and add that to the large language model's context.
So this document will be sent along with the original user query
and maybe a system prompt.
Where are you going to store these documents
in a vector database,
which helps you perform these similarity searches efficiently.
One of these algorithms is Hierarchical Navigable Small World (HNSW); we have spoken about this in detail in the InterviewReady course.
At the end of the day, the vector database is like a black box to you: you can store documents and you can quickly retrieve them when you need them.
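Here is a toy in-memory vector store doing a brute-force cosine similarity search. The embeddings are made up, and real vector databases use approximate indexes like HNSW instead of scanning everything.

```python
# A minimal sketch of what a vector database does: store document vectors and
# return the nearest ones to a query vector. Brute-force scan, toy embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyVectorDB:
    def __init__(self):
        self.items = []                      # list of (vector, document)

    def add(self, vector, document):
        self.items.append((vector, document))

    def search(self, query_vector, top_k=1):
        ranked = sorted(self.items, key=lambda it: -cosine(it[0], query_vector))
        return [doc for _, doc in ranked[:top_k]]

db = ToyVectorDB()
db.add([0.9, 0.1], "Policy for users who give a low rating or drop off")
db.add([0.1, 0.9], "Shipping timelines for international orders")

# "upset" embeds close to the low-rating document, so that's what gets retrieved.
print(db.search([0.85, 0.2], top_k=1))
```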
Great.
So you can store internal company documents and information
in a vector database to get context for a large language model.
But what if the context exists outside your system?
So this challenge was met with
model context protocol.
Okay.
As the name suggests, it's a protocol, a way to communicate, that transfers context into a model.
I made a detailed video on this, you can check it out, but the basic idea here is that you have a large language model which, when receiving an incoming query from a user, has a client, an MCP client (model context protocol client), which forwards the initial user query.
The LLM now makes a decision.
It says that there may be external tools or databases
that I want to connect to.
The client gets to know of this
and connects with external MCP servers.
In one case, that might be IndiGo.
In another case that will be Air India, whose MCP server can give you details around Air India.
So you can think of one as a wrapper for Air India's database, and the other as a wrapper for IndiGo's database.
As a response, you are going to get flight details from each of these airlines.
Once you have the details, you can forward them to the LLM, saying: hey, along with the user query, and along with the system prompt and whatever relevant context I could get from my vector database, I'm also adding flight details, real time information from external servers, which you can now consume to come up with a decision.
Okay.
And the large language model at this point might say: okay, book flight number IndiGo 1020, which then results in another API call to book on the MCP server of IndiGo.
Okay.
The final response is given to the MCP client, and the client then forwards it back to the user, resulting in customer satisfaction.
Okay.
You see that the user is no longer just getting data back.
They do not have to do things themselves after being given the recipe; the recipe can be completely executed via the MCP client.
Okay, so this makes LLMs a lot more powerful.
MCP has picked up a lot of popularity now.
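Here is a heavily simplified sketch of that client loop. It is not the real MCP wire protocol or any official SDK; call_llm and the MCP server wrappers (for example around IndiGo's or Air India's data) are hypothetical placeholders.

```python
# A simplified sketch of the flow described above, not the actual MCP protocol.
# `call_llm` and the entries of `mcp_servers` are hypothetical placeholders.
def handle_query(user_query, call_llm, mcp_servers, max_steps=8):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        decision = call_llm(messages)                    # the LLM decides: answer, or call a tool?
        if decision["type"] == "tool_call":
            server = mcp_servers[decision["server"]]     # e.g. "indigo" or "air_india"
            result = server.call(decision["tool"], decision["args"])
            # Feed the tool result back so the LLM can keep reasoning with it.
            messages.append({"role": "tool", "content": str(result)})
        else:
            return decision["content"]                   # final answer goes back to the user
    return "Stopped after too many tool calls."
```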
Okay, so all of this put together
is called context engineering.
If you are an engineer, you have probably heard of this term.
And basically this is an encapsulation of many of the things that we have already discussed.
We discussed few-shot prompting, which is giving examples.
We discussed retrieval augmented generation, which is getting relevant documents from a vector database and using them to add context to a query.
And we discussed using model context protocol to hit external servers and perform actions as needed.
When it comes to context engineering, there are two new challenges that we are facing as engineers.
One is user preferences, and the second is prompt summarization; you can call it context summarization.
For example, you might use a sliding window, where the last 100 chats are sent directly to the large language model, and all the previous chats are summarized into, say, five sentences.
This limits the maximum amount of chat history that you are sending to the large language model.
You could use other techniques also.
For example, some people just focus on keywords.
Some people focus just on the last chat, so one chat and a summary of the previous entire history together.
The idea is to get context summarization.
In the same way, when you get a document, you again summarize it first and then send it.
So this can be done maybe using a cheap small language
model or a distilled model.
And once you have generated the context,
you send that to the expensive large language model.
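Here is a small sketch of the sliding window idea. The summarize call is a hypothetical placeholder for that cheap small or distilled model, and the window size is made up.

```python
# A minimal sketch of the sliding-window context idea: recent chats go to the
# model verbatim, older chats are collapsed into a short summary first.
def build_context(chat_history, summarize, window=100):
    recent = chat_history[-window:]                  # last N chats, sent as-is
    older = chat_history[:-window]                   # everything before the window
    parts = []
    if older:
        parts.append("Summary of earlier conversation:\n" + summarize(older))
    parts.append("Recent messages:\n" + "\n".join(recent))
    return "\n\n".join(parts)

# Trivial stand-in summarizer, just to show the shape of the final context.
fake_summarize = lambda msgs: f"({len(msgs)} earlier messages about a delayed parcel)"
history = [f"message {i}" for i in range(1, 121)]
print(build_context(history, fake_summarize)[:120])
```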
You see, the main difference between context engineering and prompt engineering
is prompt engineering is for one single prompt.
It is stateless.
Anytime you ask the large language model to behave in a particular way,
the system prompt is going to be the same.
But context engineering evolves as per the user's declared preferences and also the previous chat history.
It's similar to what we had earlier, but this is more long term.
Which brings us to the most long term thing you can come up with in the AI space right now.
Agents.
I've taken a detailed video on this, so do check that out.
But at a high level, you have a long running process, which is known as an agent.
You can think of this as a server which is getting an API call, and it has many capabilities.
It can go and query an LLM.
It can also query external systems and other agents to meet the user's requirements.
Let's take an example here.
Let's say your travel agent can look into booking flights, booking hotels, and even manage your email when you're away.
When it sees a window of opportunity, maybe the flights are cheap at that time, it goes ahead and makes the booking according to your preferences.
All of this can be managed by an agent.
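Here is a minimal sketch of an agent as a long-running loop. call_llm and the tools (flight search, hotel booking, email) are hypothetical placeholders; a real agent adds memory, error handling and guardrails.

```python
# A minimal sketch of an agent: a loop that repeatedly asks an LLM what to do
# next and then executes that action. All names here are illustrative.
def run_agent(goal, call_llm, tools, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = call_llm("\n".join(history))          # the LLM plans the next step
        if decision["action"] == "finish":
            return decision["result"]                    # goal met, report back to the user
        tool = tools[decision["action"]]                 # e.g. "search_flights" or "book_hotel"
        observation = tool(**decision["args"])           # act on the outside world
        history.append(f"Did {decision['action']}, observed: {observation}")
    return "Stopped: step limit reached before finishing the goal."
```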
And the most hyped term here is reinforcement learning.
It's a way in which you can train models to behave in a particular way.
So, for example, if you give a query, a user query, to the model, the model can generate two responses: response one and response two.
You must have seen this in ChatGPT: choose the one which is better.
Okay, so the one which is chosen gets a plus one.
The other one gets a minus one.
What happened effectively is you took a user query.
This entire thing can be mapped to a vector, and the vector is in an n dimensional space.
So you go to that coordinate, and you tell the model: look, after reaching here, you generated further tokens, further vectors.
So that's your path.
You went from here to here to here, and this was the final point of the response.
And now you got a score of plus one, so this point gets a score of plus one, and the points along the path also get plus one, plus one, plus one.
There's also discounting that you can do, but for now let's just keep things simple.
This is a nice path; you always want to follow this path.
Response two was bad.
There, you followed this point to this point to this point, and then you deviated.
The next token that you generated after the first three tokens, let's say, was different, and then the path went somewhere else; so instead of tokens one, two, three, four, you have tokens one, two, three and a different fourth token.
Okay, this was bad.
It got a score of minus one, which means this area gets a score of minus one, and the points along that path also get minus one, minus one, minus one.
Where the two paths overlap, minus one plus one takes it to zero.
So what you're doing is you have a space
where you have negative scores, positive scores and neutral scores.
If you do this enough, then you will end up with a space, a vector space, where, given an input query, given a starting point, you will have regions of negative score where you do not want to go, and regions of positive score where you definitely want to go.
And the more positive it is, the more you want to go there.
Okay, so maybe you go here.
From here you have another very positive space which is over here.
This is like hill climbing, right?
You're basically trying to optimize on the path
that you're taking as a large language model.
The expectation is that the final result will make the end user happy.
Okay.
If the end user experience is good, then the model is trained to make users happy.
That's what reinforcement learning with human feedback is.
The human feedback is telling you whether it is a plus one or a minus one, and the feedback is helping you reinforce good outputs.
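Here is a small sketch of just the plus one / minus one bookkeeping described above. The example completions are illustrative, and real RLHF trains a reward model and updates the network weights (for example with PPO), so this only shows the scoring intuition.

```python
# A minimal sketch of the +1/-1 bookkeeping: the chosen response's token path
# is reinforced, the rejected one is penalized, and shared prefixes cancel out.
from collections import defaultdict

scores = defaultdict(int)

def score_path(tokens, reward):
    # Every prefix along the path picks up the reward, like the regions above.
    for i in range(1, len(tokens) + 1):
        scores[tuple(tokens[:i])] += reward

chosen   = ["all", "that", "glitters", "is", "not", "gold"]
rejected = ["all", "that", "glitters", "is", "not", "going"]

score_path(chosen, +1)
score_path(rejected, -1)

print(scores[tuple(chosen)])                 # +1: the good path
print(scores[tuple(rejected)])               # -1: the bad path
print(scores[("all", "that", "glitters")])   # 0: minus one plus one on the shared prefix
```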
This is an extremely powerful technique.
In fact, it is seen in nature.
If you know about Pavlov's dog, there was this situation where Pavlov would ring a bell and give food to the dog, which would come after hearing the bell.
Eventually he realized that if he just rings the bell without giving food, the dog still comes and starts salivating, because it's expecting food.
So its behaviour has been reinforced.
Fortunately, this is not the only capability that human beings have.
You cannot model human intelligence using just reinforcement learning.
I'll take an example.
Let's say you have a coin which is giving you heads.
Heads. Heads. Heads.
If you know that this is a fair coin.
If you have a mental understanding of how the coin works,
then what do you think is coming next?
Heads or tails?
Okay.
With what probability?
(I just looked at the camera and said 'okay' twice; something's going on.)
But as a human being, you should look at this and say: if it is a fair coin, if it's an unbiased coin, then it can be heads or tails.
You can't guarantee that it is going to be heads next.
But reinforcement learning just looks: it observes the real world and, based on that, makes a decision.
So when it predicts heads it gets reinforced.
Great job.
When it predicts tails, it gets punished.
Bad job.
But the reality is this is a fair coin, so there's a 50-50 chance of either.
If you ask a human being, you show them the coin.
You tell them that this is a fair coin, and then you just keep flipping the coin.
You get a lot of heads.
They're just going to say 50-50, because they have an internal representation of how the coin works.
They have a mental model of the physics of the coin.
Reinforcement learning cannot build mental models; it can just tell you, based on outcomes, what is more likely and what is maybe a more beneficial path.
Okay, we are not crocodiles. We are humans.
We have a deeper understanding of how things work.
Having said that, reinforcement learning is a powerful technique.
It does make models get smarter.
Quite smart right?
Chain of thought.
Pretty simple concept, but very powerful.
When training the model, we clearly explain our thought process here.
The expectation is that as the model trains
to break a problem step by step, it's going to look at newer problems
with different parameters and still be able to reason through them
because it has been trained to reason step by step.
This is called chain of thought, where the model goes through
a series of deductions or inferences and comes up with the final response.
The quality of this response is usually much
higher than a direct response.
As you can see, this is similar to few-shot prompting.
The quality of the response is higher.
It has some examples to go through, but here the key
difference is that there is a step by step breakdown, and new
steps can be added by the model as it sees fit.
Because it is trained on so much training data, it may be able to reason
to add more steps as the problem gets more and more difficult.
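Here is a small sketch of a chain of thought prompt. The worked example and call_llm are illustrative placeholders, not any model's actual training data.

```python
# A minimal sketch of chain of thought prompting: the prompt contains one
# worked example with explicit intermediate steps, nudging the model to reason
# step by step on the new problem too.
COT_EXAMPLE = (
    "Q: A shop sells pens at 12 rupees each. How much do 7 pens cost?\n"
    "Reasoning: Each pen costs 12 rupees, so 7 pens cost 7 * 12 = 84 rupees.\n"
    "A: 84 rupees."
)

def chain_of_thought_prompt(question):
    return f"{COT_EXAMPLE}\n\nQ: {question}\nReasoning:"   # ask for the steps before the answer

print(chain_of_thought_prompt("A train covers 60 km in 45 minutes. What is its speed in km/h?"))
# response = call_llm(prompt)  # hypothetical: the model writes its steps, then the final answer
```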
Okay.
In fact, this is something that has been seen with DeepSeek.
If you make the problem harder, it goes for more steps.
If you make the problem easier, then it goes for fewer steps.
So this is called a reasoning model.
Okay.
They do not necessarily need to do chain of thought; they can also use other algorithms.
For example, there is tree of thought and graph of thought also that you can go through.
You can use tools also to come up with better reasoning.
But a model that can reason, a model that can figure out, given a problem, how to solve that problem step by step, is a reasoning model.
These are also known as LRMs, large reasoning models.
Examples of this are DeepSeek and OpenAI's o1 and o3 models, among others.
All of these are newer models with new capabilities.
Now, multimodal models. Okay.
So the basic idea
is that most large language models that we know of operate on text.
But what about models which can accept and create images, generate images?
What about models which can accept and create videos? Okay.
So they can analyze images.
They can tell you the number of apples in an image, let's say.
Or they can modify an image to create a new image.
Similarly for video.
These have tremendous applications, similar to how large language models have changed the marketing space for textual content.
Now, social media is rife with large language model content.
Because if you have celebrities whose video you can create, if you can create ads through these models, then the cost expectation of creating video is going to go down.
Okay, this is already happening to some extent, but the quality of the models is not very good yet.
Multimodal in general means any kind of mode
of input data.
It turns out that their performance is better than models
which are just trained on text. Okay.
They have a deeper understanding of the meaning of objects.
If you train a model on 'cat' and 'feline' and so on, and then also show it images of cats, then the performance of the model, the output quality, is usually better.
Okay.
The training is better.
Fine.
Let's get to three major topics,
which is where the AI space is heading.
Okay.
People are looking for more company specific, smaller models and foundation models.
The reason for this is companies want more control over what they generate.
They also want to keep the data close to themselves.
They don't want to expose it to any other third party company.
So one of the things which is happening is we are looking at small language models.
As you can expect from the name, these have fewer parameters than large language models.
For example, a small language model may have 3 million to 300 million parameters.
Okay, the neural network internally has fewer connections, fewer weights.
But if you look at large language models, in contrast, you have 3 to 300 billion parameters.
So an LLM is a very large neural network with a lot of weights, but the SLM is smaller.
They are useful because they are trained on less data, which can be company specific or task specific.
For example, a bot which is trained on just customer queries, how to manage
customer queries, how to make sales is likely to perform decently well.
Okay, it's going to be an expert at sales, but it probably can't tell you
a detailed weather analysis.
For most companies, this doesn't matter.
In the case of NASA, this is what you need.
They are probably not selling anything openly, or maybe they are, who knows?
But NASA would be more interested in building a foundation model
which can predict the weather, but not bothered about the sales part.
So in this way, smaller
language models are being trained by companies
on their specific data, on the proprietary data
to come up with reasonably good responses for specific use cases.
And the process of building small language
models is usually distillation.
The basic idea is
you have a large language model,
which is a teacher,
and then you pass in some input.
You look at the output of the large language model,
and in parallel you also send it to a small language model.
Okay, with fewer parameters, and it also tries to predict the output.
So the teacher produces an output and the student
tries to mimic the teacher.
If these two outputs match, then the small language model is doing well.
No weights need to change, but if it is not doing well,
then the internal weights of the small language model are changed.
But there is a limited number of weights assigned to this model, 3 to 300 million.
What you are basically trying to do is condense this information, the complex neural network, into the most reasonable representation that you can have, such that your performance is okay but the costs are significantly reduced.
So during runtime, during production inference time, when you get a query, this is going to be much faster at responding as compared to the large language model.
It's also easier to host.
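Here is a toy numpy sketch of the teacher-student idea: the student has fewer weights and is nudged to match the teacher's outputs. Both models are simple linear maps, the sizes and learning rate are made up, and real distillation matches a large transformer's token probabilities instead.

```python
# A minimal sketch of distillation: a big "teacher" produces outputs and a
# smaller "student" with fewer weights is trained to mimic them.
import numpy as np

rng = np.random.default_rng(42)
teacher_w = rng.normal(size=(16, 4))        # teacher: 16 x 4 = 64 weights
student_w = np.zeros((12, 4))               # student: 12 x 4 = 48 weights, sees only part of the input

lr = 0.1
for step in range(1000):
    x = rng.normal(size=(32, 16))           # a batch of inputs
    teacher_out = x @ teacher_w             # what the teacher predicts
    student_out = x[:, :12] @ student_w     # what the smaller student predicts
    err = student_out - teacher_out         # mismatch: the student is penalized for deviating
    student_w -= lr * x[:, :12].T @ err / len(x)   # nudge the student towards the teacher

# The student now approximates the teacher with fewer weights; it's not perfect,
# and that is exactly the cost/quality trade-off of distillation.
print(float(np.mean((x[:, :12] @ student_w - x @ teacher_w) ** 2)))
```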
Okay.
Distilled models take us to the last term that you really should know if you are an engineer, and that is quantization.
Here the idea is that you have neural networks, and each of these weights is basically a number, let's say a 32 bit number.
What if you could take these weights and condense that information into eight bits?
Then 75% of your memory is expected to be saved.
It doesn't map over directly, because the quantization is usually just done on the feedforward neural network; you still have the attention mechanism.
Also, the training cost is the same, because initially you come up with a really good model with zero quantization, and once the model is completely trained, that's when you apply quantization.
So the training cost does not reduce.
This is mainly to reduce inference cost, the cost of running a model in production.
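Here is a small numpy sketch of post-training int8 quantization of one weight tensor. Real schemes use per-channel scales, zero points and even lower bit widths; the numbers here are made up.

```python
# A minimal sketch of post-training quantization: 32-bit float weights become
# 8-bit integers plus a scale factor, cutting weight memory roughly by 4x.
import numpy as np

weights_fp32 = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0                     # one scale for the whole tensor
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)  # 8-bit storage

dequantized = weights_int8.astype(np.float32) * scale          # what inference multiplies with

print(weights_fp32.nbytes, "bytes ->", weights_int8.nbytes, "bytes")  # 64 -> 16: ~75% saved
print(float(np.max(np.abs(dequantized - weights_fp32))))       # small rounding error per weight
```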
So these are the most important
20 terms that I want to discuss in the engineering space.
I think
knowing these terms will help you effectively communicate
with any other engineer or people in the team.
I couldn't go into enough detail here because, I mean, when you're talking about the attention mechanism or quantization, you cannot do that in a 20-30 minute video.
But the things you should know about are these terms.
And also, most of the things that are mentioned in the engineering course build on these.
If you know them, then you truly understand how these models work.
And all of the hype and nonsense which is going on in this space becomes recognizable as hype and nonsense to you, right?
You are able to recognize it much better. Thank you for watching.
I hope you enjoyed the video. I'll see you next time. Bye bye.