This workshop introduces Retrieval Augmented Generation (RAG) by building a practical, from-scratch pipeline. It emphasizes understanding the underlying engineering trade-offs and complexities beyond introductory tutorials, aiming to equip participants with the knowledge to make informed decisions in real-world RAG implementations.
Okay, so let's get started with today's workshop. I'm really excited for this because I have planned this workshop for quite some time, and this is the first time I'm actually conducting it live. We have two sets of participants today: one group is already attending our live classes, and the other is attending just this workshop specifically.
I'm still admitting a few participants.
Yeah. And regarding the lecture recordings: I'm recording both today's lecture and tomorrow's, and we'll share them with all of you as soon as the recordings are done, so no need to worry about receiving the recording after the lecture. For the people enrolled in the live classes, I'll upload it to the dashboard as always, and for the others, I'll send it through email.
So the reason I thought of conducting this workshop is that the way I think about retrieval augmented generation, or RAG, has changed a lot over the last two years.
There is a question: do we need to code along with you? Yes, definitely you'll need to code. So I would highly recommend not attending this workshop on a phone, because it's just not a good experience. We'll be coding everything from scratch, and we'll be doing everything on Google Colab.
So let me get started with the lecture objectives. And when I say lecture, we'll actually have two lectures within this workshop. I'll first tell you what we will try to accomplish, and then we'll start going through every single thing in detail. So I'll tell you my RAG journey. RAG stands for retrieval augmented generation. No need to be scared by this name; it looks a bit complex, but we'll see what all of these words actually mean. Before that, I'll tell you about
my experience with RAG. There are some questions in the chat: what do you mean by students attending live classes? There is a live batch going on in the hands-on LLM series, and this workshop falls in the middle of that course. I will cover agentic RAG, but not in this workshop; that comes in subsequent lectures of the live classes.
Okay. So if you take a look at RAG tutorials, you'll see that a number of short tutorials pop up. There are some tutorials which are 10 to 15 minutes long, and some which are just 5 minutes; there are actually RAG tutorials which teach you how to build a chatbot in 5 minutes. Then there are these 20-minute tutorials, 25-minute tutorials. And when you watch them, you feel that, okay, this is simple, I have understood what retrieval augmented generation is. But that's actually not the case. Only when I started solving industrial problems did I realize that the whole pipeline is far more complicated than what is shown in these introductory videos. There are several things which no one ever talks about. For example, chunking. Chunking is very briefly mentioned in introductory videos, but no one codes through chunking and actually teaches engineers which chunking strategy to use at what time. If you don't know these terminologies, don't worry; I'm going to cover every single aspect in detail.
Then second is file parsing. In most of these tutorials, it's already assumed that you have the file, but in fact that's one of the most important steps, and frankly quite challenging. Another neglected aspect is evaluation, which I'm calling evals, and which in industrial settings has become critical: okay, you build a RAG pipeline and you submit it to the client or plug it into your internal workflow, but is it working or not? How are you continuously monitoring whether your RAG pipeline is delivering good results? And more importantly, there is the question of embeddings. In all of these short tutorials, they use vector databases or vector stores without ever asking why we need vector stores at all. Can we just do embeddings in PyTorch? And what are vector stores?
We are going to see all of this today. In fact, at several points in this tutorial, I'm going to have a section called engineer's choice, which I specifically curated based on my own industrial experience. Unlike all of these tutorials, I don't want to tell you "go ahead and use this, go ahead and use that." Instead, I'll make you aware of the trade-offs, and when I say trade-offs, I mean how you should select the tool for your particular use case. My goal is that after this workshop, when you face these trade-offs in industry or wherever you implement this, you are in a position to decide what the best tool is for your case. I'm going to show you the different trade-offs I encounter in our industrial problems.
And then we are going to assemble a whole RAG pipeline from scratch. When I say from scratch, I mean we are not going to use a library like LangChain or LangGraph today or tomorrow, because all of that will seem very simple to you after going through this workshop; we are going to code everything from the ground up. While doing that, we'll see the different engineering choices you need to make, and I'll also show you the different packages and libraries which are emerging and which are useful in industrial settings. So look at this workshop not as a toy series but as an industrial-level workshop. When you go to industry, things are not black and white; they are mostly gray. There is usually no single right solution, but the engineer who stands out is the one who can figure out the best solution for the given problem. That's what I want to teach you. So if you have questions on any aspect, ask me. My goal is for you to understand the nuts and bolts of RAG in detail, so that it is not just a terminology where you think, okay, RAG is easy, I can cover it in 10 minutes. After this, all of you will be able to build chatbots, and hopefully you will be able to understand the trade-offs when we build pipelines.
So what's our end goal? Our end goal after this workshop is to build an application such as this: a RAG-based nutritional chatbot built entirely from scratch. There are also two types of RAG systems: one which directly provides the answer, and one which, along with the answer, also provides citations. So we are also going to look at how to provide references and citations, and what it means when it says a 56% match or a 55% match. We will not spend too much time on the back-end and front-end coding; we are going to do that through Lovable. So all of us might end up with different-looking websites at the end of this workshop, but that will be the fun of it, right? We'll share the websites which all of us have obtained.
Yeah, and the lecture will be such that I will explain many aspects through a whiteboard, and then there are several code files which I have designed. All of these code files I'll share with you at specific intervals within this workshop, and all of them will be on Google Colab. Tomorrow we are going to use some external tools; the only ones we'll need are Supabase and Lovable. How many of you have heard about Supabase, by the way, or used it before? It's fine if you have not heard of this tool; I'm going to show you what it is, because it's used a lot in production-level settings these days. So Supabase is one tool we'll need, and the second is Lovable. For everything else, I believe that even with just the T4 GPU which is provided for free through Google Colab, you'll be able to follow along in this workshop.
As the guiding principle of most of this lecture, the prompt engineering rules which we saw in some of our previous lectures are going to be important. So let me just introduce the seven key elements of writing an effective prompt. The rule is called PICFATD, which basically means that in a prompt you have to define many things instead of just writing something quick. The first is the persona, that is, the identity the model should take on; then you have the instruction; then the context; then the format; then the audience; the tone; and finally the data. These are the seven key elements of an ideal prompt, and we are going to use them when designing RAG pipelines. In fact, the base of everything which is to follow, such as RAG, and later agentic workflows and MCP, is a good prompt. So please keep these seven things in mind when writing an effective prompt.
Don't just write something quick. And I'll stress that today when we are building this chatbot project: it is extremely important that you spend time writing the prompt. Think about five years into the future: if English is going to be the new programming language, then prompt engineering is going to matter more than ever.
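To make this concrete, here is a minimal sketch of how the seven PICFATD elements might be assembled into one prompt string. The helper name, field wording, and example values are my own illustration, not a fixed template from the workshop.

```python
# Hypothetical helper that assembles the seven PICFATD elements
# (persona, instruction, context, format, audience, tone, data)
# into a single prompt string.
def build_prompt(persona, instruction, context, fmt, audience, tone, data):
    return "\n".join([
        f"Persona: {persona}",
        f"Instruction: {instruction}",
        f"Context: {context}",
        f"Format: {fmt}",
        f"Audience: {audience}",
        f"Tone: {tone}",
        f"Data: {data}",
    ])

prompt = build_prompt(
    persona="You are a certified nutritionist.",
    instruction="Answer the user's question using only the supplied data.",
    context="The user is chatting with a nutrition chatbot.",
    fmt="Respond in 2-3 short sentences.",
    audience="General public with no medical background.",
    tone="Friendly and factual.",
    data="Relevant excerpt from the human-nutrition PDF goes here.",
)
print(prompt)
```

The point is simply that each of the seven elements gets an explicit slot, rather than everything being mashed into one quick sentence.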
Then I'll take some questions in the chat, but before that, let me tell you the philosophy of this workshop. The way I have designed it, it will cover three aspects. First, foundations, which according to me is the most important: you should know the nuts and bolts of the entire RAG pipeline and be able to make engineering decisions when the time comes. Second, practicals: at various places I'm going to give you practical insights regarding which chunking strategy to use, which embedding strategy to use, and how to deploy the RAG project. And finally, I'm going to leave you with research questions, or research directions: after this workshop, I'm going to show you the open research problems in this area, which you can immediately start working on once these live sessions are done.
Let me take questions in the chat. Prashant has asked: this might be a question for later, but should everyone move from plain vanilla RAG to agentic RAG? In a production setup we are only seeing 55% accuracy with standard RAG. That's a good question, Prashant. I'll tell you my experience from industry. So far we have done around 16 industrial projects; out of those, 10 have been RAG-based, and in those we have been able to satisfy the customer with a pure RAG pipeline. And when I say vanilla RAG, I don't mean just a simple "upload a PDF, query the PDF, give it to the LLM." In the pipeline we designed for the customer, we did not use agents, but we used many modern techniques which ground the responses, and I'm going to share that knowledge today as well. But vanilla RAG works for problems which are not too complex, in my opinion. There are many chatbot requirements in industry, and all of those can be solved with vanilla RAG. And not just chatbots: there are level-two requirements, where a company basically wants to build a code generator based on their docs, and that can also be solved by RAG.
Agentic RAG plays a very crucial role when you want access to external tools or when you want to do something complex. Let's say a company wants to build its own deep research agent; that is a difficult thing, and traditional RAG alone won't be enough there. But at least in my experience over the last year, and this is one of the main reasons I thought of running this workshop, RAG is still very relevant, and vanilla RAG solves many level-one and level-two company problems: chatbot generation, code generation based on what they have, and so on.
Then another question in the chat: when will the lecture notes and recordings be uploaded? The lecture notes and the recordings I'll share after each lecture is done. So after the first lecture, I'll send each participant an email with the link to the whiteboard notes and the link to the recording.
What is agentic RAG? I'll explain that in detail later, but in a RAG pipeline you have access to embeddings, so think of the embedding store as a tool. If you start thinking of the embedding store as a tool, then it suddenly becomes an agentic pipeline, where along with all the other tools, the agent also has access to the embedding store, or the vector store.
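One way to picture this "retrieval as a tool" idea: if the retrieval step is wrapped as just another callable, it sits in the same registry as any external tool. This is a toy sketch with entirely hypothetical names; a real agent would let the LLM choose the tool and would call a real embedding store.

```python
# Sketch: the vector/embedding store exposed as one tool among many.
# Both functions are placeholders standing in for real integrations.
def search_docs(query: str) -> str:
    # stand-in for embedding-store retrieval
    return f"top chunks for: {query}"

def get_weather(city: str) -> str:
    # stand-in for some external tool
    return f"weather in {city}"

TOOLS = {"search_docs": search_docs, "get_weather": get_weather}

# The "agent" (normally an LLM deciding from the user's request)
# picks a tool by name; here we hard-code the choice to show the flow.
chosen, arg = "search_docs", "protein requirements"
result = TOOLS[chosen](arg)
print(result)  # top chunks for: protein requirements
```

Nothing about the retrieval code itself changes; what changes is that the agent decides when to call it, alongside every other tool.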
Would you discuss query transformations? Yes, I will discuss that towards the end of this workshop.
Would plain RAG help in application modernization? Yeah, definitely. One common application which I would like to share with all of you is ITSM tools. How many of you are aware of information technology service management? If you look at India at least, there is a whole middle layer of companies which operate in the ITSM space. Let's say you are on Razorpay and you make a payment through Razorpay: what happens on their back end? How is the payment stored and processed? That's essentially information technology service management. If you book a movie ticket on BookMyShow, what happens on the BookMyShow server? How are they managing the different clients which are booking? That's usually handled through an ITSM company. These companies provide dashboards to players like Zomato and BookMyShow, players which need IT infrastructure. Those dashboards have been based on legacy systems and traditional pipelines for a very long time, and now they want to integrate chatbots within those dashboards. For these types of integrations, RAG systems will still play a very crucial and important role, because they usually have a fixed body of data.
LlamaIndex? Yeah, LlamaIndex and LangChain and LangGraph can all implement RAG pipelines very easily. If I were to make a tutorial on RAG using LlamaIndex, that would probably be a 25-30 minute tutorial. But my main aim here is to build such a strong foundation that after this, any tool will seem very simple to you, whether it's LlamaIndex, LangGraph, or LangChain.
So let's get started now. I've used this terminology, RAG, many times so far, and those of you who don't know it or have not heard of it, do not worry; I'm going to motivate it in a lot of detail.
For the purpose of this workshop, imagine that we are in the nutrition domain. The document we are going to consider is this 1,200-page document on human nutrition, and I'm going to share the Drive link right now with all of you. We will see what these different things are; we don't need them right now. For now, all you need to do, while I'm showing this PDF, is download it from the Drive link which I have just shared in the chat, so that you can refer to the PDF along with me as I go along.
Now I want to ask all of you a question. Imagine that you are working in industry, you are on the engineering team, and you are in a meeting with a client. The client has started a nutrition startup, and they want to spread awareness about nutrition globally; for that, they want to make a chatbot. And they want a chatbot which looks something like this: a customer will come, log in, and ask some questions, and the answer which is generated has to be very specific and very grounded. I'll use this term grounded a lot. What does grounded mean? Whenever someone says grounded, you should ask: grounded with respect to what? This startup wants its answers grounded with respect to its encyclopedia of knowledge, which for now is basically this book. It's a 1,200-page PDF about human nutrition, 2020 edition, and it covers a huge number of topics, from basic concepts in nutrition to the human body to water and electrolytes; it covers every single thing about nutrition, and they want their answers grounded in it.
Now, this same example I'm taking of human nutrition you can translate to other domains as well. If you want to make a chatbot for customer service, there will be a manual of customer questions and what the ideal answers should be.
If you are making a chatbot for ITSM, there will be a manual of tickets which customers usually raise and a sample of the solutions. Now my question to all of you: let's say you are sitting in that meeting as an engineer, and this client comes to you with this request of making a chatbot. Forget about RAG or this terminology of retrieval augmented generation. Let's think from first principles: how exactly will you build this? That's the goal, right? The goal is to build a nutritional chatbot.
But what's the key requirement which I mentioned? It should be grounded: grounded in factual knowledge based on the book which I just shared with all of you. How will you do that? Answers are coming in from the chat: add the document content somehow as a prompt; use the PDF and pass it to the LLM. Okay. So what Madusan has mentioned, that is already the RAG pipeline. I'm asking you to think from first principles: forget all of your knowledge, and let's say the only thing you have is ChatGPT, or access to any LLM for that matter. Let's say you have this, and that's all. More answers: add the PDF in the context of the LLM; instruct it to answer with information found in the PDF; we will load the data into ChatGPT.
Okay, so the simplest thing which many people are suggesting is aligned with the data portion of the prompt. I showed you the seven elements of a prompt, and there is this data portion, which is where you usually feed the data and where we usually ask the question. So many people are saying: okay, this seems like a simple enough task, why not just do that? There is also an answer from Prashant about keyword-based search. Keyword-based search, okay, that can be done, but you want to use a modern approach. So you propose to the client: hey, this seems like an easy thing to do, we just make a front end. And that front end looks something like this: this is the human query, this is the answer; this is the human query, this is the answer. The human query I'm denoting by HQ.
So you start thinking from the front end. Then you think: whenever a human query is asked, you pass it directly to an LLM like ChatGPT, and along with it you also pass the PDF. You make this API call to the LLM, and the answer you then show in the front end. Then the user asks another query; you again make an API call to the LLM, you again pass the entire PDF in the context of the LLM, and you get the answer. That's what would have naturally come to my mind if I were thinking from first principles and did not know anything about retrieval augmented generation. But what are the issues with this approach?
Can you try to think, as the engineer who goes back and tries to implement this, what the issues with this approach will be? Amit is saying high cost; Samarat is saying too many tokens, context length. So let's actually see this, and I encourage all of you to try it in practice: go to ChatGPT. I went to ChatGPT right now, I put in this exact same PDF, and I asked: what is the number of tokens in this document?
What is the number of tokens in this document? Does it fit your context window? What is a context window? The context window is the number of tokens which a language model can look at at one time before producing an answer. Think of it like this: imagine you are being bombarded with information. Someone tells you about one topic, then the lecture goes on for 2 hours, 3 hours, 4 hours, 5 hours. At some point you will start losing information. The context window is the maximum amount of information you can fit in at one time while still producing coherent answers. Whenever an LLM like GPT is designed, the context window is fixed. So if you put in this document and ask whether it fits the context window, the token count comes out far larger than the context window of ChatGPT, and I'm using GPT-5 here; its context window is around 128K. So here we see that the entire document does not fit into memory at once.
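As a rough back-of-the-envelope check you can run yourself, here is a sketch that estimates a document's token count using the common heuristic of roughly four characters per English token. The ratio and the per-page character count are assumptions for illustration; a real tokenizer (for example tiktoken) would give the exact count.

```python
# Rough token estimate: ~4 characters per token is a common
# rule of thumb for English text (an assumption, not exact).
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)

def fits_context(text: str, context_window: int = 128_000) -> bool:
    return estimate_tokens(text) <= context_window

# A 1,200-page book at an assumed ~3,000 characters per page:
book_text = "x" * (1200 * 3000)       # stand-in for the real PDF text
print(estimate_tokens(book_text))     # ~900,000 estimated tokens
print(fits_context(book_text))        # False: far exceeds a 128K window
```

Even with generous rounding, a book of this size lands several times over a 128K-token window, which is exactly the problem described above.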
And what will happen if the entire document does not fit into memory? If the human asks a question related to, say, this chapter on nutritional issues, and only the tokens up to page 700 fit into the context, then the relevant context is lost, the LLM will not be able to answer correctly, and the answers will be wrong. And then what will the LLM do? At this point, the LLM might start to answer from its own pre-trained knowledge. The LLM effectively says: this document does not fit in my context window and I don't see the relevant text in my context, so I'll use my own pre-training data; I don't need to rely on any document. When an LLM becomes overconfident like that and starts answering from its own corpus, that leads to one of the major problems retrieval augmented generation was built to address; RAG did not fully solve it, but it was a good step in that direction. So if you pass the entire PDF at once, it might exceed the context window of the language model, and that might lead to hallucinations.
What is the solution to this problem? The solution came with a paper released in 2021, and the solution is retrieval augmented generation. You can definitely read through the paper, but the idea of retrieval augmented generation is very similar to an example which you all know. Let's say you have been given this text on human nutrition and you have an exam, but it's an open book exam. I hope all of you know what an open book exam is: you can put the book in front of you, and you have access to all of this material, the entire book, during the exam. So you are sitting in that lecture hall and you see a question related to, let's say, proteins. How will you answer this question at that point? Can all of you try to think about it, sitting in that open book exam, having been asked that question?
Yeah: go to the index, find the topic, find it in the chapters. So what all of you will probably do is look at this word, proteins, then go through the PDF from the start. You will maybe look at the index or the table of contents; if it's not there in the table of contents, you will go through all the pages and try to find the page where this particular information shows up. Then you will highlight that information and use that knowledge. The question which was asked might not be completely related to that passage, but you will use that information from the book. Plus, another key component which of course you need is your own mind. Your own mind already has some information, because you might have studied for this exam. On top of that, you get some information exactly based on the book's contents, and then you produce the answer.
Now, this entire pipeline is very similar to what retrieval augmented generation is: the retrieval part is fetching the relevant passage from the book, and the generation part is producing the answer from your own mind. If you were not fetching context from this book and this whole retrieval step were not there, that's just the generation part. But now you have augmented the generation part with some retrieval from the document. That's where the term retrieval augmented generation actually comes from.
There is a question in the chat: do you plan to share your screen? It's visible, right? Okay, I guess it was frozen for some time whenever I went to the prompt engineering book. Yeah, now I'm back to my main screen.
So we retrieved context from the document, and we also generated an answer from our own mind. That's retrieval augmented generation. How does it translate to the startup app we discussed? The mind here is the LLM with its own pre-trained knowledge, and instead of passing the entire document to the LLM, we pass only the context which is relevant; and instead of the word pass, a fancier word is retrieve: we only retrieve the context from the PDF which is relevant. So instead of the earlier pipeline we saw, what if we make a different pipeline, something like this? We still have our front end, with the human question and the answer. When the human asks a question, it will again go to the LLM; that is fine. But the LLM will also somehow get only that piece of context which is relevant. And now that's the retrieval part: this relevant context is passed to the LLM.
Do you see the problem this will solve? We started out with the context problem. Now we don't have to pass the entire PDF into the context; we only pass the relevant bits of information. What are the relevant bits? The same bits which, as a student, we highlighted when doing the open book exam. That relevant bit of information is passed to the LLM, so the context window problem is solved. The natural consequence is that the LLM will now produce answers which are more factual and more grounded in reality, based on the exact document which the client has shared with me. Now I can be sure the answers will be specifically tailored. So when I ask a question here, and you see the answer being printed on the screen, you will also see citations.
Yeah. So these citations refer to the portion of the document the generated answer is based on. This piece comes directly from the document itself, on page 592; this one comes directly from the document on page 53. So you are retrieving relevant pieces from the document from various places; it does not need to be from one place. You are passing them into the context of the LLM, and then you are generating the answer.
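The whole loop, retrieve the most relevant passages and then generate from only those, can be sketched in a few lines of plain Python. This is a deliberately naive illustration that uses keyword overlap as the relevance score; a real pipeline would use embeddings and a vector store, and `call_llm` below is a hypothetical placeholder for whatever model API you use.

```python
# Naive RAG sketch: score each chunk by keyword overlap with the
# query, keep the top ones, and build a grounded prompt from them.
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Proteins are made of amino acids and support muscle repair.",
    "Water and electrolytes regulate hydration in the body.",
    "Vitamin C is found in citrus fruits and aids immunity.",
]
top = retrieve("what are proteins made of", chunks)
prompt = build_grounded_prompt("What are proteins made of?", top)
# prompt now contains only the relevant chunk(s), not the whole book;
# pass it to your LLM of choice, e.g. answer = call_llm(prompt)
```

Note that only the retrieved chunks enter the prompt, which is exactly how the context window problem is avoided: the book can be arbitrarily large, but the prompt stays small.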
Okay, so that's the whole concept of RAG. If you have any questions, please ask; I'll be taking all questions through the chat, since the size of the room is quite large. I just want to make sure the stage is clear before we move to the next part. One of the teaching philosophies I follow is that before explaining anything, you need to understand the context behind it. I know many of you might be wondering about the details here: how do we get the relevant context? Which LLM are we going to use? Are we using an OpenAI API key? The LLM I'll come to; that's again an engineer's choice. We can use an open-source LLM or a closed-source LLM, and I'm going to do both: we are going to deploy a local RAG pipeline with an open-source LLM, and I'm going to use a closed-source LLM as well.
Do Gemini Gems use RAG? Yes; in fact, many of these products, like Perplexity, have a RAG pipeline underneath. There is a question by Sankit: what if the question is something like summarizing the whole document? Wouldn't it have to parse the entire doc? Yes, and for summarization there are multiple other things we can do. For example, go to Gemini, and I encourage all of you to try this. How many of you are aware of Gemini's context window? You must be aware of this, right? What I actually did is, along with ChatGPT, I passed the same document to Gemini, and Gemini says that this document does fall within its context window, because apparently its context window is on the order of millions of tokens. For Gemini, such a thing might actually work, because the context window is very large, and there are many reasons how Gemini has improved its context window. If any of you are interested in that, I think the answer lies in this blog, which is also a book, by the way; just check it if you are interested. Anyway, that was a digression. There are multiple questions in the chat related to retrieval from multiple documents. Whatever I have shown you right now is just one document, but you can retrieve from as many documents as you want; it does not need to be restricted to a single document.
Then there is a question: can we retrieve from a database or other formats, like images? We can; I'm going to come to that when I get to the data ingestion pipeline. Samir has asked whether hallucination is due to large context. Hallucination can happen for multiple reasons; in this case there will definitely be hallucination because of the large context, because the whole PDF will not fit in the context window, so the LLM will have to rely on its own pre-trained knowledge, and the answers it generates won't be grounded in this document. That's why we call it hallucination.
In the case of a RAG application, how important is the quality of the LLM? Extremely important, in fact. But again, there is a trade-off here, Amit. What is the trade-off? It is with respect to what the organization values: if the organization values privacy, you want an open-source LLM on your own server, and we are in fact going to use an open-source LLM on our local GPU. I will come to the trade-offs when we get to the engineer's choice section. How important is quality? That I already answered. What if the data is in tabular form? I'll come to the data part right now; all of you who have questions about the data format, that's the next point I'm coming to.
Does RAG help in improving named entity recognition? 100%, it does. In fact, for named entity recognition you have to do chunking in a very specific manner. We did an industrial project recently which involved named entity recognition; for that, you'll have to do what is called structural chunking.
Okay. So there are many questions which
I will slowly start answering. Many of
these questions will become clearer. But
one thing which I do want to address is
what was rag in 2021 and what is RAG
now. So in 2021 retrieval augmented
generation was this cool new thing which
had come to prevent hallucinations and
it's still relevant
because it still solves industrial
problems. But now just zoom out a bit
and take a look at retrieval augmented
generation in context of something which
is called context engineering. So now
there is this new field which is
emerging which is called context engineering.
engineering.
We talked about context a lot in RAG, and I already mentioned that the context window of LLMs is increasing. For example, what if the context window of all LLMs becomes 5 million tokens? It might happen in the next 2 years. Why does that matter? Because then you could just pass the entire PDF to the LLM. But again, there is a trade-off: even with Gemini I would not do this. Why not? Because Gemini charges you per token of input and per token of output. If you pass, let's say, 100 PDFs, even if the context window is large, you will incur a prohibitive cost. So even if the context windows of LLMs become large, RAG will still be valuable to reduce costs.
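As a back-of-the-envelope illustration of that trade-off (the per-token prices below are hypothetical placeholders, not real Gemini pricing):

```python
# Cost of one request, billed per 1,000 input and output tokens.
# Prices are made-up placeholders for illustration only.
def prompt_cost(input_tokens: int, output_tokens: int,
                usd_per_1k_in: float = 0.001, usd_per_1k_out: float = 0.002) -> float:
    return (input_tokens / 1000) * usd_per_1k_in + (output_tokens / 1000) * usd_per_1k_out

full_pdf_cost = prompt_cost(input_tokens=800_000, output_tokens=500)  # whole PDF in context
rag_cost = prompt_cost(input_tokens=3_000, output_tokens=500)         # only top-k retrieved chunks

print(f"full PDF: ${full_pdf_cost:.3f}/request, RAG: ${rag_cost:.3f}/request")
# full PDF: $0.801/request, RAG: $0.004/request
```

The exact numbers don't matter; the point is that you pay the full-document input cost on every single request, while retrieval keeps the input a couple of orders of magnitude smaller.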
Although the LLM can handle it from a performance point of view, it's still not in your best interest to pass the full document. It's like using an elephant to kill an ant: although you can do it, that does not mean you should do it. It will be costly for every single request. Why would you want to pass all the documents? You'll get charged per token.
But now think of RAG within this umbrella of context engineering. In the last class we discussed prompt engineering, right? That's intimately connected with RAG, and one more thing intimately connected with both of these is memory. Essentially, if you are interacting with this chatbot, the one I showed all of you: say the user logs out and comes back the next day. How does the LLM know what conversation happened yesterday?
imagine that I go to a nutritionist,
right? I go to a nutritionist and I ask
a question or I ask multiple questions.
I have a 1 hour session and I go back
again the next day. The nutritionist
will of course remember the thread of
our previous conversation
Or a therapist: if you go to a therapist, they of course have to remember what has happened in the past.
So when you talk about context engineering, memory plays a very crucial role. Here also there is a trade-off: the more memory you save for an LLM, the more context it has, and again the more cost. As the context size increases, the cost increases.
But when you think about RAG these days, you have to think in these terms: what's the context window of the LLM? Do I really need RAG? If the context window is large enough, like Gemini's, I don't strictly need RAG, but I can still use it to save costs, and then how much cost can I save? Why have I mentioned prompt engineering here? Because the success of your RAG pipeline also depends on how you prompt the LLM.
Sanjiv is asking: can you explain context engineering? Yeah. The best way to explain context engineering is this: if you want to make a production-level app like a RAG chatbot, how are you going to manage the different aspects that show up in the context? What are the different aspects? One is, of course, the information retrieved by RAG. One is the memory. One is your current state. Then second, where are you going to save this context? Are you going to save it in a vector database, or in a normal database like Postgres? Where are you going to save the embeddings?
I will come to most of these issues in this workshop. But context engineering is a much broader field now. In 2025, RAG has evolved over these four years: now we think about RAG in terms of context engineering. The main field is context engineering, and within it we start to think: okay, given this context window of the LLM and this application, what's the best thing I can do? Should I do RAG? Should I just do few-shot prompting by passing the whole PDF? How am I going to save memory? Should I save all the conversations as they are, or should I save a summary of the conversations?
Think about this: if you talk with someone for one hour, what do you remember afterwards? You don't remember exactly what that person said; you remember the summary of key points which your mind automatically forms. So you can use another LLM to summarize.
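A toy sketch of that idea: keep the last few turns verbatim and compress older ones. In a real system the `summarize` method would be an LLM call; here, purely as a stand-in, it just keeps each turn's first sentence.

```python
# Toy summary-based chat memory. `summarize` is a placeholder for an LLM call.
class ChatMemory:
    def __init__(self, max_verbatim_turns: int = 4):
        self.max_verbatim_turns = max_verbatim_turns
        self.turns: list[str] = []      # recent turns, kept verbatim
        self.summary: list[str] = []    # older turns, compressed

    def summarize(self, turn: str) -> str:
        # Placeholder compression: first sentence only. A real system
        # would call a (cheaper) LLM here instead.
        return turn.split(". ")[0]

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_verbatim_turns:
            oldest = self.turns.pop(0)
            self.summary.append(self.summarize(oldest))

    def context(self) -> str:
        # The text we would prepend to the next prompt.
        return ("Summary: " + " | ".join(self.summary) +
                "\nRecent: " + "\n".join(self.turns))
```

This is exactly the trade-off from the lecture in miniature: the summary costs fewer tokens on every future request, at the price of losing detail from old turns.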
So context engineering is best practices around context. And when you say context, it means many things: it means memory, it means retrieved information, and more.

Will RAG be relevant in the long run as LLMs themselves improve? Yeah, this I think I already answered, Amit. Let's say you are JP Morgan, and you want to make a chatbot specific to your data. I think RAG will still be relevant, because passing the entire PDF each time will be computationally and cost-wise prohibitive.
Okay. So let's table the questions for
some time because now what we have to do
is that we have to get started with the
first pipeline.
Many people have asked questions about document pre-processing, and I want to spend some time here.
Uh this is the whole pipeline which we
are going to build in this workshop. By
the way,
let me walk you quickly through the different elements. We are going to start with this nutritional PDF, and I'm going to have a section on that. Then we'll have a section on chunking. Then we have a whole section on embeddings. Then we have a whole other section on LLMs, whether open-source or closed-source.
And then finally we'll put all of this
together and run everything on a local
GPU. After this is done, we will do
production level rag and build this
website. So we do have a number of things to cover. At the pace we are going, I'm not really sure how much time this workshop will take. I'm very happy to answer all the questions, but from your side, please note that it may take more than 3 hours, because we have to do all of these parts. Let's take a call based on how much we cover today and how much we are able to cover tomorrow.
Okay. So the first step is data ingestion, and this is often the most neglected step in tutorials and video sessions everywhere, because it's not very cool. When I say cool: everyone talks about embeddings and LLMs, but the part which many should definitely be talking about is how you are going to collect the data and how you are going to store it.
Um, so let me ask all of you right if
you have this PDF, how will you collect
this PDF so that a Python interpreter
knows what to do with it? How will you
open this PDF and how will you read this
PDF in code?
So we need to ingest the data and store it somewhere. Our LLM is going to look at that data and then answer questions.
But right now it's in PDF format. We
humans can see it, right? But a Python
code needs to understand it.
Someone is saying: PDF to text, the pain of document parsing. Yeah, that point which you have mentioned, I'll come to. We will use a Python library to do document pre-processing, which here essentially means downloading and reading PDFs. Now, in this section I want to talk about three kinds of documents: documents which only have text, documents which contain images, and documents which contain tables.

If you have a simple document in PDF format, you can use packages to load it, and one popular package is PyMuPDF. I'm going to show you several packages and the way we decide which one to use for a given problem. This workflow which I'm giving you right now is exactly what we do internally when a problem comes in. So check this package.
Actually let me show the GitHub version
of this package.
So PyMuPDF is a traditional Python library for data extraction from PDF documents, and a really very robust library. Using this library you can pass in any PDF and open it.
Using this library you can also read different data. When I say read different data: you can read different pages. For example, this entire PDF can be ingested by the library, and then we can save what information is on every page. Now let me ask you this question.
Let's say this image comes up. What do you think the PDF extraction library will do at this point? Some say: it's all text, skip it. I'm looking for a specific answer, so first let me ask whether it will be able to deal with this image at all. Those who are answering no: that is not the correct answer. It will be able to deal with this image, because this is a digital image. There will be an image tag associated with it, and it will be downloaded in an image format. But here is the catch.
Let's say you get an image like this: there is a restaurant bill, and someone takes a photo of it and uploads it somewhere. Will the library I'm showing you deal with that? It will not read this type of image, and that is one key thing to understand. It will save the entire thing as an image, but it will not read the text present on the image, unless the text was typed through a digital form. If the bill is generated through digital software, and every field entered is a digital field, that will be taken into account by a tool like this.
But if you have an image which just has some characters on it, it won't. So what do I mean by digital? By digital I mean: let's say I go to an invoice software tool, I fill in entries there, and I generate a PDF from the tool. That is a digital entry, because every number is digitized.
Then a standard PDF extractor can also read that number and see what is mentioned. If it's digital, we can copy text from the PDF. But if it's not digital, like this one, we cannot copy text from the PDF, or at least normal Python libraries cannot. That is where we need libraries which can deal with something called OCR.
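A common rule of thumb for deciding, page by page, whether OCR is needed (my own framing, not something from the lecture): if the extractor reports images but almost no copyable text, the page is probably a scan. A sketch, with an arbitrary threshold:

```python
def needs_ocr(extracted_text: str, image_count: int, min_chars: int = 25) -> bool:
    """Heuristic: a page with images but almost no copyable text is likely
    a scan or a photo, so it needs an OCR tool such as Tesseract."""
    return image_count > 0 and len(extracted_text.strip()) < min_chars

print(needs_ocr("", image_count=1))                         # photo of a bill -> True
print(needs_ocr("Chapter 1: Macronutrients intro", 2))      # digital text with figures -> False
```

The inputs here are exactly what a plain extractor like PyMuPDF already gives you, so this check is cheap to run before deciding to invoke a heavier OCR step.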
The best open-source OCR library, as you can also see from the number of GitHub stars it has, is Tesseract. Tesseract is one of the most popular OCR libraries. But before introducing it, all of you should know why OCR is needed in the first place. For the current PDF we have, many of you gave the wrong answer: we don't need an OCR library here. This is just a simple image with no text on it; a simple PDF extractor can deal with it. You need OCR libraries only when you have images with, say, handwritten text, or images which have been scanned and uploaded into a document, which might be the case for many clients.
That is the place where you need Tesseract. Tesseract can extract handwritten text, digitally scanned text, and so on; it can extract text from images like this one.
That's the second option I wanted to show you in this data ingestion pipeline. Question from the chat: how would the text extractor know whether the image contains text or not? It would not know, right?
You mean Tesseract? Tesseract knows because the libraries it uses specifically look for text in the image. But if you use PyMuPDF, it will not know; it will just save the entire image as one image. The fruit image, right? It will not know whether there is text or not. So even if this image has text, PyMuPDF will save it as an image, but we will not know what text is written on it. That's the main issue. The image will be saved, that's not an issue, but all the information on that image will be accessed as one whole body; there will be nothing like "there are characters in this image" or "there is text in this image". The granularity will be lost if you don't use OCR.
Then comes the question of tabular data, right? How do you deal with tabular data? For that, I want to introduce a third library which has now become extremely popular, and I would say it's hands down one of the best libraries for language-modeling tasks. How many of you have heard of Docling?
Yeah. So Docling is relatively new; I think it's newer than all the other libraries I showed you. We have already used Docling in our industrial projects, and it's amazing. One reason Docling is amazing is that it is specifically meant for generative AI. What do I mean by that? Whenever Docling encounters a table in a PDF, the table is saved as a real table: rows and columns are preserved.
Docling can even convert a schema into a JSON format directly. And if any of you has used language models in production before, you know it's very important to retain certain elements in JSON format or in markdown format. So if you encounter a table somewhere, or any schematic or schema, that can also be analyzed by Docling and saved as a table. Further, Docling can be externally linked with an OCR tool like Tesseract, so you have OCR capability as well. You can extract tables very easily, and you can extract schemas very easily.
In fact, if any of you is interested, this is the Docling technical report, where they mention exactly how they manage to retain tables during extraction.
What happens if the input document has text, images, tables, and images with text? Exactly: what will you do in that case? If your text document is extremely messy, with images, tables, and scanned copies, then you can use Docling and you can use OCR along with it.
If your document is very simple, like what I have, you can use PyMuPDF. If your document does not have many tables but just has scanned images, you can use Tesseract.
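That three-way rule can be written out as a tiny decision helper. The function and its return strings are my own naming for the lecture's rule, not any official API:

```python
def choose_extractor(has_tables: bool, has_scanned_images: bool) -> str:
    """Pick a document-processing tool following the rule from the lecture."""
    if has_tables and has_scanned_images:
        return "docling + external OCR (e.g. Tesseract)"  # messy docs: tables AND scans
    if has_tables:
        return "docling"     # real tables, rows/columns worth preserving
    if has_scanned_images:
        return "tesseract"   # scans or photos: OCR needed
    return "pymupdf"         # simple digital text: fastest option
```

Writing the choice down like this also makes the trade-off explicit: you reach for the heavier tools only when the document actually demands them.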
So this is the first engineer's choice section. I mentioned at the start that I will have this section for all these parts: data ingestion, chunking, embedding, and open-source versus closed-source LLMs. So this is the first point where we hit the engineer's choice: given a project, what document-processing tool are you going to use? That's the first thing you need to understand,
and that depends on the type of documents you really have. Now, actually, before this there is one more step which I have not discussed, and that is related to scraping. It may happen that some client websites have PDFs which you can download, but in some cases the data is not in PDF format. Then you need to first scrape that entire data set, and then use the processing tools which I just mentioned.
How can we have a hybrid pipeline with all three? Samrat, if you want a hybrid pipeline, the best approach is to use Docling with an external OCR tool. If you go to the Docling documentation itself, they say they can handle diverse formats, which is good; they can export into various formats like markdown, HTML, and JSON; and most importantly, they have extensive OCR support for scanned PDFs and images. So this one library has all of these things if you are dealing with complex PDFs, or rather complex images.
One more thing we explored at Vizuara recently is Mistral OCR. How many of you are aware of this? They have a special model which they released recently, which is apparently supposed to be very good at OCR.
Okay, so one good question has been asked in the chat: what about this Miro board itself, if it's to be retrieved? Let me ask that question to all of you. Take this Miro board which I have: which tool would you use to retrieve text from it? Docling for sure would be good, but I would probably use Tesseract for this. The reason I would use Tesseract is this:
What do I have here? If you think about it, I have some images and some written text, and this text is very messy. So if you take screenshots of it, a normal PyMuPDF of course will not be able to handle it. But I don't have anything too complex: I don't really have any table. Even the table I do have is an image, not a real table, so technically I have no tables. I would probably take images of this and make them into a PDF. So I just have a PDF with images and text which will be scanned. I definitely need an OCR tool here.
Yeah, this Mistral OCR, I want to spend some more time on it because it's new; it just came out, I think three months back. We are trying it at Vizuara right now. I don't know how good it is yet, but it's supposed to be amazing. And there are several such LLMs which are specifically meant for OCR tasks.

Scraping. So let me tell you a bit about scraping now.
So let's say you go to the Mahindra and Mahindra website, and you are doing a project with Mahindra, and what they have told you is: I want to make a chatbot specific to, let's say, Mahindra Rise. But they have not given you any data. What will you do at this stage? How do you collect the data if the client has not given you PDF copies, or really anything about the data? The only thing you can do at this point is called scraping.
Yeah. So what you have to do is go through the different sections and use a scraping tool to scrape this data. I'm going to tell you about two or three scraping tools which can be used. The first is called Firecrawl.
Again, a very good scraping tool; it has around 50,000 GitHub stars. One good thing is that with Firecrawl you probably don't even need a PDF extractor tool, because it takes the entire website and converts it into LLM-ready markdown or structured data. That's one tool. The second, as someone mentioned in the chat, is Beautiful Soup: if you have HTML pages especially, Beautiful Soup parses them and extracts everything. And another tool is called Puppeteer.
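Before moving on to Puppeteer: the Beautiful Soup route just mentioned might look like this, assuming `pip install beautifulsoup4`; the HTML string below is a made-up stand-in for a fetched page:

```python
# Minimal Beautiful Soup sketch; the HTML stands in for a downloaded page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2>About Mahindra Rise</h2>
  <p>Company overview text.</p>
  <h2>Careers</h2>
  <p>Open positions.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```

Selecting only `h2` or `p` tags like this is the same idea as the tag-based filtering discussed next for Puppeteer, just done after the page has already been fetched.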
It's an automation tool, and with Puppeteer you can do some clever things. Someone mentioned named entity recognition, right? So what if you want to go through different sections, but you only want to take the headings or titles from each page? Doing that specific kind of extraction with a normal web scraper is a bit difficult. In Puppeteer, you can automate the scraping by specifying, for example, that only a certain font size or only header tags should be selected, or only paragraph tags, when you scrape. Puppeteer is installed as a JavaScript library.
java javascript library. Yeah. So it's an API to control Chrome
Yeah. So it's an API to control Chrome or Firefox. It can go through. So my
or Firefox. It can go through. So my question to all of you is this. Let's
question to all of you is this. Let's say if you have if a client has 5,000
say if you have if a client has 5,000 links, you cannot manually go and scrape
links, you cannot manually go and scrape each link, right? You you need an
each link, right? You you need an automation tool which goes through this
automation tool which goes through this link. It scrapes whatever is there. Then
link. It scrapes whatever is there. Then it goes through this link, scrap scrapes
it goes through this link, scrap scrapes whatever is there. Puppeteer provides
whatever is there. Puppeteer provides you that advantage. You can automate an
you that advantage. You can automate an entire workflow through puppeteer and
entire workflow through puppeteer and just sit back and get all the files
just sit back and get all the files downloaded but you need to define that
downloaded but you need to define that workflow very nicely.
Selenium is also good. But Jay, I found Puppeteer better, at least for us. We had a client project where we used Puppeteer: they had around 5,000 documents they wanted extracted through scraping, and manual scraping took a long time, so we used Puppeteer at that point.
How effective is Firecrawl when dealing with websites that require authentication? I'm not sure, actually, how it bypasses website authentication; I need to check that.
Yeah, manual scraping takes a huge amount of time. In fact, for the client project I mentioned earlier, we were doing manual scraping, but it was just too expensive in terms of time.

How good is Docling at extracting data in tabular format? Yeah, definitely good, Ashwini. The Docling tool I mentioned can extract data from almost anywhere: from images, from tabular formats, from PDF snippets, basically anything you want. But keep in mind that if any of you is actually working on an industrial project, sometimes clients don't give you data even in PDF format; then you have to do scraping on top of it.
Okay. Now what we are going to do is code the first part which we just saw. I will take the remaining questions in the chat, but first: for this PDF, all of us have identified that we will use PyMuPDF. Did everyone understand why we are using PyMuPDF for the current task? Can you type yes in the chat if you have understood why we are using PyMuPDF for the current project and not any other tool?
Okay, good. So now our coding journey is going to start. I'm going to share this Google Colab code file with all of you, and after the data extraction is done, we are going to take a small break. I know attention spans are a bit short, but no issues.
So this is the Google Colab code file. And someone has asked to share the PDF, right? Yeah, the PDF I actually shared at the start of the lecture itself; not as a document, but in the Drive folder I shared at the start of the lecture.
Oh yeah. So Jay, that's a great point you mentioned, which I definitely want to address; I actually forgot to. You might be wondering why PyMuPDF even exists: because it's extremely fast. PyMuPDF is 10 to 15 times faster than Docling; that's the trade-off here. There are some Reddit threads which actually argue about this; see, this one says Docling is at least 50 times slower than PyMuPDF. So if you have simple text like we do, don't use the powerful libraries unnecessarily; that will just be very slow for you. But that's a good point you bring up. I wanted to touch upon it, but it slipped my mind.
my mind anyway. So all of you have access to
anyway. So all of you have access to this notebook. Now the first thing which
this notebook. Now the first thing which you have to do is you have to go to
you have to do is you have to go to runtime and you have to switch to T4
runtime and you have to switch to T4 GPU.
GPU. We are going to start very slowly and we
We are going to start very slowly and we are going to start with the data
are going to start with the data injection pipeline. Okay. So before that
injection pipeline. Okay. So before that there is some a long text here which you
there is some a long text here which you can even read after this lecture is
can even read after this lecture is done. I have covered this all in the
done. I have covered this all in the initial portion of the class. This
initial portion of the class. This schematic also I have shared on the
schematic also I have shared on the mirro board. Now what we can do is
mirro board. Now what we can do is directly start from here requirements
directly start from here requirements and setup. So if all of you are
and setup. So if all of you are connected to T4 GPU, this notebook
connected to T4 GPU, this notebook should by the way by default already
should by the way by default already connect you to T4. And then just click
connect you to T4. And then just click on this. So the first two cells are
on this. So the first two cells are where we are installing the packages.
where we are installing the packages. These two steps will take some amount of
These two steps will take some amount of time. So I'm going to wait for here till
time. So I'm going to wait for here till all of you are running this. And
all of you are running this. And meanwhile, let me answer some questions
meanwhile, let me answer some questions in the chat which I might not have seen.
in the chat which I might not have seen. Can you share the PDF? I Yeah, I think I
Can you share the PDF? I Yeah, I think I shared it right now.
shared it right now. I am working on a project where I need
I am working on a project where I need to extract release documents from GitHub
to extract release documents from GitHub pages. Is Puppeteer a good choice? Yeah,
pages. Is Puppeteer a good choice? Yeah, definitely.
definitely. First, Spurs, I would encourage you to
First, Spurs, I would encourage you to explore fire crawl
explore fire crawl because Puppeteer is a very low-level
because Puppeteer is a very low-level library. When I say lowle, it directly
library. When I say lowle, it directly operates at JavaScript. So if you want
operates at JavaScript. So if you want to use puppeteer you need to be very
to use puppeteer you need to be very comfortable with JS code.
comfortable with JS code. Fire crawl abstracts many things. So
Fire crawl abstracts many things. So it's easier to use. If you are
it's easier to use. If you are comfortable with JS then I would suggest
comfortable with JS then I would suggest to go ahead with JS. Sure.
After setting up the data pipeline, the biggest challenge I faced was keeping changing data synced with the vector database; any suggestions? Great point, Prashant. I will come to this. I do have a suggestion, and in one word the suggestion is to use pgvector; we are going to use pgvector. Essentially, the best way to keep the database and the vector database synced is to keep everything in one place. The way to do this is to use a Postgres database with pgvector. We'll see that tomorrow.
Where can I get the link to this notebook? The link I have already shared. Oh, I shared the copy link; in this copy file I have removed the Hugging Face access token. Yeah, this is that link.
There is some question in the chat about this paper. This one, actually, I have not seen yet; let me check it. Yeah, it seems to be very highly cited, especially for vision-based document retrieval. One metric I look at to check how popular a tool is, is GitHub stars and how active the repository is. It seems to be quite active; the last commit was made 5 days back. That's a good paper; I'll definitely add it to my reading list.
I was asked in an interview: if we extracted anything using an LLM or RAG, how will we validate that it is correct? Again, a very good question. So, Prem, always remember that there are two types of validation: structural validation and semantic validation. When I say structural validation, it means checking whether the structure of your retrieved items is correct or not. One way to implement structural validation, which we have already seen in one of the previous lectures, is to use Pydantic, where we can check whether the format is correct. But for semantic validation there are two approaches: human as a judge or LLM as a judge. Either you have the ground-truth data and you validate against that, or you use a larger LLM to generate the ground truth and validate your extraction against it.
extraction with that. How do you keep track of good papers and
How do you keep track of good papers and make it a habit? Yeah. So that is a bit
make it a habit? Yeah. So that is a bit challenging. So one thing which has
challenging. So one thing which has honestly worked for me amit bit
honestly worked for me amit bit counterintuitive is LinkedIn. My
counterintuitive is LinkedIn. My LinkedIn feed is extremely well curated
LinkedIn feed is extremely well curated and that is also because I spend a lot
and that is also because I spend a lot of time scrolling through LinkedIn and I
of time scrolling through LinkedIn and I read mostly I'm on LinkedIn so I read
read mostly I'm on LinkedIn so I read things which I like. So algorithm picks
things which I like. So algorithm picks up on that. So everything which I get is
up on that. So everything which I get is from people who talk about new things.
from people who talk about new things. Um so I'm following some key set of
Um so I'm following some key set of people who whenever something new is
people who whenever something new is released they will post it.
released they will post it. So mostly I'm trying to avoid flashy
So mostly I'm trying to avoid flashy things on LinkedIn. There are like two
things on LinkedIn. There are like two camps. One camp is like whenever let's
camps. One camp is like whenever let's say context engineering right whenever
say context engineering right whenever context engineering is a thing then
context engineering is a thing then someone will make a post that five
someone will make a post that five reasons why you should learn context
reasons why you should learn context engineering. I avoid those but on my
engineering. I avoid those but on my feed there are people who write about
feed there are people who write about let's say context engineering what are
let's say context engineering what are the papers you should read then how is
the papers you should read then how is it different from so more informative
it different from so more informative and
and not too much flash it's getting a
not too much flash it's getting a challenge for me but I make it a point
challenge for me but I make it a point to at least read two papers per week
to at least read two papers per week and also implement those
I do have a to-read list; I'll share it with you. I only make it week to week, so I have it for this week. In this week's to-read list I have this Transfusion paper; it is on my to-read list for this week. And one more thing on my to-read list is the link which I already shared with you. It's this. In fact, I already ordered one of these books for our office, because I'm now encouraging all of our people to master GPU programming. I can't believe they made this free. It's amazing, but an extremely complex walkthrough of how LLMs utilize our GPUs. But I like ordering physical books, so I've ordered two copies for our office. This is also on my to-read list; I've finished two chapters. I'm going to make a course on this, because I have literally not found a single good course on GPU programming anywhere.
Uh, okay. So, how many of you have finished running up to these two steps at the moment? How many of you have finished installing the packages? You have, right? Okay, good. Now, the next step is document processing. In this part we are going to download the PDF.
PDF. If it does not exist it's fine. So one
If it does not exist it's fine. So one way is to just add it on the left hand
way is to just add it on the left hand side over here. But if it does not exist
side over here. But if it does not exist in this code we'll just go ahead and
in this code we'll just go ahead and download the PDF. And the next code
download the PDF. And the next code block is where we are actually going to
block is where we are actually going to read this PDF. So let's go through this
read this PDF. So let's go through this code block step by step. First there is
code block step by step. First there is a text formatter. So what it will do is
a text formatter. So what it will do is that it will make sure there are not
that it will make sure there are not empty spaces in any of the text which we
empty spaces in any of the text which we are reading. Then we have this open and
are reading. Then we have this open and read PDF. So this import fits which we
read PDF. So this import fits which we are doing right that's the pyu pdf.
are doing right that's the pyu pdf. This py mu pdfdf github repository when
This py mu pdfdf github repository when we do import fits that loads the
we do import fits that loads the package. Um and the way we open a file
package. Um and the way we open a file through pyu pdf is doing fits.open.
through pyu pdf is doing fits.open. Then what we are going to do is that we
Then we are going to go through every single page in the document and get the text from that page with `page.get_text()`. Then I'm going to format this text to remove empty spaces. And then I'm going to maintain a list: for each page, I'm going to store the page number, the number of characters on that page, the word count, the number of sentences, and the actual text. So what this piece of code is doing is maintaining a list called `pages_and_texts`, and each element of that list is a dictionary. The first element of the list is page one, and page one is a dictionary. Similarly, the second element is page two, and so on. So essentially I'm making a list, page number one, page number two, dot dot dot, right up to page number 1208, and for each page I'm storing these values: the page number, the counts, and of course the main thing, the text itself.
the text also I'm storing. So you can run this now and then what
So you can run this now and then what you can do is that you can
you can do is that you can just randomly print out two dictionaries
just randomly print out two dictionaries from this list. So I have printed out
from this list. So I have printed out the page number text for one page and
the page number text for one page and this is for second page. So you might be
this is for second page. So you might be wondering why is this minus 41 here,
wondering why is this minus 41 here, right? Why am I subtracting minus 41
right? Why am I subtracting minus 41 over here?
over here? The reason is that if you actually take
The reason is that if you actually take a look at our
a look at our uh book right, it really starts from
uh book right, it really starts from page number 41 or 42 here. This is where
page number 41 or 42 here. This is where our book actually starts. Yeah. Here. So
our book actually starts. Yeah. Here. So what is actually page number one
what is actually page number one is page number. So you need to subtract
is page number. So you need to subtract 42 pages actually to get to page number
42 pages actually to get to page number one.
one. So all of the pages which come before
So all of the pages which come before this are marked as negative since we
this are marked as negative since we subtract 41 and then page number one
subtract 41 and then page number one will rightly start from here.
And then we can just get a random sample. Our list is called `pages_and_texts`, and we can get a random element from it. Here we have got page number 1019; the number of characters is 1574 and the number of words is 270. Oh, by the way, we are also maintaining the number of tokens. For this, the simple thing we are doing is taking the number of characters divided by four; that's the number of tokens we are assuming. So each page dictionary will look something like this: the page number, the number of characters on that page, the number of words, the number of sentences, and the actual text. That's it.
And then you can actually get some statistics on the text. Just run this and look at the different statistics. For example, for this page the character count is 29, the word count is 4, the sentence count is 1, and the page token count is 7.25. And then you can get the overall statistics. This is the main thing we want to focus on right now: the mean row. On average, the pages have roughly 198 words, around 10 sentences, and around 287 tokens each.
Why is this important? Why are we looking at the number of tokens on each page? Can someone try to think why we are looking at the number of tokens on each page? There is an error which Krishna has got: `pages_and_texts` is not defined. Krishna, have you run this cell? Because we have defined `pages_and_texts` over here.
Now I'm going to the whiteboard, and the question I'm asking all of you is: we got these statistics, right? We got these statistics that each page has, let's say... Yeah, so eventually, say we want to take a page and convert it into an embedding vector, and say we use this model, all-mpnet-base-v2. The issue is that in very fine print they have mentioned that input text longer than 384 word pieces is truncated. So that is going to be an issue for us. If a page is more than roughly 384 word pieces, we cannot embed the entire page into a vector using this model, because then some information will unfortunately be lost.
So that's why it's just a better idea, whenever you're looking at pages, to check how many words and how many tokens they have on average. Here it seems that each page, on average, is 287 tokens, which is less than 384, right? So it is fine to go ahead: potentially each page can be embedded with this embedding model. Currently we have not decided which embedding model to use; we have not even decided whether one page equals one chunk. But potentially, if we decide that one page is one chunk and we want to embed each page, we can quite safely use all-mpnet-base-v2. That's the reason why we should actually keep track of how many tokens and how many words are on each page. The thing is, when you directly use RAG libraries like LangChain, all of this information is lost to you. They directly give you a parsed PDF, but you should see for yourself how many pages there are, and what the token count, word count, and sentence count on each page are, etc.
We are going to come to chunking right now, so don't worry about it; the next thing we are going to do is chunking. Rahul has asked a question: is RAG plus an SLM a practical combination? Yeah, definitely, because RAG is much better than fine-tuning in many cases. Anyways, we'll come to that after the lecture is done. That is the mean; what about the max? Sure, look at the max, but check the standard deviation also, right? The standard deviation is 140, so even with that, one or two standard deviations land around the 400-token length or so. So we are fine.
Sorry, I did not understand; what do you mean about LangChain? So, when you see RAG tutorials for LangChain or LlamaIndex, those tutorials are 10 to 15 minutes long and they completely skip this part. They already assume that you have a PDF, and everything starts at a much later stage. But in practice, this is what you have to do first. This is the exploratory-data-analysis equivalent: when we do a normal machine learning problem, we do EDA, right? You also need to do some EDA when you do RAG.
There is a question about the lecture recording. I will share the lecture recording and the Google Colab; I have already shared it in the chat.
Okay. So now we are going to take a break for some time, and then we are going to cover chunking. I definitely do want to cover chunking today, because it is one of the most important pieces of the puzzle, and nowhere on the internet, in any YouTube video, have I found a comprehensive explanation of chunking. There are blogs on chunking, and there are good blogs, but blogs can only take you so far, right? In the chunking section, first we are going to understand all the types of chunking in detail, and then we are actually going to code different chunking strategies from scratch and compare them with each other.
But we will take a break. Earlier I had planned one and a half hours for today and one and a half hours for tomorrow, but it looks like today itself will take around two and a half hours. I did not plan a three-hour workshop today, Sanjay, honestly, but it's good that you are asking so many questions. We have many more things left to cover, so it depends on your schedule. If any of you want to catch the recording instead, you can do that. Anyway, I will come back after five minutes to start the chunking part. If you are available, you can stay live to watch the chunking; if not, I'm going to upload the lecture recording anyway.
Uh, yeah, Samrat, when we do chunking, we don't strictly need the EDA later, but it's still good to see the number of tokens we have; it might change our intuition later. Okay, I'll come back after 4 to 5 minutes. It might take one to one and a half more hours today, so today we can finish chunking, and then tomorrow we can do embeddings, the LLM, and the final production part. Yeah, thanks guys. I'll come back around 9:35.
All right everyone, let's begin with the next part of today's lecture, which is going to be chunking. There is a reason why I have allocated a separate section to this: I believe it is one of the most important pieces of the RAG pipeline. Let me explain why chunking is important. Until now, we have processed the PDF; that part is done. Now, finally, this is our LLM. The LLM will get a prompt from the user, of course, but the LLM will also get some retrieved information, which comes from our knowledge base, the PDF.
What we are doing in the chunking section is essentially bridging this gap. We have processed the PDF; how do we go from this PDF to retrieving the bits of information which are important? There are two key steps to this: the first is chunking, and the second is called embedding. We are going to look at embedding tomorrow, but today let's cover chunking. So the way it works is, let's say, let me take a sample.
The first thing I'm going to do is divide this PDF into chunks. When I say chunk, a chunk can be, let's say, these are my chunks. This can be one type of chunking: imagine that in the whole PDF every sentence is one chunk. Or you can even have page-level chunking, where this entire page is one chunk, this entire page is another chunk, and so on. Now, let's say you do some sort of chunking and you have these chunks. Say you have split the document, the knowledge base, into 2,000 chunks.
For the retrieved information, the only portion you are going to select is some of these chunks, the ones most closely related to the prompt. You can select the top chunk, or you might select the top three chunks most closely related to the prompt. You can select one chunk, two chunks, or three; that you have to decide, but normally people select between 1 and 10 chunks. So let's say you select three chunks; these are the three chunks which will be passed as the retrieved information.
information. Now you see the problem here is that or
Now you see the problem here is that or I should not call problem.
I should not call problem. Your the quality of your output is going
Your the quality of your output is going to completely and solely depend on your
to completely and solely depend on your retrieved information and your retrieved
retrieved information and your retrieved information is going to completely
information is going to completely defend depend on what type of chunks you
defend depend on what type of chunks you have. Because if you have granular
have. Because if you have granular chunking like sentences, this will be
chunking like sentences, this will be just one sentence. This will be second
just one sentence. This will be second sentence and this will be third
sentence and this will be third sentence. So you'll just pass three
sentence. So you'll just pass three sentences. But if you have broad level
sentences. But if you have broad level chunking like pages then each chunk will
chunking like pages then each chunk will be one page. So you'll be passing page
be one page. So you'll be passing page one, you'll be passing page two and
one, you'll be passing page two and you'll be passing page three.
you'll be passing page three. So
So imagine this as the brain of the LLM and
imagine this as the brain of the LLM and uh so this is the LLM and this is the
uh so this is the LLM and this is the data.
data. the retrieved information which passes
the retrieved information which passes through the LLM will be from a list of
through the LLM will be from a list of chunks and only a subset of these chunks
chunks and only a subset of these chunks will be passed to the LLM. So from the
will be passed to the LLM. So from the engineer's perspective it becomes
engineer's perspective it becomes extremely important to decide how are we
extremely important to decide how are we exactly going to do the chunking. There
exactly going to do the chunking. There are so many ways right the the sky is
are so many ways right the the sky is completely open that we can do anything.
completely open that we can do anything. So now let me ask all of you. Let's say
So now let me ask all of you. Let's say this is the PDF
this is the PDF U 1,28 pages PDF. How should we go about
U 1,28 pages PDF. How should we go about chunking?
chunking? What will you have as individual chunks?
What will you have as individual chunks? Heading-wise? So Samrat is saying heading-wise, right? Essentially, I think what Samrat is saying is that wherever there are headings, you make that one chunk. So if "Carbohydrates" is a heading, make the carbohydrates section one chunk; if "Lipids" is a heading, make that section one chunk; if "Proteins" is a heading, make that section one chunk. I think that's what Dishant means by sections. Aditya has an interesting suggestion: let me not focus on the structure of the PDF. I will actually write down all of your suggestions over here. The first suggestion is based on the structure, so based on headings. What else? The suggestion by Aditya is with respect to similar topics, or semantics. Then JP says document structure; let me bucket this in the same segment and call it document structure for the moment.
If the sections are a bit big, divide them into paragraphs with a limited word size; the maximum number of tokens the LLM can handle sets the cap. So let me make the third category fixed. When I say fixed, maybe it's 10 sentences as one chunk, or 10 words as one chunk, or one word as one chunk; whatever it is, this is fixed-size chunking.
Intuitively, if this terminology of chunking were not known to me, or if I had not studied the retrieval augmented generation literature, I would have said that one chunk is one section, because when I read a PDF my mind thinks in terms of sections. So if a certain question is asked by the user, ideally you should retrieve a full section and give it as the answer. I don't want to retrieve just a few sentences; I want to retrieve entire sections and pass them on. That's why I think chunking should be done section-wise. That can be one example. There are also people mentioning recursive chunking plus overlap; for some of you this might not be clear, so I'll come to that eventually. Okay, so that's the intuition which comes to my mind.
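To make the section-wise intuition concrete, here is a minimal sketch of heading-based chunking. The heading convention (a line of plain Title Case words with no punctuation) and the sample text are assumptions for illustration only, not part of any real pipeline:

```python
import re

def chunk_by_headings(text: str) -> list[str]:
    """Split a document into chunks, one per section heading.
    Assumes a heading is a line containing only letters and spaces,
    starting with a capital letter (an illustrative convention)."""
    # Split at the start of any line that looks like a heading,
    # using a zero-width lookahead so the heading stays in its chunk.
    parts = re.split(r"(?m)^(?=[A-Z][A-Za-z ]+$)", text)
    return [p.strip() for p in parts if p.strip()]

doc = """Carbohydrates
Carbs are the body's main energy source.

Lipids
Lipids store energy and build membranes.
"""
for chunk in chunk_by_headings(doc):
    print(repr(chunk.splitlines()[0]))  # 'Carbohydrates' then 'Lipids'
```

Each chunk starts at a heading and runs until the next one, which is exactly the "one section = one chunk" idea from the chat.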
Now let's go through the five types of chunking which we are going to see. Towards the end we will also have an engineer's-choice section on which chunking strategy to use, and then we will code the different chunking strategies and actually see their similarities and differences. My hope is that after this section all of you understand the trade-offs. At the start of the lecture I mentioned trade-offs, right? There are a lot of trade-offs between different chunking strategies, there is no one-size-fits-all approach, and different chunking strategies definitely lead to different results.
In fact, within our company we have actually made a PDF of chunking strategies; I'm trying to find that PDF right now, just a minute. Yeah, here it is. I'll share it with all of you. This guide is specifically about the different types of chunking strategies and which chunking strategy to use when. This is one of the most important things for engineers to understand, and my main purpose with this workshop is to show how to make engineering decisions like this. But to make such decisions, first we have to understand what the different chunking strategies are. So let's start. Before evaluating different chunking strategies or coding them, all of you need to understand what exactly is done in each. Some chunking strategies are easy to understand, some are slightly more involved, but each of them serves a specific purpose. First, let's go with fixed-size chunking.
In fixed-size chunking, here is what is actually done. Let's take a PDF, say this legal services agreement. Suppose you are making a RAG system for the legal domain, and you have a PDF which looks like this, with responsibilities of the law firm and the client, and so on. In a fixed-size chunking strategy you specify that every chunk will be of a fixed size, let's say 200 words. All my chunks are going to be 200 words; I'm not going to look at anything else. My chunk one is going to be 200 words, my chunk two is going to be 200 words, that's it. And I can also have a slight overlap between these chunks, so as to make sure that some amount of context is retained.
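As a minimal sketch of that idea (the 200-word window and 20-word overlap are just the illustrative numbers; the function name is mine):

```python
def fixed_size_chunks(text: str, chunk_words: int = 200, overlap_words: int = 20) -> list[str]:
    """Split text into word-based chunks of a fixed size, with overlap
    so that some context carries over between neighboring chunks."""
    words = text.split()
    step = chunk_words - overlap_words  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = fixed_size_chunks(doc)
print(len(chunks))            # 3 chunks for a 500-word document
print(chunks[1].split()[0])   # second chunk starts 180 words in: word180
```

Note there is nothing semantic here at all: the window boundaries fall wherever the word count says, regardless of sentences or sections.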
But can you tell me what the drawback of this approach is? What are the advantages and disadvantages, according to you? Here again, try to think from first principles. Imagine you are making this RAG system as a chatbot: a customer asks something about an agreement and your chatbot should answer. The retrieved information will come in chunks. Why, or why not, should you go ahead with a fixed chunking strategy like this, with each chunk being 200 words?
From the chat: incomplete responses; context lost; sentences cut in between; lacks contextual overlap.
So yeah, let's take a look at this example itself, where the text is being cut. "Responsibilities of law firm and client" should ideally be one full section, and I want this entire thing passed into my retrieved information. But because of the chunking, say one chunk contains the heading "responsibilities of law firm and client". When a user asks the chatbot what the responsibilities of the law firm and client are, that chunk will be retrieved. But the following chunk does not have anything directly tying it to the question; it carries some context because we retain some overlap, but most of it belongs to other sections. So that chunk will not be retrieved, which means we are actually losing out on a chunk's worth of information that is completely relevant to our current section.
That's one major disadvantage of fixed-size chunking: chunks can be made in the middle of important paragraphs, even in the middle of sentences. A good question is asked: won't embeddings create a match for similar text? Embeddings will create a match, but what if your chunk is formed at a place where there is nothing with respect to the question being asked? Say it's just two sentences at the end of a paragraph, where the context of what comes before is lost. If your chunk unluckily starts at a point where the information of the section title is lost, then that paragraph won't be retrieved. And currently I'm just showing a small paragraph; if you have a huge paragraph related to a section and you cut a chunk halfway through it, some of your information can be lost from the retrieved chunks.
Can the chunks be linked? Not in this scheme; when you say chunks are linked, that leads to structural chunking, which will come later. In fixed-size chunking this is the main issue. So then why would anyone do fixed-size chunking? Can you think of an application where people do fixed-size chunking?
So one lesson all of us just learned: if your document has structure like sections, subsections, etc., never go with fixed-size chunking, because it might cut a section halfway. Fixed-size chunking is used where you want fast processing. Say you have millions of documents, or hundreds of thousands of documents, and you want a quick strategy without too much overhead. If you want the processing to be quick, you go ahead with fixed-size chunking, because it will just be very fast. If you are collecting information from Reddit or from Twitter, mostly the information will be disorganized: threads, comments, no clear structure, no clear subheadings, random messy information, but a huge amount of it. If you have random, messy, chaotic information which is huge in volume and you want to process it quickly, you can use fixed-size chunking with some overlap.
So these are the advantages and disadvantages of fixed-size chunking. Quick, fast processing is the advantage. The disadvantage is that it has semantic breaks and the context is lost. The strategy is best used in scenarios where documents are large and numerous and a quick segmentation is needed without requiring deep understanding of the context. For instance, if you are processing millions of web pages for indexing and can tolerate some loss of coherence in chunks, fixed-size chunking is a viable approach. Also remember that as the size of your chunk increases, your embedding model size needs to increase proportionately; keep in mind that's a trade-off with larger chunks.
One other use may be in streaming or sequential processing. Yeah, correct, as it's easy to handle streams of text without worrying about sentence breaks. Agreed. Another use is a book like The Fountainhead. Yeah, sure. Take a look at this book: it's a huge book which has no structure, no headings, no subheadings. For that kind of text it might be a good idea to go ahead with fixed-size chunking, and if you have a thousand such books, then definitely go ahead with fixed-size chunking. So let's say you're doing a project on Project Gutenberg and your task is to transcribe all the books and come up with some sort of a RAG system; it might be better to go ahead with fixed-size chunking.
Okay, that's the first strategy. The second strategy is what someone already mentioned in the chat. Again, I'm taking the same example which I showed you over here.
Now let's say you take a book from here, the same book we saw. The main issue with fixed-size chunking is that although it's fast, it does not retain anything about semantics, anything about meaning: nothing ties the content of one chunk together. Semantic chunking tries to solve this issue.
The way semantic chunking works is that first you have to define a level of organization; by that I mean, for example, the sentence level. If I want sentence-level organization, I take the first sentence. Let's say chunk number one is like a box. I take my first sentence and add it to the box. Then I take my second sentence and compare the embeddings of the two sentences: sentence one is converted into a vector embedding, sentence two is converted into a vector embedding, and I check whether the similarity score between these two vector embeddings is greater than a threshold, let's say 0.8. If it's greater than the threshold, I know that both sentences mean roughly the same thing, so I add the second sentence to the box as well, because it passes my similarity criterion. Then I go to the third sentence, embed it into a vector, and compare its cosine similarity with sentence number one. If it again passes the threshold, I add it to my chunk. I keep doing this for sentences which have good cosine similarity with my original sentence, and the moment I encounter a sentence whose cosine similarity is less than the threshold, I stop this chunk. That's my chunk one, done. Then I move on to chunk number two.
What this ensures is that every chunk has semantically similar information. Let's say the initial section is all about a drama happening within a family; I want this chunk to run until that drama finishes. Then whenever a certain question is asked, I will only retrieve the chunk whose semantic meaning matches. That's where semantic chunking actually has an advantage over fixed-size chunking: it takes the meaning into account, so I know that every chunk will have similarity in meaning.
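A minimal sketch of that loop, assuming a toy bag-of-words stand-in for the embedding model (a real system would call a sentence-embedding model here, and the threshold is just an illustrative value):

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Stand-in embedding: a bag-of-words count vector.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.8) -> list[list[str]]:
    chunks, current = [], []
    anchor = None  # embedding of the first sentence of the current chunk
    for s in sentences:
        e = embed(s)
        if current and cosine(anchor, e) >= threshold:
            current.append(s)           # similar enough: same chunk
        else:
            if current:
                chunks.append(current)  # close the previous chunk
            current, anchor = [s], e    # start a new chunk anchored on this sentence
    if current:
        chunks.append(current)
    return chunks

sents = ["the forest has tall trees",
         "the forest has old trees",
         "we filed the tax return today"]
print(semantic_chunks(sents, threshold=0.5))
```

The first two sentences land in one chunk, and the off-topic third sentence starts a new chunk; every comparison is against the chunk's first sentence, exactly as described above.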
Amit is asking: what chunking strategy is used in NotebookLM? NotebookLM definitely uses, I think, chunking which takes semantics into account, so maybe something similar to the semantic chunking we are looking at right now.
If the sentences have high similarity, isn't it better to drop one of them? Good question, but you will never be sure why the similarity is high, so you might lose information that way. Two sentences can mean something similar yet sit in different contexts, and you still want both. Say you're talking about forests: one sentence can be about trees in the forest and another about taking a trip to the forest, and their vector embeddings may still match. You wouldn't want to discard one in favor of the other, right?
So semantic chunking's main advantage is, of course, that it maintains coherence, and it is used in settings where the integrity of ideas is very important. For example, let's say you are listening to a parliament debate and you have collected the transcripts. You want to make a RAG system where you ask a question and identify what was discussed in the parliamentary debate. Now, I don't know if you have seen them, but parliamentary debates are among the most unstructured; they can get chaotic, they can get messy. But there is a flow of ideas in these debates: someone says something, someone else negates it. Usually we don't know how long that negation goes on, and we don't have a clear split, but the ideas are there, and clearly the ideas belong in buckets. For a RAG system over such transcripts, I would go with semantic chunking, because I would want to preserve the integrity of an idea in one chunk for as long as it is discussed.
This is very similar to educational transcripts. Say you watch a video, this same video even, and make a transcript out of it, and I talk about four or five things in the video, but I have not added timestamps or anything. How will you know the key things discussed in the video? The only way is to maintain the semantic integrity of the chunks. You cannot do fixed-size chunking here; you have to maintain semantic similarity. Then you will know: okay, this section talks about byte pair encoding, this section talks about the size of language models, this section talks about emergent properties, and so on. Otherwise there is no way to know from the transcript. So there are a number of cases where maintaining semantic similarity within one chunk plays to our advantage.
And again, the drawback: there is no free lunch. The main drawback is that this kind of strategy is complex and takes a lot of computational power, because you have to convert every single sentence into an embedding; that's not easy. Another major issue is that you have a hyperparameter here, the threshold. In fixed-size chunking you also have a hyperparameter, the number of tokens in a chunk, but there you at least have an idea of what 200 words means; here you have no clue what the threshold should be, it's completely vague. Another thing is inconsistent chunk sizes: some chunks might be very large, which can be an issue for our LLM context, and so on. Let me see if there are any questions in the chat.
"You took one sentence at a time, and then I lost how semantics is maintained. Do you scan the entire document?" Yeah. Basically, Samrat, it is done sentence by sentence. You take sentence number one and add it to a chunk. You keep adding subsequent sentences to the same chunk as long as their cosine similarity with the first sentence is above a certain value. The moment you encounter a sentence whose cosine similarity is not higher than the threshold, from that moment you start forming the second chunk, then the third chunk, and so on: you sequentially go through your entire text and keep forming chunks.
chunks. Does semantic chunking require
Does semantic chunking require premputing embeddings? Is it done at
premputing embeddings? Is it done at runtime? There are both options actually
runtime? There are both options actually uh
uh nowadays actually people have started
nowadays actually people have started using runtime querying so you can do
using runtime querying so you can do that during runtime but most rag
that during runtime but most rag applications I have seen they maintain
applications I have seen they maintain embeddings
What happens if the idea in chunk one comes up again somewhere later? That's a great question, actually. Yeah, then unfortunately that needs to be a separate chunk. But if the later idea is close to the first one, and you're retrieving four or five chunks, hopefully both of those chunks show up, right? Let's say you make a chunk which carries a certain idea, and that idea comes up again at the end of the document. If both ideas are very similar, both of those chunks will be retrieved in the end.
Wouldn't it be a better strategy to check the cosine similarity against all previous sentences? It would be, I agree, but the time also increases, right? If you want to check the semantic similarity with all the previous sentences, it's a bit time-consuming. You kind of hope that cosine similarity behaves roughly transitively: if the dot product of two vectors a and b is high, they point in similar directions, and if b dot c is also high, then b and c have similar angles too. So you can say that a and c will also be somewhat similar to each other.
Is it based on the assumption that the next line will be semantically similar to the previous one? Yeah, that is also true. That is the same thing which is exploited in, what's the word for it, the idea that neighbors usually carry similar meaning, right? Because you would not usually have random lines placed next to each other.
Samrat has said, "So should we have structure?" Yeah, correct. The level of organization which you mentioned can also be at the paragraph level in semantic chunking. If your sentences are not varying too much in meaning, you can have one big paragraph as one chunk. But then you will have to do structural chunking followed by semantic chunking, which is done; I'll come to that later. So that naturally brings us to, actually, first let me cover structural chunking. Structural chunking, according to me, is the most intuitive form of chunking, and it can be combined with semantic chunking as well.
Structural chunking is essentially like this: let's say you are considering a shareholder letter, right? If you take a look at the shareholder letter, the company is going to release it quarterly with the same kind of sections. Structural chunking takes advantage of that: we are going to split the report exactly at these section boundaries. The first chunk is going to be the letter to shareholders, the second chunk the introduction, the third chunk the company overview, the fourth chunk the financial statements, the fifth chunk the notes to the financial statements, and the sixth chunk the conclusion and outlook. That's it.
It's extremely simple, right? And believe it or not, in industrial problems structural chunking solves many issues, because, whether you are in the financial sector or the medical sector, if you are looking at a very specific RAG application, it is very likely that the document structure stays the same across multiple documents. For example, if you're building a conversational therapist RAG chatbot, the therapist might be making notes after each session in a specific format: an introduction, the key things discussed in the session, key takeaways. So as long as you know the structure of your documents, structural chunking is the most intuitive and the best thing you can do when you receive any problem, as long as the problem is reasonably structured. If it's messy, like what we have seen here, then of course it will not work. But if, let's say, you have hospital records, or stock price information in a specific tabular format or a specific structured format, you can always leverage that structure. The more you leverage the structure in your documents, the more grounded your retrieval augmented generation system is going to be, hands down, at all times. So the first strategy, which also comes naturally to me, is just to go to structure-level chunks, right?
But then, what are the issues with structural chunking? Can you think of any issues with structure-based chunking? In fact, for many of you, when you saw this document, the first thing which intuitively came to mind was structure based on sections and subsections; that's exactly structure-based chunking. What are the issues with this?
Yeah, the issue with this is that one chunk can be very large, because what if in one particular shareholder letter the introduction section is five times longer than in the others? Then the chunk size becomes very large, that chunk will be retrieved and passed to the language model, and it will be added to its context. So the context window of the language model again becomes very large, and we run into the same problem we set out to solve. So the advantage of the structured approach is that it's very good for documents whose data comes in a structured format, with sections, subsections, and so on. But its weakness is that it can produce chunks which are huge, which might increase the context length fed to the LLM, and that might again lead to more hallucinations.
How many of you actually know what metadata is? So why have I mentioned metadata over here, in structure-based chunking? Yeah, data about data is metadata, essentially. If I know that a chunk belongs to a particular structure, then when I store that chunk I also store its metadata: if I store an introduction chunk, I also store the fact that it is an introduction chunk, because I might refer to it later. So later, if I want to collect all the introductions, I can look up this metadata. Structure-based chunking therefore has this added advantage: since you know which chunk corresponds to which structure (for example, that this chunk corresponds to the company overview), you can store that as metadata and access it later downstream in your application if there is a need.
Now, Samrat had also asked whether, in semantic chunking, instead of having sentences as the unit, the level of organization could be at the paragraph level: one paragraph is added, then the semantic similarity with the next paragraph is compared. If you want to take that approach, you are essentially combining structural chunking with semantic chunking, because first you will use structural chunking to find the paragraphs, then you will use semantic chunking on top of that. So that's a combined approach, and normally, if one type of chunking fails, it's very common to combine two chunking methods.
So the main disadvantage of structural chunking which we saw, that some chunks can be too large, is solved by recursive chunking. Recursive chunking is an amazing chunking strategy because it's kind of the best of both worlds: it exploits the structure of documents, but it also makes sure that chunk sizes remain consistent.
How does it do it? Let's take a practical example. Say you are building a RAG chatbot which analyzes research papers. Now, you know that if you are analyzing research papers belonging to a particular journal, the structure is going to remain the same, right? If I'm looking at Patterns, they don't accept papers if the structure is too different. So I know each paper is going to have some kind of an introduction section for sure, a summary section, a results section, and finally a conclusion and discussion section. Then towards the end there will be references, and then it will end. You know this is the structure, but in some papers the results section can be much longer than in other papers. So you cannot just use plain structural chunking, even though the simple thing would be to use structural chunking and make each section one chunk.

What recursive chunking does is this: first, I make chunks based on my sections. So the introduction section will be one chunk, the results section will be one chunk, and so on. Then I look at each chunk's size against a maximum chunk size that I define; say the maximum chunk size is 500 tokens. If one of my chunks is greater than the maximum chunk size, I chunk it again. How will I chunk it again? I have to define one more level of chunking. So if the results section becomes too big, I'll chunk it at the paragraph level, and each of those paragraphs becomes a separate chunk. Then I go to the paragraph level and again check whether each chunk's token count exceeds my maximum size, and if some paragraph is still too large, I chunk it further to another level, which is my sentence level, and then I again check the number of tokens.

If you think about it, it's like a Russian-doll approach, right? You take the largest level of chunking, section-level chunking; within that you have paragraph-level chunking, which you apply only when a chunk exceeds the maximum chunk size; and within paragraph-level chunking, if a chunk is still too large, you do your final level, sentence-level chunking, where again you check whether the chunk size is greater than the maximum. Since we are applying different levels of chunking one below the other, this method is also called recursive chunking.
And the reason recursive chunking is the best of both worlds is that it preserves structure for sure, but it also makes sure that none of my chunks are too large, so it won't blow up my context size at all.
Let's see. What if we combine structural with semantic? This we already discussed. Is it possible to apply chunking strategies to images and videos in multimodal models? David, that's a great question. It is definitely possible to do that. Think of images and videos in terms of tokens, right? Just like I'm talking about tokens for text, images and videos also have tokens; they have different tokenization schemes, and the tokens will be at an image level. So there you can use similar strategies, but the strategies are a bit different from what we are currently covering.
than what we are currently covering. Can you define chunk size? Yeah. Yeah.
Can you define chunk size? Yeah. Yeah. So basically
So basically one hyperparameter we have to define
one hyperparameter we have to define here is that
here is that I will define a maximum chunk size
I will define a maximum chunk size myself.
myself. Before I do recursive chunking I have to
Before I do recursive chunking I have to define a maximum chunk size. Let's say
define a maximum chunk size. Let's say that's going to be 500 tokens.
So at every stage I'm going to compare whether my chunks are greater than this size or not. If I do section-level chunking, for each section chunk I check its number of tokens. If it's greater than 500, I do the second level of recursive chunking, which is the paragraph level. Then again, if a chunk is greater than 500, I do sentence-level chunking.
How is this different from fixed-size chunking? It's completely different, right? Because in fixed-size chunking, nowhere are we thinking about the structure. In fixed-size chunking I just start from the beginning, and if my fixed size is 50 tokens, I take the first 50 tokens as my first chunk, the next 50 as my second chunk, the next 50 as my third chunk. Here, what we are doing is structural chunking first: we break the document down into sections. If no section has too many characters or tokens, then our chunking stays at the section level; only if a section is larger than the token limit do we break it down further. Did everyone understand how this is different from fixed-size chunking? Recursive chunking is completely different from fixed-size chunking. There is no similarity at all between them, because in recursive chunking we are not fixing the exact chunk size we want; we are only specifying the maximum chunk size.
There is a question: how is the semantic link saved in this chunk? It's not. In structure-based chunking and recursive chunking, the semantic notion is not maintained at all.
Are there libraries to do this? Yes, definitely there are libraries. Both LangChain and LangGraph provide utilities for recursive and structural chunking. But today we are going to implement all of these chunking strategies from scratch in Google Colab. So, Amit, it is not actually maintaining semantics, because nowhere does it know what is mentioned in the section, subsection, paragraph, or sentence.
subsection or paragraph or sentence. So to those people who asked the
So to those people who asked the question about fixed size versus
question about fixed size versus recursive chunking is it clear how it is
recursive chunking is it clear how it is different? I think sankit asked and
different? I think sankit asked and Krishna also asked if that is your main
Krishna also asked if that is your main question it means there is some
question it means there is some conceptual gap.
conceptual gap. If there is no link, I might well as
If there is no link, I might well as look at the but there is a the the the
look at the but there is a the the the section is maintained, right?
section is maintained, right? So you understand the benefits of
So you understand the benefits of structural chunking
structural chunking the sections are maintained. So think of
the sections are maintained. So think of recursive chunking as a supererset of uh
recursive chunking as a supererset of uh structural chunking. Which means that if
structural chunking. Which means that if you understand the benefits of
you understand the benefits of structural chunking by default you
structural chunking by default you already understand the benefits of
already understand the benefits of recurs recursive chunking
recurs recursive chunking because it is structural chunking but it
because it is structural chunking but it is a bit more clever because it ensures
is a bit more clever because it ensures that each chunk is not greater than a
that each chunk is not greater than a particular size.