YouTube Transcript: Production level RAG Workshop: Part 1
Video Summary
Core Theme
This workshop introduces Retrieval Augmented Generation (RAG) by building a practical, from-scratch pipeline. It emphasizes understanding the underlying engineering trade-offs and complexities beyond introductory tutorials, aiming to equip participants with the knowledge to make informed decisions in real-world RAG implementations.
Video Transcript
Okay, so let's get started with today's
workshop. I'm really excited for this
because uh I have planned for this
workshop for quite some time and this is
the first time I'm actually conducting
this live. So we have two sets of people
uh who are in the participants today.
One is who are already present in our
live classes and uh one who are just
specifically attending this workshop. So
So
I'm still admitting a few participants.
Yeah. And regarding the lecture
recordings, the I'm recording today's
lecture and tomorrow's lecture as well.
And we'll be immediately sharing with
all of you once the recording is done.
So no need to worry about receiving the
recording after the lecture. For the
people enrolled in the live classes,
I'll upload it to the dashboard as
always. And for the others, I'll send
you through email.
Uh so the reason I thought of conducting this workshop is because the way I have thought about retrieval augmented generation, or RAG, has changed a lot over the last two years. There is a question: do we need to code along with you? Yeah, definitely, you will need to code. So I would highly recommend not attending this workshop on a phone, because it's just not a good experience. We'll be coding everything from scratch, and we'll be doing everything on Google Colab.
So let me get started with the lecture
objectives. And when I say lecture,
we'll have two lectures actually within
this workshop.
Uh I'll first tell you what all we will
try to accomplish and then we'll start
going through each and every single
thing in detail. So I'll tell you my RAG
journey. Right? So RAG stands for
retrieval augmented generation. No need to be scared by this name; the name itself looks a bit complex, but we'll see what all of these words actually mean. Before that, I'll tell you my experience with RAG. There are some questions in the chat: what do you mean by students attending live classes? There is a live batch which is going on in the hands-on LLM series, and this workshop falls in the middle of that course.
I will cover agentic RAG, but not in this workshop; I'll cover it in subsequent lectures of the live classes. Okay. So if you take a look at RAG tutorials, you'll see that there are a number of tutorials which pop up and which are small. There are some tutorials which are 10 to 15 minutes in nature, there are some tutorials which are just 5 minutes. There are actually RAG tutorials which teach you how to build a chatbot in 5 minutes. Then there are these 20-minute tutorials, 25-minute tutorials. And when you
watch these tutorials, you feel that
okay this is simple. I have understood
what is retrieval augmented generation.
Um but it's actually not the case. Only
when I started solving industrial
problems then I realized that the whole
pipeline is actually far more
complicated than what is shown in these
introductory videos. There are several
things which no one ever talks about.
Right? So for example uh chunking.
Chunking is very briefly mentioned in
introductory videos, but no one codes
through chunking and actually teaches
engineers which chunks to use at what
time. If you don't know these
terminologies, don't worry. I'm going to
cover every single aspect in detail.
Then second is file parsing. In most of
these tutorials, it's already assumed
that you have the file, but in fact,
that's one of the most important steps
and frankly quite challenging.
Then another aspect which is neglected is evaluation, which I'm calling evals, and which in industrial settings has become critical. Because okay, you build a RAG pipeline and you submit it to the client or use it for your internal workflow, but is it working or not? How are you continuously monitoring whether your RAG pipeline is delivering good results or not? And more importantly, there is the question of
embeddings. Right? In all of these short
tutorials, they randomly use vector
databases or vector stores without even
thinking about why do we need to use
vector stores? Can we just do embeddings
in PyTorch? And what are vector stores?
Uh we are going to see all of these
today. In fact, at several portions of
this tutorial, I'm going to have this
section called engineer's choice.
This engineer's choice section I have specifically curated based on my own industrial experience. So unlike all of
these tutorials, I don't want to tell
you that go ahead and use this, go ahead
and use that. But I'll be making you
aware of the trade-offs. Um and when I
say trade-offs, I mean how you should
select the tool for your particular use
case. So my goal after this workshop is
that when you face these trade-offs in
industry or wherever you want to
implement it, you should be in a
position to decide what's the best tool
for me.
I'm going to show you the different
trade-offs which I encounter in our
industrial problems.
Um and then we are going to assemble a whole RAG pipeline from scratch. And when I say from scratch, we are not going to use a library like LangChain or LangGraph today or tomorrow, because
all of that will seem very simple to you
after going through this workshop. We
are going to code everything from the
ground up.
Um and while doing that we'll see what
are the different engineering choices
you need to make. While doing that I'll
also show you the different packages and
libraries which are coming up which are
useful in industrial settings. So look at this workshop not as a toy series but rather as an industrial-level workshop. When you go to industry, things are not black and white, right? They are mostly gray. There is no single right solution, but the engineer who stands out is the one who can figure out what's the best solution for the given problem. That's what I want to teach you. So if you have any questions at any point, ask me. My goal is for you to understand the nuts and bolts of RAG in detail. It should not just be a terminology where you think, okay, RAG is easy, I can cover it in 10 minutes. After this, all of you will be able to build chatbots, and hopefully you will be able to understand the trade-offs when we build pipelines.
So what's our end goal? Our end goal
after this workshop is that we want to
build an application such as this where
if you see, this is a RAG-based nutritional chatbot, and it's built entirely from scratch.
There are also two types of RAG systems: one type which directly provides the answer, and one type which, when it provides the answer, also provides citations.
So we are also going to look at how to provide references, how to provide citations, and what it means when it says 56% match or 55% match. We will not spend too much time on the back-end and front-end coding; we are going to do it through Lovable. So all of us might get different-looking websites at the end of this workshop, but that will be the fun of it, right? We'll share the websites which all of us have obtained.
U yeah and so the lecture will be such
that I will explain many aspects through
a whiteboard
and then there are several code files
which I have designed. All of these code
files I'll share with you at
specific intervals
within this workshop. All of them will be on Google Colab. Tomorrow we are going to use some external tools. The first external tool we'll need is Supabase. How many of you have heard about Supabase, by the way, or used it before? It's fine if you have not heard of this tool; I'm going to show you what it is, because it's used a lot in production-level settings these days. So Supabase is one tool we'll need, and the second is Lovable. For everything else, I believe that even if you only have the T4 GPU, which is provided for free through Google Colab, you'll be able to follow along in this workshop.
As the fundamental, or I should say guiding, principle of most of this lecture, the prompt engineering rules which we saw in some of our previous lectures are going to be important.
So let me just introduce the seven key
elements of writing an effective prompt.
First: the rule is called PICFATD, which basically means that in a prompt you have to define many things instead of just writing a quick prompt. The first is persona, then you have the instruction, then the context, then the format, then the audience, the tone, and finally the data. These are the seven key elements of an ideal prompt, and we are going to use this when designing RAG
pipelines. In fact, the base of
everything which is to follow such as
RAG, then later we'll look at agentic
workshops, we'll look at MCP. The base
of all of that is a good prompt.
U so please keep these seven things in
mind when writing an effective prompt.
Don't just write something quick. And I'll stress that today, when we are building this chatbot project as well, it is extremely important that you spend time writing the prompt, because think about 5 years into the future: if English is going to be the new programming language, then prompt engineering will only become more important.
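To make this concrete, here is a rough sketch of a prompt that touches all seven elements. The field labels, wording, and helper function are my own placeholders for illustration, not the exact template we will use in the workshop.

```python
# A minimal sketch of a prompt covering the seven PICFATD elements.
# Everything here (names, wording) is illustrative, not the workshop's template.
def build_prompt(question: str, retrieved_context: str) -> str:
    return f"""
Persona: You are a careful nutrition assistant.
Instruction: Answer the question using only the context provided below.
Context: {retrieved_context}
Format: 3 to 5 short sentences, citing the page numbers you relied on.
Audience: A general reader with no nutrition background.
Tone: Friendly and factual; say so if the context is insufficient.
Data: {question}
""".strip()

print(build_prompt("How much protein do adults need per day?",
                   "[chunks retrieved from the nutrition PDF would go here]"))
```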
Then I'll take some questions in the chat, but before that let me tell you the philosophy of this workshop. The way I have designed this workshop is that it will cover all these three aspects. It will cover foundations, which is the most important according to me, where you should know the nuts and bolts of the entire RAG pipeline and you should be able to make engineering decisions when the time comes. Second is practicals. So at various places I'm going to give you practical insights regarding which chunking strategy to use, which embedding strategy to use, how to deploy the RAG project; that comes under practicals. And then finally I'm going to
leave you with research questions or
research directions. So after this
workshop, I'm going to show you what are
the open research problems in this area
which you can now immediately start
working on uh after these live sessions
are done.
Let me take questions in the chat. So Prashant has asked: this might be a question for later, but should everyone move from plain vanilla RAG to agentic RAG? In our production setup we are only seeing 55% accuracy with standard RAG. That's a good question, Prashant. So I'll tell you my experience from industry: so far we have done around 16 industrial projects, and out of those, 10 projects have been RAG-based projects. In those we have been able to satisfy the customer with a pure RAG pipeline. And when I say vanilla RAG, it's not just a simple "you upload a PDF, you query from that PDF, and you give it to the LLM". In the pipeline which we designed for the customer we did not use agents, but we used many modern things which ground the responses; I'm going to share that knowledge also today. But vanilla RAG is working for problems which are not too complex, in my opinion.
So let's say a company comes to you: there are many chatbot requirements in industry, and all of those can be solved with vanilla RAG. Then, beyond chatbots, there are some level-two requirements, which is basically a company that wants to make a code generator based on their docs; that can also be solved by RAG.
Agentic RAG
plays a very crucial role when you want
access to external tools or when you
want to do something complex. Let's say
a company wants to build its own deep
research agent.
That is a difficult thing, and there traditional RAG won't play the main role. But at least in my experience over the last one year, and this is one of the main reasons I thought of doing this workshop, RAG is still very relevant, and vanilla RAG works for many company problems of level one and level two: let's say chatbot generation, code generation based on what they have, and so on. Then another question in the chat is when the lecture notes and recordings will be uploaded. The lecture notes and the recordings I'll share after each lecture is done. So after the first lecture I'll send to each participant's email the link to the whiteboard notes and the link to the recording.
What is agentic RAG about? Agentic RAG is basically this (I'll explain it in detail later): in a RAG pipeline you have access to embeddings, right? So think of the embedding store as a tool. If you start thinking of the embedding store as a tool, then it suddenly becomes an agentic pipeline, where along with all the other tools the agent also has access to the embedding store or the vector store.
Would you discuss query transformations?
Yes, I will discuss that towards the end
of this workshop.
Would plain RAG help in application modernization? Yeah, definitely. One common application which I would like to share with all of you is ITSM tools. How many of you are aware of information technology service management? If you look at, at least in India, there is a whole middle layer of companies which operate in the ITSM space. Let's say you are on Razorpay and you make a payment through Razorpay: what happens on their back end? How is the payment stored and processed? That's essentially information technology service management. If you are on BookMyShow and you book a movie ticket, what happens on the BookMyShow server? How are they managing the different clients which are booking? That's usually managed through an ITSM company. So these companies usually provide dashboards to players like Zomato and BookMyShow, players which need an IT infrastructure. Those dashboards have been based on legacy systems or traditional pipelines for a very long time, and now they want to integrate chatbots within those dashboards. For these types of integrations, RAG systems will still play a very crucial and important role, because they usually have a fixed data source.
LlamaIndex? Yeah. So LlamaIndex and LangChain, LangGraph, all of them can implement RAG pipelines very easily. And if I were to make a tutorial on LlamaIndex using RAG, that would probably be a 25-30 minute tutorial. But my main aim here is to build such a strong foundation that after this, any tool will seem very simple to you, whether it's LlamaIndex, LangGraph, or LangChain.
Um so let's get started now. I've used this terminology RAG many times so far, right? And those of you who don't
know or have not heard of this do not
worry about it. Uh I'm going to motivate
it in a lot of detail.
For the purpose of this workshop, imagine that we are in the nutritional domain. The document which we are going to consider is this 1,200-page document on human nutrition, and I'm going to share this Drive link right now with all of you.
We will see what these different things
are. We don't need them right now for
now. All you can do is that when I'm
showing this PDF to you, you can go
ahead and download this PDF from this
drive link which I have just shared on
the chat so that you can also refer to
this PDF along with me as I go along.
Now I want to ask all of you a question
right? Imagine that you are working in
an industry
you are in the engineering team and you are in a meeting with a client. The client has started a
nutritional startup, right? And they
want to spread awareness about nutrition
globally and for that they want to make
a chatbot.
Okay? And they want to make a chatbot
which looks something like this. So
essentially a customer will come, a
customer will log in and a customer will
ask some questions and the answer which
will be generated
has to be very specific and has to be
very grounded. I'll use this term
grounded a lot. What does grounded mean?
Whenever someone says grounded, you should ask: grounded with respect to what? So this startup wants
its answers grounded with respect to the
encyclopedia of knowledge which it has
which is basically this book for now.
It's a 1,200-page PDF about human nutrition, 2020 edition, and it talks about a huge number of things, starting from basic concepts in nutrition, to the human body, then it goes to water and electrolytes; it covers every single thing about nutrition, and they want their answers grounded in it. Now, this same example I'm taking is human nutrition, but you can translate it to other domains as well. If you want to make a chatbot for customer service, there will be a manual of customer questions and what the ideal answer should be.
Uh if you are making a chatbot for this ITSM case, there will be a manual of tickets which customers usually raise and a sample of
the solutions. Now my question to all of
you is that let's say you are sitting in
that meeting as an engineer, right? And
this client comes to you with this
request of making a chatbot. Forget
about RAG or this terminology of
retrieval augmented generation. Let's
think from first principles. How exactly will you build this? That's the goal, right? The goal is to build a nutritional chatbot.
But what's the
key terminology which I mentioned? It
should be grounded.
It should be grounded in factual
knowledge based on the book which I just
shared with all of you. How will you do
that? Add the document content somehow
as a prompt. Use the PDF and pass it to
the LLM.
Okay. So what Madusan has mentioned, that is the RAG pipeline you are describing; let's say you don't know all of this. I'm asking you to think from first principles, where you forget all of your knowledge. Let's say the only thing which you have is ChatGPT, right? You have access to ChatGPT or any LLM for that matter. Let's say you have this and that's it. "Add the PDF in the context of the LLM." "Instruct it to answer from information found in the PDF." "We will load data into ChatGPT."
Okay, so the simplest thing which many people are saying is aligned with the data portion of this prompt, right? I showed you the seven elements of the prompt, and there is this data portion over here, which is where you usually feed the data and where we usually ask the question. So many people are saying, okay, this seems like a simple enough task, why not just do that. There is also an answer from Prashant about keyword-based search. So keyword-based search, okay, that can be done, but you want to use a modern approach. So you propose to the client that hey, this seems like an easy thing to do: we just make a front end, and that front end looks something like this. This is the human query, this is the answer; this is the human query, this is the answer. The human query I'm denoting by HQ, and this is the answer to the human query, HQ.
So you start thinking from the front end. Then you think that whenever the human query is asked, you pass it to an LLM directly. Whenever a human query is asked, you pass it to an LLM like ChatGPT, and along with it you also pass the PDF. Then you make this API call to the LLM, and that answer you now show in the front end. Then the user asks another query. You again make an API call to the LLM, you again pass the entire PDF in the context of the LLM, and you get the answer.
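As a rough sketch, this naive version of the pipeline might look like the following, assuming the PDF text has already been extracted to a file and using the OpenAI Python client; the model name and file path are placeholders I made up for illustration.

```python
# A minimal sketch of the naive approach: send the whole document on every
# single query. File path and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("human_nutrition_text.txt", encoding="utf-8") as f:
    full_document = f.read()  # text already extracted from the PDF

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer only from the document provided."},
            {"role": "user",
             "content": f"Document:\n{full_document}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What are the main functions of proteins?"))
```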
That's what would have naturally come to
my mind if I were thinking from first
principles and if I did not know
anything about retrieval augmented generation, really.
But what are the issues with this approach? Can you try to think about it: as an engineer, you go back and you try to implement this. What will be the issues with this approach? So Amit is saying high cost. Samrat is saying too many tokens, context length, too many tokens. So let's actually see this, and I encourage all of you to do it in practice. Go to ChatGPT. I went to ChatGPT right now, I put in this exact same PDF, and I asked: what are the number of tokens in this document?
What are the number of tokens in this
document? Does it fit your context
window? What is a context window? The context window is the number of tokens which a language model can look at at one time before producing an answer. So think of it like this: imagine you are being bombarded with information. Let's say someone tells you about one topic, then the lecture goes on for 2 hours, 3 hours, 4 hours, 5 hours. At some point you will start losing information, right? The context window is the maximum amount of information you can fit in at one time and still produce coherent answers.
Now, whenever an LLM like GPT is designed, the context window is fixed. So if you put in this document and ask what the number of tokens is and whether it fits the context window, the number of tokens turns out to be larger than the context window of ChatGPT, and I'm using GPT-5 here, which is around 128K. So here we see that the entire document does not fit into memory at once.
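If you want to check this yourself in code rather than by asking the model, a rough sketch with tiktoken looks like this; the encoding name and the 128K limit are assumptions for illustration, not official figures for any particular model.

```python
# Count tokens in the extracted text and compare against an assumed
# context-window size. Encoding name and limit are illustrative assumptions.
import tiktoken

with open("human_nutrition_text.txt", encoding="utf-8") as f:
    text = f.read()  # text extracted from the PDF

enc = tiktoken.get_encoding("cl100k_base")
n_tokens = len(enc.encode(text))

context_window = 128_000
print(f"Document tokens: {n_tokens:,}")
print("Fits in the context window" if n_tokens <= context_window
      else "Does NOT fit in the context window")
```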
And what will happen if the entire document does not fit into memory? If the human asks a question related to, let's say, this chapter on nutritional issues, and in the context only the tokens up till page number 700 are filled in, then the relevant context is lost. The LLM will not be able to answer correctly; the answers will be wrong. And then what will the LLM do? You know what the LLM will do. After this point,
the LLM might start to answer from its own knowledge of pre-trained information. The LLM says: this document does not fit in my context window and I don't see the relevant text in my context, so I'll use my own pre-training data; that's fine, I don't need to rely on any document. And when an LLM becomes overconfident like that and starts thinking it has the data in its own knowledge or its own corpus, that leads to one of the major problems which retrieval augmented generation addressed, and that's the problem of hallucination. It did not fully solve it, but it was a good step in that direction. So if you pass the entire PDF at once, it might exceed the context window of language models, and that might lead to hallucinations.
What is the solution to this problem? The solution came with this paper, which was released in 2021, and the solution is retrieval augmented generation. Now, you can definitely read through this paper, but the idea of retrieval augmented generation is very similar to an example which you all know.
Let's say you have been given this text on human nutrition and you have an exam, but it's an open book exam. I hope all of you know what an open book exam is: in an open book exam you can put the book in front of you and you have access to all this material. So you have access to the entire book in an open book exam. You are sitting in that lecture hall and you see a question related to proteins, let's say. How will you answer this question at that point? Can all of you try to think about it, if you are sitting in that open book exam and you have been asked a question like this? "Yeah, go to the index, find the topic." "Find it in the chapter index." Yeah.
So what all of you will probably do is that you will look at this word, proteins, and then you will go through this PDF from the start. You will maybe look at the index or the table of contents. If it's not there in the table of contents, you will go through all the pages and you will try to find the page where this particular information shows up. Then you will highlight that information and you will use this knowledge. The question which has been asked might not be completely related to that information, but you will use that information from the book. And plus, another key component which of course you need is your own mind. Your own mind already has some information, right, because you might have studied for this exam. On top of that, you get some information exactly based on this book's contents, and then you produce the answer. Now, this entire pipeline is very similar to what retrieval augmented generation is: the retrieval part is this part, and the generation part is this part.
So if you are not fetching context from this book, and if this whole thing was not there (let's say I move my screen now to this), let's say only this part was there: that's just the generation part, right? But now you have augmented the generation part with some sort of retrieval from this document. That's where the term retrieval augmented generation actually comes from. There is a question in the chat: do you plan to share your screen? Is my screen visible? It's visible, right? Okay. I guess it was frozen for some time whenever I went to this prompt engineering book. Yeah, now I'm back to my main screen. So we retrieved context from this document and we also generated the answer from our own mind. That's retrieval augmented generation. How does it translate into the
case of the startup app which we discussed? The mind here, which I've mentioned, is the LLM with its own pre-trained knowledge. And instead of passing the entire context to the LLM, we only pass the context which is relevant; and instead of using the word pass, a fancier word is retrieve: we only retrieve the context from that PDF which is relevant. So instead of the earlier pipeline which we saw, what if we make a different pipeline? What if the pipeline now is something like this? We still have our front end, and this is the human question and this is the answer, right? When the human asks a question, it will again go to the LLM; that is fine. But the LLM will also somehow magically get only that piece of context which is relevant.
And now, that's the retrieval part. This relevant context is passed to the LLM. Do you see the problem which this will solve? We started out with the context problem. Now we don't have to pass the entire PDF into the context; we only pass the relevant bits of information. What are the relevant bits? The same bits which, as a student, we highlighted over here when we were doing the open book exam. That's the relevant bit of information which is passed to the LLM, so the context window problem will be solved. The natural consequence of the context window problem being solved is that the LLM will now produce answers which are more factual, answers which are more grounded in reality, based on the exact document which the client has shared with me. Now I can be sure that the answers will be specifically tailored. So when I ask some question here
and when you see the answer being printed on the screen, you will also see citations. These citations actually refer to what portion of the document this generation is referring to. See, this thing directly comes from the document itself on page number 592. This comes directly from the document on page number 53. So you are retrieving relevant pieces from the document, from various places; it does not need to be from one place. You are passing them to the context of the LLM and then you are generating the answer.
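Putting the idea into a rough sketch (we will build the real version step by step later; the chunk format, the similarity function, and top_k here are assumptions for illustration, not the workshop's final code):

```python
# A minimal sketch of retrieve-then-generate. It assumes we already have
# page-level chunks with embeddings; every name here is a placeholder.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, chunks, top_k=3):
    # chunks: list of dicts like {"page": 592, "text": "...", "embedding": np.array}
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def build_grounded_prompt(question, retrieved):
    context = "\n\n".join(f"[page {c['page']}] {c['text']}" for c in retrieved)
    return (f"Answer using only the context below and cite page numbers.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# The resulting prompt is then sent to whichever LLM we choose
# (open-source or closed-source), and the page tags become the citations.
```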
Okay. So that's the whole concept of RAG. If you have any questions, please ask; I'll be taking all the questions through the chat since the size of the room is quite large. This is the whole concept. I just want to make sure the ground, I mean the stage, is clear when we move to the next part. One of the teaching philosophies which I follow is that before explaining anything, you need to understand the context behind it. So I know many of you might be wondering about the details here, right? Like how do we get the relevant context? Which LLM are we going to use? Are you using an OpenAI API key? Regarding the LLM, I'll come to that. That's again an engineer's choice. We can use an open-source LLM and a closed-source LLM, and I'm going to do both. I'm going to use an open-source LLM, so we are going to deploy a local RAG pipeline, and I'm going to use a closed-source LLM also. Do Gemini Gems use RAG? Yes. In fact, many of these players, like Perplexity, have a RAG pipeline underneath.
There is a question by Sankit: what if the question is like summarizing the whole document? Wouldn't it have to parse the entire doc? Yeah. So for summaries there are multiple other things which we can do. For example, if you go to Gemini, and I also encourage all of you to try this: how many of you are aware of the context window of Gemini? You must be aware of this, right? What I actually did is, along with ChatGPT, I passed the same document to Gemini, and what Gemini says is that this document does fall within its context window, because apparently it has a context window on the order of millions of tokens.
Now, for Gemini such a thing might actually work because the context window is very large, and there are many reasons for how Gemini has improved its context window. If any of you are interested in that, I think the answer lies in this blog, which is also a book, by the way. Just check it out if you're interested. Anyway, that was a digression. There are multiple questions in the chat related to retrieval from multiple documents. Whatever I have shown you right now is just one document; you can retrieve from as many documents as you want. It does not need to be restricted to a single document.
Then there is a question: can we retrieve from a database or another format, like images? We can; I'm going to come to that when I come to the data ingestion pipeline. Samir has asked whether hallucination is due to large context. Hallucination can happen due to multiple things; in this case there will definitely be hallucination because of the large context, because the whole PDF will not fit in the context window, so the LLM will have to rely on its own pre-trained knowledge. The answers which it generates won't be grounded in this document. That's why we call it hallucination.
In the case of a RAG application, how important is the quality of the LLM? Extremely important, in fact. But again, there is a trade-off here, Amit. What is the trade-off? The trade-off is with respect to what the organization values. If the organization values privacy, you want to have an open-source LLM on your own server. We are in fact going to use an open-source LLM on our local GPU. I will come to the trade-offs when we get to the engineer's choice section. How important is quality? This I already answered. What if the data is in tabular form? I'll come to the data part right now. So all of you who have questions about the data format, that's the next point which I'm coming to.
Does RAG help in improving named entity recognition? 100%, it does. And in fact, for named entity recognition you have to do chunking in a very specific manner. We did an industrial project recently which involved named entity recognition; for that you'll have to do chunking which is called structural chunking.
I will slowly start answering. Many of
these questions will become clearer. But
one thing which I do want to address is
what RAG was in 2021 and what RAG is now. In 2021, retrieval augmented generation was this cool new thing which had come along to prevent hallucinations, and it's still relevant because it still solves industrial problems. But now just zoom out a bit and take a look at retrieval augmented generation in the context of something which is called context engineering. There is this new field which is emerging, called context engineering.
We talked about context a lot in RAG, and I already mentioned that the context window of LLMs is increasing. For example, what if the context window of all LLMs becomes 5 million? It might happen in the next 2 years. Would RAG still matter, given that you could just pass the entire PDF to the LLM? Again, there is a trade-off: even with Gemini I would not do this. Why would I not do this? Because Gemini charges you per token in the input and per token in the output. If you pass, let's say, 100 PDFs, even if the context window is large, you will incur a prohibitive cost. Right? So even if the context window of LLMs becomes large, RAG will still be valuable to reduce costs.
Although the LLM can handle it from a performance point of view, it's still
not in your best interest to pass the
full document. It's like using an
elephant to kill an ant.
Although you can do it, that does not
mean you should do it. It will be costly
for you for every single request. Why do
you want to pass all the documents?
You'll get charged per token.
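To see why, here is a back-of-the-envelope comparison; every number below is an assumption picked for illustration, not any provider's actual pricing.

```python
# Back-of-the-envelope cost comparison; all numbers are illustrative assumptions.
price_per_million_input_tokens = 2.00   # assumed USD per 1M input tokens
document_tokens = 400_000               # assumed size of the full PDF
rag_context_tokens = 4_000              # assumed top-k retrieved chunks per query
queries_per_month = 10_000

full_context_cost = document_tokens * queries_per_month \
    * price_per_million_input_tokens / 1_000_000
rag_cost = rag_context_tokens * queries_per_month \
    * price_per_million_input_tokens / 1_000_000

print(f"Full document every query: ${full_context_cost:,.0f} per month")
print(f"RAG with retrieved chunks: ${rag_cost:,.0f} per month")
# Under these assumptions: $8,000/month vs $80/month, a 100x difference.
```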
But now think of RAG within this umbrella of context engineering. In the last class we discussed prompt engineering, right? That's intimately connected with RAG, and one more thing which is intimately connected with both of these is memory. Essentially, if you are interacting with this chatbot, the chatbot which I showed to all of you, let's say the user logs out and comes back the next day: how does the LLM know what conversation happened yesterday? Imagine that I go to a nutritionist, right? I go to a nutritionist and I ask a question, or multiple questions; I have a 1-hour session, and I go back again the next day. The nutritionist will of course remember the thread of our previous conversation. Or a therapist: if you go to a therapist, they of course have to remember what has happened in the past. So when you talk about context engineering, memory plays a very crucial role. Here also there is a trade-off: the more memory you save for an LLM, the more context it has, and again the more cost. As the context size increases, the cost increases.
But when you think about RAG these days, you have to think in terms of: what's the context window of the LLM, do I really need RAG? Okay, if the context window is large enough, like Gemini's, I don't strictly need RAG, but I can still do it to save costs; then how much cost can I save? Why have I mentioned prompt engineering here? Because the success of your RAG pipeline also depends on how well you write the prompt.
Sanjiv is asking: can you explain context engineering? Yeah. The best way to explain context engineering is this: if you want to make a production-level app like a RAG chatbot, how are you going to manage the different aspects which show up in the context? What are the different aspects? One is of course the information retrieved through RAG. One is the memory. One is your current state. Then second, where are you going to save this context? Are you going to save it in a vector database, or in a normal database like Postgres? Where are you going to save the embeddings?
I will come to most of these issues in this workshop. But context engineering is a much broader field now, and by 2025 RAG has evolved over these four years. Now we think about RAG in terms of context engineering. The main field is context engineering, and then we start to think: okay, given this context window of the LLM and this application, what's the best thing I can do? Should I do RAG, or should I just do few-shot prompting by passing the whole PDF? How am I going to save memory? Should I save all the conversations as they are, or should I save a summary of the conversations? Think about this: if you talk with someone and the conversation goes on for 1 hour, after 1 hour what do you remember? You don't remember exactly what that person said, right? You remember the summary of key points which your mind automatically forms. So you can use another LLM to summarize.
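A rough sketch of that idea, assuming the OpenAI client again; the model name and the summarization prompt are placeholders, not the workshop's final memory design.

```python
# A minimal sketch of using a second LLM call to compress conversation
# memory into a running summary. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

def summarize_conversation(messages: list[dict]) -> str:
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize the key points of this conversation in "
                        "under 150 words so it can be reused as memory."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# On the next session, the summary (not the full transcript) is prepended
# to the context, which keeps the token count, and hence the cost, small.
```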
So is it best practices around context? Yeah. And when you say context, it means many things; it means memory, among other things. Will RAG be relevant in the long run, given LLMs themselves keep improving? Yeah, this I think I already answered, Amit. Let's say you are JP Morgan, right? If you want to make a chatbot specific to your data, I think RAG will still be relevant, because passing the entire PDF each time will be computationally and cost-wise prohibitive.
Okay. So let's table the questions for
some time because now what we have to do
is that we have to get started with the
first pipeline.
Many people have asked questions about document pre-processing, and I want to spend some time here.
This is the whole pipeline which we are going to build in this workshop, by the way. Let me walk you quickly through the different elements. We are going to start with this nutritional PDF, and I'm going to have a section on this. Then we'll have a section on chunking. Then we have a whole section on embedding. Then we have another whole section on LLMs, whether open-source or closed-source. And then finally we'll put all of this together and run everything on a local GPU. After this is done, we will do production-level RAG and build this website. So we do have a number of
things to cover. At the pace at which we are going, I'm not really sure how much time this workshop is going to take. I'm very happy to answer all the questions, but from your side, please note that it may take more than 3 hours, because we have to do all of these parts. So let's take a call based on how much we cover today and how much we are able to cover tomorrow.
Okay. So the first step is data ingestion, and this is often the most neglected step in tutorials, in video sessions, everywhere, because it's not very cool. When I say cool: everyone talks about embeddings and LLMs, but the part which I think people should definitely be talking about is how you are going to collect the data and how you are going to store it.
Um, so let me ask all of you right if
you have this PDF, how will you collect
this PDF so that a Python interpreter
knows what to do with it? How will you
open this PDF and how will you read this
PDF in code?
So we need to ingest the data and store it somewhere. Our LLM is going to look at that data and then answer questions. But right now it's in PDF format. We humans can read it, right? But Python code needs to understand it.
Someone is mentioning PDF-to-text and the pain of document parsing. Yeah, so this point which you have mentioned, I'll come to that. We will use a Python library to do document pre-processing, and what document pre-processing essentially means here is downloading and reading PDFs. Now, in this section I want to talk about three things: documents which only have text, documents which have images, and documents which have tables. So if you have a simple document which is in PDF format, you can use packages
to load the document, and one popular package is PyMuPDF. I'm going to show you several packages and the way we decide which one to use for a given problem. This workflow which I'm giving you right now is exactly what we do internally when a problem comes in. So check out this package.
Actually let me show the GitHub version
of this package.
So PyMuPDF is a traditional Python library for data extraction from PDF documents; a really very robust library. What it does is that using this library you can essentially pass in any PDF and then open the PDF with PyMuPDF. Using this library you can also read different data.
When I say read different data, you can read different pages. So for example, this entire PDF can be ingested by this library, and then we can save what information is there on every page.
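As a rough sketch of what that looks like in code (the file name is a placeholder for the nutrition PDF from the Drive link):

```python
# A minimal sketch of per-page text extraction with PyMuPDF.
import fitz  # PyMuPDF is imported under the name "fitz"

doc = fitz.open("human-nutrition-text.pdf")  # placeholder file name

pages = []
for page_number, page in enumerate(doc):
    text = page.get_text()          # plain text of this page
    pages.append({"page": page_number + 1, "text": text})

print(f"Loaded {len(pages)} pages")
print(pages[0]["text"][:300])       # preview the first page
```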
Now, let me ask you this question. Let's say this image comes up: what do you think this PDF extraction library will do at this point? Some are saying "all text", some are saying "skip it"; I'm looking for a specific answer. First, let me ask whether it will be able to deal with this image at all. Those who are answering no, that is not the correct answer. It will be able to deal with this image, because this is a digital image. There will be an image tag associated with it, and this image will be downloaded in an image format.
But let's say you get an image like this: there is a restaurant bill and someone takes a photo of it and uploads it somewhere. Will the library I'm showing you deal with that? It will not read this type of image, and that is one key thing to understand. I mean, it will save the entire thing as an image, but it will not read the text present on that image unless the text was typed through a digital form. If the bill is generated through digital software, and every field which is entered is a digital field, that will be taken into account by a tool like this.
But if you have an image which has some characters on it, it cannot. So what do I mean by digital? By digital I mean: let's say I go to an invoice software tool, I fill in entries over here, and I generate a PDF from this tool. That will be a digital entry, because every number is digitized. Then a standard PDF extractor can also read that number and can see what is mentioned here. If it's digital, we can copy text from the PDF; but if it's not digital, like this, we cannot copy text from the PDF, or at least normal Python libraries cannot. That is where we need libraries which can deal with something called OCR.
So the best open-source OCR library, as you can also see from the number of stars it has, is Tesseract. Tesseract is one of the most popular OCR libraries. But before introducing this library, all of you should know why OCR is needed in the first place. For the current PDF which we have, many of you gave the wrong answer here: we don't need an OCR library here. This is just a simple image; there is no text on the image, and a simple PDF extractor can deal with it. You need to use OCR libraries only when you have images which contain, let's say, handwritten text, or images which have been scanned and uploaded into a document, which might be the case for many clients. That is the place where you need Tesseract. Tesseract can extract handwritten text, digitally scanned text, and so on; Tesseract can extract text from images like this.
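A minimal sketch of that with pytesseract, the common Python wrapper around the Tesseract engine; it assumes the Tesseract binary is installed on the machine (for example via apt on Colab), and the image file name is a placeholder.

```python
# A minimal sketch of OCR with pytesseract. Assumes the tesseract binary
# is installed; the image file name is a placeholder for a scanned bill
# or a handwritten page.
import pytesseract
from PIL import Image

image = Image.open("scanned_bill.jpg")
text = pytesseract.image_to_string(image)

print(text[:500])  # the characters recognized in the image
```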
That's the second option which I wanted to show you in this data ingestion pipeline. Let's see: how would the text extractor know if the image contains text or not? It would not know, right?
Oh, you mean Tesseract? Tesseract knows it because the libraries which it uses specifically look for text in that image. But if you use PyMuPDF, it will not know; it will just save that entire image as one image.
Oh, the fruit image, right? It will not know whether there is text or not. So even if this image has text, PyMuPDF will save it as an image, but we will not know what text is written on that image. That's the main issue. The image will be saved; that's not an issue. But all the information on that image will be accessed as one whole body. There will be nothing like "there are characters in this image" or "there is text in this image". The granularity will be lost if you don't use OCR.
Then there is the question of tabular data, right? How do you deal with tabular data? For that, what I really want to do is introduce a third library which has now become extremely popular, and I would say it's hands down one of the best libraries for language modeling tasks. How many of you have heard of Docling? So Docling is relatively new; I think it's newer than all of the other libraries which I showed you, but we have already used Docling in our industrial projects and it's amazing. One reason why Docling is amazing is because it is specifically meant for generative AI. What do I mean by that? Whenever Docling encounters a table in a PDF, the table is saved as a real table: rows and columns are preserved.
Docling can even convert a schema into a JSON format directly, and if any of you have used language models in production before, you know that it's very important to retain certain elements in JSON format or in Markdown format. So if you encounter a table somewhere, or any schematic or schema, that can also be analyzed by Docling and saved as a table. Further, Docling can be linked with an OCR tool as well; Docling can be externally linked with an OCR tool like Tesseract. So you have the OCR capability as well. You can extract tables very easily, and you can extract schemas very easily.
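A rough sketch of Docling's basic usage, with a placeholder file name; this follows the library's documented quick-start pattern at the time of writing, so double-check against the current Docling docs.

```python
# A minimal sketch of converting a PDF with Docling and exporting it as
# Markdown, where tables are kept as real tables rather than flattened text.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("messy_report_with_tables.pdf")  # placeholder file

markdown = result.document.export_to_markdown()
print(markdown[:1000])  # tables appear as Markdown tables
```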
In fact, if any of you are interested, this is the Docling technical report, where they mention exactly how they manage to retain tables.
What happens if the input document has text, images, tables, and images with text? What exactly will you do in that case? If your document is extremely messy, if it has images, if it has tables, if it has, let's say, scanned copies, then you can use Docling and you can use OCR along with it. If your document is very simple, like what I have, you can use PyMuPDF. If your document does not have too many tables but just has scanned images, you can use Tesseract. A rough decision rule is sketched below.
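Here is that decision rule as a small sketch in code; the boolean "profile" of a document is something you would judge by inspecting a few samples by hand, and the tool names returned are just the options discussed above.

```python
# A rough sketch of the engineer's-choice decision described above.
def pick_parsing_tool(has_scanned_images: bool, has_tables: bool,
                      is_messy: bool) -> str:
    if is_messy or (has_tables and has_scanned_images):
        return "docling + external OCR (e.g. Tesseract)"
    if has_scanned_images:
        return "tesseract (OCR)"
    if has_tables:
        return "docling"
    return "pymupdf (fast, simple text extraction)"

print(pick_parsing_tool(has_scanned_images=False, has_tables=False,
                        is_messy=False))   # -> pymupdf for our nutrition PDF
```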
So this is the first engineer's choice section which I have. I mentioned at the start that I will have this section for all these parts: data ingestion, chunking, embedding, and open-source or closed-source LLM. This is the first point where I have this section called engineer's choice. When given a project, what document processing tool are you going to use? That's the first thing which you need to understand, and that depends on the type of documents which you really have. Now, actually, before this there is one more step which I have not discussed, and that is related to scraping. It may happen that some client websites have PDFs which you can download, but in some cases the data is not in PDF format. So you need to first scrape that entire data and then use the processing tools which I have just mentioned.
How can we have a hybrid pipeline with all these three? So Samrat, if you want a hybrid pipeline, then the best option is to use Docling with an external OCR tool. If you go to Docling, in the documentation itself they say that they can handle diverse formats, they can export into various formats like Markdown, HTML, and JSON, and most importantly they have extensive OCR support for scanned PDFs and images. So this one library has all of these things if you are dealing with complex PDFs, or complex images rather. One more thing which we explored at Vizuara recently is Mistral OCR. How many of you are aware of this? They have a special model which they released recently which is apparently supposed to be very good.
Okay, so one good question has been asked in the chat: what about this Miro board itself, if it's to be retrieved? Now let me ask this question to all of you. Let's take this Miro board which I have: which tool would you use to retrieve text from it? Docling for sure would be good, but I would probably use Tesseract for this. The reason I would use Tesseract is: what do I have here? If you think about it, I have some images and I have some text which is written, and this
text is very messy. So if you take images of this, of course a normal PyMuPDF will not be able to handle it. But I don't have anything too complex: I don't have any table really; even this table which I have is an image, not a real table. So technically I don't have any tables. I would probably take images of this and make them into a PDF, so I just have a PDF; I just have images and text which will be scanned. So I definitely need an OCR tool over here.
Yeah. This Mistral OCR, I want to spend some more time on, because it's new; it has just come out, I think three months back. We are trying it at Vizuara right now. I don't know how good it is yet, but it's supposed to be amazing. And there are several such LLMs which are specifically meant for doing OCR tasks. Scraping: so let me tell you a bit about scraping right now.
about scraping right now. So let's say you go to Mahindra and
So let's say you go to Mahindra and Mahindra website
Mahindra website and you are doing a project with
and you are doing a project with Mahindra and what they have told you is
Mahindra and what they have told you is that I want to make a chatbot which is
that I want to make a chatbot which is specific to Mahindra rise let's say but
specific to Mahindra rise let's say but they have not given you any data
they have not given you any data then what will you do at this stage how
then what will you do at this stage how do you collect the data if the client
do you collect the data if the client has not given you PDF copies of the data
has not given you PDF copies of the data or if the client has not given you uh
or if the client has not given you uh like anything about the data
like anything about the data The only thing which you can do at this
The only thing which you can do at this point is called as scraping.
point is called as scraping. Uh yeah. So what you have to do is that
Uh yeah. So what you have to do is that you have to go through different
you have to go through different sections and you have to use a scraping
sections and you have to use a scraping tool to scrape this data. I'm going to
tool to scrape this data. I'm going to tell you two to three scraping tools
tell you two to three scraping tools which can be used. So first is called
which can be used. So first is called fire crawl.
Uh again very good scraping tool. It has around 50,000
around 50,000 stars. One good thing is that it
stars. One good thing is that it actually takes with this fire crawl tool
actually takes with this fire crawl tool you probably don't even need a PDF
you probably don't even need a PDF extractor tool because it already takes
extractor tool because it already takes the entire website and converts it into
the entire website and converts it into LLM ready markdown or structured data
LLM ready markdown or structured data that's one tool then second is as
that's one tool then second is as someone has mentioned in the chat
someone has mentioned in the chat beautiful soup
if you have HTML pages especially beautiful soup looks at that and fully
beautiful soup looks at that and fully extracts that and another thing is
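A minimal sketch of scraping a single HTML page with requests and Beautiful Soup; the URL and the tags kept here are placeholders, and you should always check a site's terms before scraping it.

```python
# A minimal sketch of scraping one HTML page with requests + BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/docs/page-1"      # placeholder URL
html = requests.get(url, timeout=30).text

soup = BeautifulSoup(html, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(headings[:5])
print("\n".join(paragraphs[:3]))
```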
And another tool is called Puppeteer. It's an automation tool, and with Puppeteer you can do some clever things. Someone mentioned named entity recognition, right? So what if you want to go through different sections, but you only want to take headings or titles from each page? How will you do that with a normal web scraper? It's a bit difficult to do that specific kind of extraction. In Puppeteer, what you can do is automate the scraping by specifying that you only want a certain font size, or only header tags, or only paragraph tags to be selected when you scrape. You can install Puppeteer as a JavaScript library. It's an API to control Chrome or Firefox, and it can go through pages automatically. So my question to all of you is this: let's say a client has 5,000 links. You cannot manually go and scrape each link, right? You need an automation tool which goes through one link, scrapes whatever is there, then goes through the next link and scrapes whatever is there. Puppeteer provides you that advantage. You can automate an entire workflow through Puppeteer and just sit back and get all the files downloaded, but you need to define that workflow very nicely.
Selenium is also good. But Jay, I found that Puppeteer, at least for us: we had a client project where we used Puppeteer. They had around 5,000 documents they wanted extracted through scraping, and manual scraping took a long time, so we used Puppeteer at that point. How effective is Firecrawl when dealing with such websites? Yeah, I'm not sure actually how it bypasses the authentication of websites; I need to check that. Yeah, manual scraping takes a huge amount of time; in fact, for the client project which I mentioned earlier, we were doing manual scraping, but it was just too expensive in terms of time and everything. What about extracting data in tabular format? Yeah, definitely, Ashwini. Basically the Docling tool which I mentioned can extract data from anywhere: from images, from tabular formats, from PDF snippets, basically anything which you want. But just keep in mind that if any of you is actually working on an industrial project, sometimes clients don't give you data even in PDF format; then you have to do scraping on top of it.
Okay. Now what we are going to do is code the first part which we just saw. I will take the remaining questions in the chat, but first: for this PDF, all of us have identified that we will use PyMuPDF. Did everyone understand why we are using PyMuPDF for the current task? Can you type yes in the chat if you have understood why we are using PyMuPDF for the current project and not any other tool? Okay, good. So now our coding journey starts. I'm going to share this Google Colab code file with all of you, and after the data extraction is done, we are going to take a small break. I know attention spans are a bit short, but no issues.
So this is the Google Colab code file, and someone has asked to share the PDF, right? Yeah, for the PDF, I actually shared this document at the start of the lecture itself. Not the document, sorry, this Drive folder, at the start of the lecture itself.
Oh yeah. So, Jay, that's a great point which you mentioned and which I definitely want to address; I actually forgot to bring it up. You might be thinking: why does PyMuPDF even exist? It's because it's extremely fast. PyMuPDF is 10 to 15 times faster than Docling. That's the trade-off here. If you look, there are some Reddit threads which actually argue about this. Yeah, see, Docling is at least 50 times slower than PyMuPDF. So if you have simple text like we do, don't use the powerful libraries unnecessarily; that will just be very slow for you. But that's a good point you bring up. I wanted to touch upon it, but it slipped my mind.
my mind anyway. So all of you have access to
anyway. So all of you have access to this notebook. Now the first thing which
this notebook. Now the first thing which you have to do is you have to go to
you have to do is you have to go to runtime and you have to switch to T4
runtime and you have to switch to T4 GPU.
GPU. We are going to start very slowly and we
We are going to start very slowly and we are going to start with the data
are going to start with the data injection pipeline. Okay. So before that
injection pipeline. Okay. So before that there is some a long text here which you
there is some a long text here which you can even read after this lecture is
can even read after this lecture is done. I have covered this all in the
done. I have covered this all in the initial portion of the class. This
initial portion of the class. This schematic also I have shared on the
schematic also I have shared on the mirro board. Now what we can do is
mirro board. Now what we can do is directly start from here requirements
directly start from here requirements and setup. So if all of you are
and setup. So if all of you are connected to T4 GPU, this notebook
connected to T4 GPU, this notebook should by the way by default already
should by the way by default already connect you to T4. And then just click
connect you to T4. And then just click on this. So the first two cells are
on this. So the first two cells are where we are installing the packages.
where we are installing the packages. These two steps will take some amount of
These two steps will take some amount of time. So I'm going to wait for here till
time. So I'm going to wait for here till all of you are running this. And
all of you are running this. And meanwhile, let me answer some questions
meanwhile, let me answer some questions in the chat which I might not have seen.
in the chat which I might not have seen. Can you share the PDF? I Yeah, I think I
Can you share the PDF? I Yeah, I think I shared it right now.
shared it right now. I am working on a project where I need
I am working on a project where I need to extract release documents from GitHub
to extract release documents from GitHub pages. Is Puppeteer a good choice? Yeah,
pages. Is Puppeteer a good choice? Yeah, definitely.
definitely. First, Spurs, I would encourage you to
First, Spurs, I would encourage you to explore fire crawl
explore fire crawl because Puppeteer is a very low-level
because Puppeteer is a very low-level library. When I say lowle, it directly
library. When I say lowle, it directly operates at JavaScript. So if you want
operates at JavaScript. So if you want to use puppeteer you need to be very
to use puppeteer you need to be very comfortable with JS code.
comfortable with JS code. Fire crawl abstracts many things. So
Fire crawl abstracts many things. So it's easier to use. If you are
it's easier to use. If you are comfortable with JS then I would suggest
comfortable with JS then I would suggest to go ahead with JS. Sure.
After setting up the data pipeline, the biggest challenge I faced was keeping the changing data synced with the vector database; any suggestion? Great point, Prashant. I will come to this. I do have a suggestion, and in one word the suggestion is to use pgvector; we are going to use pgvector. Essentially, the best way to keep your regular database and your vector database synced is to keep everything in one place, and the way to do that is to use a Postgres database with pgvector. We'll see that tomorrow. Where can I get the link to this notebook? The link I have already shared. Oh, I shared the copy link. In this copy of the file I have removed the Hugging Face access token. Yeah, this is that link.
There is some question in the chat about this book. I actually have not seen this one yet; let me check. Yeah, it seems to be very highly cited, especially for vision-based document retrieval. One metric I look at to check how popular a tool is, is GitHub stars and how active it is on GitHub. It seems to be quite active; the last commit was made 5 days back. That's a good paper, I'll definitely add it to my reading list.
I was asked in an interview: if we extracted anything using an LLM or RAG, how will we validate whether it is correct? Again, a very good question. So, Prem, always remember that there are two types of validation: structural validation and semantic validation. When I say structural validation, I mean checking whether the structure of your retrieved or extracted items is correct. One way to implement structural validation, which we have already seen in one of the previous lectures, is to use Pydantic, where we can check whether the format is correct. To do semantic validation, there are two options: human as a judge or LLM as a judge. Either you have ground-truth data and you validate against that, or you use a larger LLM to give you the ground truth and validate your extraction against it.
extraction with that. How do you keep track of good papers and
How do you keep track of good papers and make it a habit? Yeah. So that is a bit
make it a habit? Yeah. So that is a bit challenging. So one thing which has
challenging. So one thing which has honestly worked for me amit bit
honestly worked for me amit bit counterintuitive is LinkedIn. My
counterintuitive is LinkedIn. My LinkedIn feed is extremely well curated
LinkedIn feed is extremely well curated and that is also because I spend a lot
and that is also because I spend a lot of time scrolling through LinkedIn and I
of time scrolling through LinkedIn and I read mostly I'm on LinkedIn so I read
read mostly I'm on LinkedIn so I read things which I like. So algorithm picks
things which I like. So algorithm picks up on that. So everything which I get is
up on that. So everything which I get is from people who talk about new things.
from people who talk about new things. Um so I'm following some key set of
Um so I'm following some key set of people who whenever something new is
people who whenever something new is released they will post it.
released they will post it. So mostly I'm trying to avoid flashy
So mostly I'm trying to avoid flashy things on LinkedIn. There are like two
things on LinkedIn. There are like two camps. One camp is like whenever let's
camps. One camp is like whenever let's say context engineering right whenever
say context engineering right whenever context engineering is a thing then
context engineering is a thing then someone will make a post that five
someone will make a post that five reasons why you should learn context
reasons why you should learn context engineering. I avoid those but on my
engineering. I avoid those but on my feed there are people who write about
feed there are people who write about let's say context engineering what are
let's say context engineering what are the papers you should read then how is
the papers you should read then how is it different from so more informative
it different from so more informative and
and not too much flash it's getting a
not too much flash it's getting a challenge for me but I make it a point
challenge for me but I make it a point to at least read two papers per week
to at least read two papers per week and also implement those
I I don't have a two I do have a two read list I'll share it with you. That's
read list I'll share it with you. That's only that I make it week to week. I have
only that I make it week to week. I have it for this week. So in this week's two
it for this week. So in this week's two read list, I have this transfusion
read list, I have this transfusion paper.
This paper is on my to-read list for this week. And one more thing on my to-read list is the link which I already shared with you; it's this. In fact, I already ordered one of these books for our office, because I'm now encouraging all of our people to master GPU programming. I can't believe they made this free. It's amazing, but an extremely complex walkthrough of how LLMs utilize our GPUs. I like ordering physical books, so I've ordered two copies of this for our office. This is also on my to-read list. I've finished two chapters. I'm going to make a course on this, because I have literally not found a single good course on GPU programming anywhere.
Okay. So, how many of you have finished running up to these two steps at the moment? How many of you have finished installing the packages? You have, right? Okay, good. Now, the next step is document processing. In this part we are going to download the PDF.
PDF. If it does not exist it's fine. So one
If it does not exist it's fine. So one way is to just add it on the left hand
way is to just add it on the left hand side over here. But if it does not exist
side over here. But if it does not exist in this code we'll just go ahead and
in this code we'll just go ahead and download the PDF. And the next code
download the PDF. And the next code block is where we are actually going to
block is where we are actually going to read this PDF. So let's go through this
read this PDF. So let's go through this code block step by step. First there is
code block step by step. First there is a text formatter. So what it will do is
a text formatter. So what it will do is that it will make sure there are not
that it will make sure there are not empty spaces in any of the text which we
empty spaces in any of the text which we are reading. Then we have this open and
are reading. Then we have this open and read PDF. So this import fits which we
read PDF. So this import fits which we are doing right that's the pyu pdf.
are doing right that's the pyu pdf. This py mu pdfdf github repository when
This py mu pdfdf github repository when we do import fits that loads the
we do import fits that loads the package. Um and the way we open a file
package. Um and the way we open a file through pyu pdf is doing fits.open.
through pyu pdf is doing fits.open. Then what we are going to do is that we
Then what we are going to do is that we are going to go through every single
are going to go through every single page in my document. I'm going to get
page in my document. I'm going to get text from that page. So page dot get
text from that page. So page dot get text. Okay. Then what I'm going to do is
text. Okay. Then what I'm going to do is that I'm going to format
that I'm going to format uh this text to remove empty spaces.
uh this text to remove empty spaces. Um and then what I'm going to do I'm
Um and then what I'm going to do I'm going to maintain a list. So I'm going
going to maintain a list. So I'm going to maintain a list like this.
to maintain a list like this. So every page for each page I'm going to
So every page for each page I'm going to store the page number the number of
store the page number the number of characters on that page the word count
characters on that page the word count on that page the number of sentences on
on that page the number of sentences on that page and the actual text.
that page and the actual text. Okay.
Okay. So what this piece of code is doing is
So what this piece of code is doing is that we are maintaining a list called
that we are maintaining a list called pages and text and each element of that
pages and text and each element of that list is a dictionary.
list is a dictionary. So the first element
The first element of this list is page one. And what will page one have? Page one is a dictionary. Similarly, the second element is page two, and so on. So what I'm essentially doing is making a list where this is page number one, this is page number two, dot dot dot, right up to page number 1,208, and for each page I'm storing these values. I'm storing the page number, and of course the main thing, the text, I'm storing as well.
the text also I'm storing. So you can run this now and then what
So you can run this now and then what you can do is that you can
you can do is that you can just randomly print out two dictionaries
just randomly print out two dictionaries from this list. So I have printed out
from this list. So I have printed out the page number text for one page and
the page number text for one page and this is for second page. So you might be
this is for second page. So you might be wondering why is this minus 41 here,
wondering why is this minus 41 here, right? Why am I subtracting minus 41
right? Why am I subtracting minus 41 over here?
over here? The reason is that if you actually take
The reason is that if you actually take a look at our
a look at our uh book right, it really starts from
uh book right, it really starts from page number 41 or 42 here. This is where
page number 41 or 42 here. This is where our book actually starts. Yeah. Here. So
our book actually starts. Yeah. Here. So what is actually page number one
what is actually page number one is page number. So you need to subtract
is page number. So you need to subtract 42 pages actually to get to page number
42 pages actually to get to page number one.
one. So all of the pages which come before
So all of the pages which come before this are marked as negative since we
this are marked as negative since we subtract 41 and then page number one
subtract 41 and then page number one will rightly start from here.
Then we can just get a random sample. Our list is called pages_and_texts, and we can get a random element from it. So we can see we have got page number 1019. The number of characters is 1574 and the number of words is 270. Oh, and by the way, we are also maintaining the number of tokens. For this, the simple thing we are doing is taking the number of characters divided by four; that's the approximate number of tokens we are assuming. So each page dictionary will look something like this: we have the page number, the number of characters on that page, the number of words on that page, the number of sentences on that page, and the actual text. That's it.
And then you can actually get some statistics on the text. Just run this and look at the different statistics. So, for example, for this page the character count is 29, the word count is 4, the sentence count is 1, and the page token count is 7.25. Then you can get the overall statistics, and this is the main thing we want to focus on right now. Let's take a look at the mean row: on average, each page has roughly 198 words, roughly 10 sentences, and around 287 tokens.
Why is this important? Why are we looking at the number of tokens on each page? Can someone try to think about why we are looking at the number of tokens on each page? There is an error which Krishna has got: pages_and_texts is not defined. Krishna, have you run this cell? Because we have defined pages_and_texts over here.
Now I'm going to the whiteboard, and the question I'm asking all of you is this: we got these statistics, right? We got these statistics that each page has, let's say, so many tokens. Eventually we want to take a page and convert that page into an embedding vector, and let's say we use this model, all-mpnet-base-v2. The issue is that, in very fine print, they have mentioned that input text longer than 384 word pieces is truncated. That is going to be an issue for us. If a page is more than 384 word pieces, we cannot embed the entire page into a vector using this model, because then some information will unfortunately be lost. So that's why it is a good idea, whenever you're looking at pages, to check how many words they have on average and how many tokens they have on average. Here it seems that each page is around 287 tokens on average, which is less than 384, so it is fine to go ahead with this, and potentially each page can be embedded with this model. Currently we have not decided which embedding model to use. We have not even decided whether one page equals one chunk. But potentially, if we decide that one page is one chunk and we want to embed each page, we can very safely use this all-mpnet-base-v2. That's the reason you should keep track of how many tokens and how many words there are on each page. The thing is, when you directly use RAG libraries like LangChain, all of this information is hidden from you: you hand over a PDF and everything is done for you. But you should see for yourself how many pages there are, what the token count is on each page, what the word count is on each page, what the sentence count is on each page, and so on.
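To make the 384-word-piece point concrete, here is a hedged sketch that counts how many pages would be truncated if we embedded one whole page per chunk. It assumes the pages_and_texts list from above and the all-mpnet-base-v2 model mentioned in the discussion; the exact limit is whatever the loaded model reports.

```python
# How many pages exceed the model's word-piece limit?
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.max_seq_length)                  # typically 384 word pieces for this model

too_long = 0
for page in pages_and_texts:
    n_pieces = len(model.tokenizer.encode(page["text"]))
    if n_pieces > model.max_seq_length:
        too_long += 1
print(f"{too_long} pages would be truncated if embedded whole")
```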
We are going to come to chunking right now, so don't worry about it; the next thing we are going to do is chunking. Rahul has asked a question: is RAG plus an SLM a practical combination? Yeah, definitely, because RAG is in many cases much better than fine-tuning. Anyway, we'll come to that after the lecture is done. That is the mean; what about the max? Sure, but check the standard deviation also, right? The standard deviation is 140, so even then, one or two standard deviations only take you to around a 400-token length or so. So we are fine.
Sorry, I did not understand; what do you mean about LangChain? So when you see RAG tutorials with LangChain or LlamaIndex, those tutorials are 10 to 15 minutes long and they completely skip this part. They already assume that you have a clean PDF, and everything starts at a much later stage. But in practice, this is what you have to do first. This is the exploratory data analysis equivalent: when we do a normal machine learning problem we do EDA, right? You also need to do some EDA when you do RAG. There is a question about the lecture recording. I will share the lecture recording and the Google Colab; the Colab I have already shared in the chat.
Okay. So now we are going to take a break for some time, and then we are going to cover chunking. I definitely do want to cover chunking today, because it is one of the most important pieces of the puzzle, and nowhere on the internet, in any YouTube video, have I found a comprehensive explanation of chunking. There are blogs on chunking, and there are good blogs, but blogs can only take you so far, right? In the chunking section, we are first going to understand all the types of chunking in detail, and then we are actually going to code different chunking strategies. We are going to code them from scratch and compare the different chunking strategies with each other.
strategies with each other. Um but we will take a break. So earlier
Um but we will take a break. So earlier what I had planned is that I planned one
what I had planned is that I planned one and a half hour for today and one and a
and a half hour for today and one and a half hour for tomorrow. But I think
half hour for tomorrow. But I think today
today today itself we will take around 2 and a
today itself we will take around 2 and a half hours it looks like. So if any of
half hours it looks like. So if any of you uh I did not plan three-hour
you uh I did not plan three-hour workshop today Sanjay honestly but uh
workshop today Sanjay honestly but uh it's good that you are asking so many
it's good that you are asking so many questions
questions we have many more things left to cover.
we have many more things left to cover. So it depends on your time schedule. If
So it depends on your time schedule. If all of you want to catch the recording
all of you want to catch the recording you can do that. But anyways, I will
you can do that. But anyways, I will come back now after 5 minutes
come back now after 5 minutes uh to start the chunking part. If you
uh to start the chunking part. If you are available, you can stay stay live to
are available, you can stay stay live to watch the chunking. If not, I'm going to
watch the chunking. If not, I'm going to upload the lecture recording anyways.
Uh yeah, Samrat, when we do chunking, it's
yeah, Samrat, when we do chunking, it's not
not not we don't need the EDA really later
not we don't need the EDA really later when we do the chunking, but it's still
when we do the chunking, but it's still good to see what's the number of tokens
good to see what's the number of tokens we have. It might change our intuition
we have. It might change our intuition later.
later. Okay. So I'll just come back after 4 to
Okay. So I'll just come back after 4 to 5 minutes. It might take 1 to one and a
5 minutes. It might take 1 to one and a half more hours today because so what we
half more hours today because so what we can do today we can finish chunking and
can do today we can finish chunking and then tomorrow we can do embedding the
then tomorrow we can do embedding the LLM and then the final production.
LLM and then the final production. Yeah. Thanks guys. I'll I'll come back
Yeah. Thanks guys. I'll I'll come back after after around maybe 9:35.
All right everyone. So let's begin with the next part of today's lecture, which is going to be chunking. There is a reason why I have allocated a separate section to this: I believe it is one of the most important pieces of the RAG pipeline. Let me explain to all of you why chunking is important. Until now, what we have done is process the PDF; that part is done. Okay. Now, finally, this is our LLM. The LLM will get a prompt from the user, of course, but the LLM will also get some retrieved information: information retrieved from our knowledge base, which here is the PDF. What we are doing in the chunking section is essentially bridging this gap. We have processed the PDF; how do we go from this PDF to retrieving the bits of information which are important? There are two key steps to this. The first is chunking, and the second is called embedding. We are going to look at embedding tomorrow, but today let's cover chunking. So the way it works is this: let me take a sample. The first thing I'm going to do is divide this PDF into chunks. And when I say chunk, a chunk can be, let's say, these are my chunks. This can be one type of chunking: imagine that in the whole PDF, every sentence is one chunk. Or you can have page-level chunking, where this entire page is one chunk, this entire page is another chunk, and so on. Now let's say you do some sort of chunking and you end up with 2,000 chunks: you have split the knowledge base, the document which you have, into 2,000 chunks. In the retrieved information, the only portion you are going to pass along is some of these chunks, maybe the chunks which are most closely related to the prompt. The top chunk, say, or the top three chunks most closely related to the prompt. You can select one chunk, two chunks, or three chunks; that is something you have to decide, but normally people select between 1 and 10 chunks. So let's say you select three chunks. These are the three chunks which will be passed in as the retrieved information.
information. Now you see the problem here is that or
Now you see the problem here is that or I should not call problem.
I should not call problem. Your the quality of your output is going
Your the quality of your output is going to completely and solely depend on your
to completely and solely depend on your retrieved information and your retrieved
retrieved information and your retrieved information is going to completely
information is going to completely defend depend on what type of chunks you
defend depend on what type of chunks you have. Because if you have granular
have. Because if you have granular chunking like sentences, this will be
chunking like sentences, this will be just one sentence. This will be second
just one sentence. This will be second sentence and this will be third
sentence and this will be third sentence. So you'll just pass three
sentence. So you'll just pass three sentences. But if you have broad level
sentences. But if you have broad level chunking like pages then each chunk will
chunking like pages then each chunk will be one page. So you'll be passing page
be one page. So you'll be passing page one, you'll be passing page two and
one, you'll be passing page two and you'll be passing page three.
you'll be passing page three. So
So imagine this as the brain of the LLM and
imagine this as the brain of the LLM and uh so this is the LLM and this is the
uh so this is the LLM and this is the data.
data. the retrieved information which passes
the retrieved information which passes through the LLM will be from a list of
through the LLM will be from a list of chunks and only a subset of these chunks
chunks and only a subset of these chunks will be passed to the LLM. So from the
will be passed to the LLM. So from the engineer's perspective it becomes
engineer's perspective it becomes extremely important to decide how are we
extremely important to decide how are we exactly going to do the chunking. There
exactly going to do the chunking. There are so many ways right the the sky is
are so many ways right the the sky is completely open that we can do anything.
completely open that we can do anything. So now let me ask all of you. Let's say
So now let me ask all of you. Let's say this is the PDF
this is the PDF U 1,28 pages PDF. How should we go about
U 1,28 pages PDF. How should we go about chunking?
chunking? What will you have as individual chunks?
What will you have as individual chunks? Heading wise. So Samrat is saying
Heading wise. So Samrat is saying heading wise, right? U so essentially I
heading wise, right? U so essentially I think what Samrat is saying that
think what Samrat is saying that wherever there are headings
wherever there are headings you make that as one chunk. So make this
you make that as one chunk. So make this as one chunk. So if carbohydrates is a
as one chunk. So if carbohydrates is a heading, make carbohydrate section as
heading, make carbohydrate section as one chunk. If lipids is a heading, make
one chunk. If lipids is a heading, make this section as one chunk. If proteins
this section as one chunk. If proteins is a heading, make this section as one
is a heading, make this section as one chunk. Um I think that's what Dishant
chunk. Um I think that's what Dishant means by sections.
means by sections. Aditya has an interesting suggestion.
Aditya has an interesting suggestion. What Aditya is saying is that let me not
What Aditya is saying is that let me not focus on the structure of the PDF. I I
focus on the structure of the PDF. I I will actually write down all of your
will actually write down all of your suggestions over here. So the first
suggestions over here. So the first suggestion is based on the structure
suggestion is based on the structure right. So if we do
right. So if we do based on headings
um what what else then the suggestion by Adita is with respect to similar topics
Adita is with respect to similar topics or semantics
then JP document structure. So here let me
me bucket this in this segment itself and
bucket this in this segment itself and let me call this as a document structure
let me call this as a document structure at the moment.
at the moment. If they are bit big divided into
If they are bit big divided into paragraph limited amount of word size
paragraph limited amount of word size has to be the same maximum number of
has to be the same maximum number of tokens LLM can handle. So let me do the
tokens LLM can handle. So let me do the third category as fixed.
third category as fixed. So when I say fixed maybe it's 10
So when I say fixed maybe it's 10 sentences as one token
sentences as one token or 10 words as one token
or 10 words as one token or one word as one token
or one word as one token whatever this is fixed size chunking. So
whatever this is fixed size chunking. So intuitively if this terminology of
intuitively if this terminology of chunking was not known to me
chunking was not known to me or if I had not studied the literature
or if I had not studied the literature of lit retrieval augmented generation I
of lit retrieval augmented generation I would have intuitively said that one
would have intuitively said that one chunk is one section because when I read
chunk is one section because when I read a piece of PDF my mind thinks in terms
a piece of PDF my mind thinks in terms of sections right so if a certain
of sections right so if a certain question is asked by the user ideally
question is asked by the user ideally you should retrieve a full section and
you should retrieve a full section and give it as the answer right I don't want
give it as the answer right I don't want to just retrieve few sentences.
to just retrieve few sentences. I I want to retrieve entire sections and
I I want to retrieve entire sections and pass it. That's why I think chunking
pass it. That's why I think chunking should be done section wise.
should be done section wise. That can be one example. There are
That can be one example. There are people who are mentioning recursive work
people who are mentioning recursive work plus overlap. For some people this might
plus overlap. For some people this might not be clear. So I'll come to that
not be clear. So I'll come to that eventually.
eventually. Um okay. So that's the intuition which
Um okay. So that's the intuition which comes to my mind.
comes to my mind. Now what we can do is that let's go
Now what we can do is that let's go through the five types of chunking which
through the five types of chunking which we are going to see. Um and then towards
we are going to see. Um and then towards the end we will also have an engineer
the end we will also have an engineer choice section on which chunking
choice section on which chunking strategy to use and then we will code
strategy to use and then we will code the different chunking strategies and
the different chunking strategies and actually see their similarities and
actually see their similarities and differences. So my hope is that after
differences. So my hope is that after this section all of you should
this section all of you should understand the trade-offs. So at the
understand the trade-offs. So at the start of the lecture I mentioned about
start of the lecture I mentioned about trade-offs right? There are a lot of
trade-offs right? There are a lot of trade-offs in different chunking
trade-offs in different chunking strategies. There is no one-sizefits-all
strategies. There is no one-sizefits-all approach and uh different chunking
approach and uh different chunking strategies definitely lead to different
strategies definitely lead to different results.
results. In fact, what we have done here is that
we have actually made a PDF. I'm trying to find that PDF right now. Just a
to find that PDF right now. Just a minute.
Yeah. So within our company, we have made this PDF of chunking strategies.
made this PDF of chunking strategies. I'll share this with all of you
I'll share this with all of you where this guide is especially meant for
where this guide is especially meant for what are the different type of chunking
what are the different type of chunking strategies and which chunking strategy
strategies and which chunking strategy to use at what time. This is one of the
to use at what time. This is one of the most important things to understand for
most important things to understand for engineers especially and my main purpose
engineers especially and my main purpose with this workshop is how to make
with this workshop is how to make engineering decisions like this. But to
engineering decisions like this. But to make engineering decisions like this
make engineering decisions like this first we have to understand what are the
first we have to understand what are the different chunking strategies right
different chunking strategies right u so let's let's start now before even
u so let's let's start now before even evaluating between different chunking
evaluating between different chunking strategies or coding all of you need to
strategies or coding all of you need to understand what is exactly done some
understand what is exactly done some chunking strategies are easy to
chunking strategies are easy to understand some are slightly more
understand some are slightly more detailed but each of them serve a
detailed but each of them serve a specific purpose so first let's go with
specific purpose so first let's go with fixed size chunking so in fixed size
fixed size chunking so in fixed size chunking what is actually done is that
chunking what is actually done is that let's say let's actually take a PDF
let's say let's actually take a PDF let's take this legal services agreement
let's take this legal services agreement let's say you are making a rack system
let's say you are making a rack system for a legal domain right and if you have
for a legal domain right and if you have a PDF which looks like this
a PDF which looks like this where there is responsibilities of law
where there is responsibilities of law firm and client whatever in fixed
firm and client whatever in fixed chunking strategies what you mention is
chunking strategies what you mention is that I will uh have every chunk to be of
that I will uh have every chunk to be of a fixed size so let's say my chunk
a fixed size so let's say my chunk is of uh
is of uh let's say 200 words. So all my chunks
let's say 200 words. So all my chunks are going to be 200 words. I'm not going
are going to be 200 words. I'm not going to look at anything else. My chunk one
to look at anything else. My chunk one is going to be 200 words. My chunk two
is going to be 200 words. My chunk two is going to be 200 words. That's it. And
is going to be 200 words. That's it. And I can also have a slight overlap between
I can also have a slight overlap between these chunks so as to make sure that
these chunks so as to make sure that some amount of context is retained.
some amount of context is retained. But can you tell me what's the drawback
But can you tell me what's the drawback with this approach? What's the
with this approach? What's the advantages and what's the disadvantages
advantages and what's the disadvantages with this approach according to you? So
with this approach according to you? So now here again try to think from first
now here again try to think from first principles right imagine you are making
principles right imagine you are making this rag system where you want to make a
this rag system where you want to make a chatbot where a customer asks something
chatbot where a customer asks something about an agreement and your chatbot
about an agreement and your chatbot should answer. Now the retrieved
should answer. Now the retrieved information will come in chunks. Why or
information will come in chunks. Why or why not should you go ahead with a fixed
why not should you go ahead with a fixed chunking strategy like this with each
chunking strategy like this with each chunk being of 200 words
chunk being of 200 words incomplete responses
incomplete responses uh context not called sentence cut in
uh context not called sentence cut in between
between lacks contextual overlap.
lacks contextual overlap. So yeah let's take a look at this
So yeah let's take a look at this example itself right where it is being
example itself right where it is being cut. So responsibilities of law firm and
cut. So responsibilities of law firm and client this should ideally be one full
client this should ideally be one full section right and I want this entire
section right and I want this entire thing to be passed into my retrieved
thing to be passed into my retrieved information but because of this chunking
information but because of this chunking what has happened is that let's say this
what has happened is that let's say this chunk has responsibilities of law firm
chunk has responsibilities of law firm and client right so when a user asks on
and client right so when a user asks on a chatbot what are the responsibilities
a chatbot what are the responsibilities of a law firm and client if this is the
of a law firm and client if this is the user asks this this chunk will be
user asks this this chunk will be retrieved
retrieved But this chunk actually does not have
But this chunk actually does not have anything related to it has some amount
anything related to it has some amount of context because we are retaining some
of context because we are retaining some overlap but most of it is related to
overlap but most of it is related to some other sections. So this chunk will
some other sections. So this chunk will not be retrieved which means that we are
not be retrieved which means that we are actually losing out on this this much
actually losing out on this this much amount of information which is
amount of information which is completely relevant to our current
completely relevant to our current section.
That's one major disadvantage of fixed-size chunking: chunks can be made in the middle of important paragraphs, and chunks can be made in the middle of sentences. A good question was asked: won't embeddings create a match for similar text? Embeddings will create a match. But what if your chunk boundary falls at a place where there is nothing related to the question being asked? Say the chunk is just the last two sentences of a paragraph, where the context of what comes before is lost. If your chunk unluckily starts at a point where the information from the section title is lost, then that part won't be retrieved. And here I'm only showing a small paragraph; if you have a huge paragraph related to a section and you randomly cut a chunk halfway through it, some of your information can be missing from the retrieved chunks.
Can the chunks be linked? The chunks are not linked here; when you say chunks linked, that actually leads to structural chunking, which will come later. In fixed-size chunking, this is the main issue. So then why would anyone do fixed-size chunking? Can you think of an application where people do fixed-size chunking?
So one lesson which all of us learned just now is that if your document has structure, like sections and subsections, never go with fixed-size chunking, because it might cut a section halfway. Fixed-size chunking is used where you want fast processing. Let's say you have millions and billions of documents, or hundreds of thousands of documents, and you want a quick strategy without too much overhead; if you want the processing to be quick, then you go ahead with fixed-size chunking, because it will just be very fast. For example, if you are collecting information from Reddit or from Twitter, most of the information will be disorganized: threads, comments, no clear structure, no clear subheadings, random messy information, but a huge amount of it. If you have random, messy, chaotic information in huge volume and you want to process it quickly, you can use fixed-size chunking with some overlap.
overlap. Now these are the advantages and
Now these are the advantages and disadvantages of fixed size chunking.
disadvantages of fixed size chunking. Quick fast processing is the advantage.
Quick fast processing is the advantage. The disadvantage is that it has semantic
The disadvantage is that it has semantic breaks and the context is lost.
breaks and the context is lost. Uh so the strategy is best used in
Uh so the strategy is best used in scenarios where documents are large and
scenarios where documents are large and numerous and a quick segmentation is
numerous and a quick segmentation is needed without requiring deep
needed without requiring deep understanding of the context.
understanding of the context. uh for instance if you are processing
uh for instance if you are processing millions of web pages for indexing and
millions of web pages for indexing and can tolerate some loss of coherence in
can tolerate some loss of coherence in chunks fixed size chunking is a viable
chunks fixed size chunking is a viable approach.
approach. Now remember that as the size of your
Now remember that as the size of your chunk increases your embedding model
chunk increases your embedding model size also needs to increase
size also needs to increase proportionately.
proportionately. So keep that in mind that's a trade-off
So keep that in mind that's a trade-off with larger chunks.
One other use may be in streaming or sequential processing. Yeah, correct, since it's easy to handle streams of text without worrying about sentence breaks. Agreed. Another example is a book like The Fountainhead. Yeah, sure; if you have books, or, let's say, if I go here, take a look at this book: it's a huge book which has no structure, no headings, no subheadings. For such text it might be a good idea to go ahead with fixed-size chunking, and if you have a thousand such books, then definitely go ahead with fixed-size chunking. So let's say you're doing a project on Project Gutenberg and your task is to ingest all the books and build some sort of RAG system; it might be better to go ahead with fixed-size chunking. Okay, that's the first strategy.
okay that's the first strategy The second strategy is what someone already
second strategy is what someone already mentioned in the chat. Now again here
mentioned in the chat. Now again here I'm taking the same example which I
I'm taking the same example which I showed you over here.
Now let's say you take the same book we saw from here. The main issue with fixed-size chunking is that although it's fast, it does not retain anything about semantics, right? It does not retain anything about meaning; there is no relationship in meaning between one chunk and another. Semantic chunking tries to solve this issue.
The way semantic chunking works is that first you have to define a level of organization. By level of organization, I mean whether it's at the sentence level. If I want sentence-level organization, I take the first sentence. So let's say you have chunk number one, and that's like a box. I take my first sentence and add it to the box. Then I take my second sentence and compare its embedding with the first one's. So sentence one is converted into a vector embedding, sentence two is converted into a vector embedding, and I check whether the similarity score between these two vector embeddings is greater than a threshold, let's say 0.8.
If it's greater than this threshold, I know that both of these sentences mean roughly the same thing, so I add the second sentence to the same box because it passes my similarity criterion. Then I go to the third sentence, embed it into a vector, and compare its cosine similarity with sentence number one. If it again passes the threshold, I add it to my chunk. I keep doing this as long as the sentences have good cosine similarity with my original sentence. The moment I encounter a sentence whose cosine similarity is less than the threshold, I stop this chunk. That's chunk one, done. Then I move on to chunk number two.
What this ensures is that every chunk contains semantically similar information. Let's say the initial section is all about a drama happening in a family: I want that chunk to run until the drama finishes. Then, whenever a certain question is asked, I only retrieve the chunk whose semantic meaning matches. That's where semantic chunking has an advantage over fixed-size chunking: it takes the meaning into account, so I know that every chunk is coherent in meaning.
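Here is a rough sketch of that sentence-level loop. It assumes the sentence-transformers library; the model name and the 0.8 threshold are illustrative choices, not fixed recommendations:

```python
# A minimal sketch of sentence-level semantic chunking (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; swap in your own

def semantic_chunks(sentences, threshold=0.8):
    if not sentences:
        return []
    embeddings = model.encode(sentences, convert_to_tensor=True)
    chunks, current = [], [0]              # each chunk is a list of sentence indices
    for i in range(1, len(sentences)):
        # compare the new sentence with the first sentence of the open chunk
        sim = util.cos_sim(embeddings[current[0]], embeddings[i]).item()
        if sim >= threshold:
            current.append(i)              # similar enough: same chunk
        else:
            chunks.append(current)         # close this chunk, start a new one
            current = [i]
    chunks.append(current)
    return [" ".join(sentences[i] for i in idxs) for idxs in chunks]
```

The threshold is the hyperparameter we will keep coming back to: set it too high and almost every sentence becomes its own chunk, too low and everything collapses into one chunk.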
Amit is asking what chunking strategy is used in NotebookLM. NotebookLM, I think, definitely uses chunking that takes semantics into account, so probably something similar to the semantic chunking we are looking at right now.
If the sentences have high similarity, isn't it better to avoid them? Good question, but you will never be sure why the similarity is high, so you might lose information that way. Two sentences can mean something similar but sit in different contexts, and you can still keep both. Say you're talking about forests: you could be talking about trees in the forest, or about taking a trip to the forest, and their vector embeddings may match, so you wouldn't want to drop one in favor of the other, right?
So semantic chunking's main advantage is, of course, that it maintains coherence, and it is used in settings where the integrity of ideas is very important. For example, say you are listening to a parliamentary debate, you have collected the transcripts, and you want to make a RAG system where you ask a question and identify what was discussed in that debate. Parliamentary debates are some of the most unstructured transcripts; they can get chaotic and messy, but there is a flow of ideas. Someone says something, someone else negates it. We usually don't know how long that negation goes on, and there is no clean split, but the ideas are there and they clearly belong in buckets. For that kind of transcript-based RAG system I would go with semantic chunking, because I would want to preserve the integrity of an idea within one chunk for as long as it is being discussed.
This is very similar to educational transcripts. Say you watch a video, like this same video, and you make a transcript out of it. I talk about four or five things in the video, but I have not added timestamps or anything else. How will you know the key things discussed in the video? The only way is to maintain the semantic integrity of the chunks. You cannot do fixed-size chunking here; you have to maintain semantic similarity. Then you will know that this section talks about byte pair encoding, this section talks about the size of language models, this section talks about emergent properties, and so on. Otherwise there is no way to know from the transcript. So there are a number of cases where keeping semantically similar content in one chunk plays to our advantage.
And again, there is no free lunch. The main drawback is that this kind of strategy is complex and takes a lot of computational power, because you have to convert every single sentence into an embedding, and that's not cheap. Another major issue is the hyperparameter: the threshold. With fixed-size chunking you at least have an idea of what 200 words means, but here the threshold is completely vague, and you have no clue how sensitive the results are to it. Another issue is inconsistent chunk sizes: some chunks might be very large, which can be a problem for our LLM context, and so on.
Let me see if there are any questions in the chat.
You took one sentence at a time, and then I lost how semantics is maintained; do you scan the entire document? Yeah. So basically, Samrat, it is done sentence-wise. You take sentence number one and add it to a chunk. You keep adding subsequent sentences to the same chunk as long as their cosine similarity with the first sentence is above a certain value. The moment you encounter a sentence whose cosine similarity is not higher than the threshold, from that point you start forming the second chunk, then the third, and so on. Like that, you sequentially go through your entire text and keep forming chunks.
Does semantic chunking require precomputing embeddings, or is it done at runtime? There are actually both options. Nowadays people have started computing embeddings at query time, so you can do it at runtime, but most RAG applications I have seen maintain precomputed embeddings.
What happens if the idea in chunk one comes up again somewhere later? That's a great question. Unfortunately, that then becomes a separate chunk, but if the later idea is close to the first one and you're retrieving four or five chunks, hopefully both of those chunks show up. Say you make a chunk around a certain idea and that idea reappears at the end of the document; if both are very similar, both chunks will be retrieved in the end.
Wouldn't it be a better strategy to check cosine similarity against all previous sentences? It would be, I agree, but the time also increases if you check the semantic similarity with every previous sentence; it's time-consuming. You also kind of rely on how cosine similarity behaves: it is the dot product of two vectors, so if a·b is high, a and b point in similar directions, and if b·c is high, b and c also have similar angles. So you can say that a and c will also be somewhat similar to each other.
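For reference, this is the cosine similarity computation we keep referring to. The three small vectors below are made-up numbers, just to illustrate the "a is similar to b, b is similar to c, so a is probably similar to c" intuition:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.9, 0.1])
b = np.array([0.9, 1.0, 0.2])
c = np.array([0.8, 1.0, 0.3])
print(cosine_similarity(a, b), cosine_similarity(b, c), cosine_similarity(a, c))
```

Keep in mind this transitivity is only a heuristic: it tends to hold for neighboring sentences, but it is not guaranteed by the math.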
Is it based on the assumption that the next line will be semantically similar to the previous one? Yeah, that is also true; it's the same intuition that was exploited in word2vec: neighbors usually carry similar meaning, because you would not normally have random, unrelated lines placed next to each other.
Samrat has asked, "So should we have structure?" Yeah, correct. The level of organization which you mentioned can also be at the paragraph level in semantic chunking. If your sentences are not varying too much in meaning, you can have one big paragraph as one chunk. But then you have to do structural chunking followed by semantic chunking, which is done; I'll come to that later. So that naturally brings us to... actually, first let me cover structural chunking. Structural chunking, according to me, is the most intuitive form of chunking, and it can be combined with semantic chunking as well.
Structural chunking is essentially this: say you are considering a shareholder letter. If you take a look at a shareholder letter, it is released quarterly with the same kind of sections. Structural chunking takes advantage of that and says: we are going to split the report exactly at these section boundaries. The first chunk is going to be the letter to shareholders. The second chunk is going to be the introduction. The third chunk is going to be the company overview. The fourth chunk is going to be the financial statements. The fifth chunk is going to be the notes to the financial statements. The sixth chunk is going to be the conclusion and outlook. That's it. It's extremely simple, right?
And believe it or not, in industrial problems structural chunking solves many issues, because if you are in the financial sector or the medical sector and you are looking at a very specific RAG application, it is very likely that the document structure stays the same across multiple documents. For example, if you're building a conversational therapist RAG chatbot, the therapist might be making notes after each session in a specific format: an introduction, the key things discussed in the session, key takeaways. As long as you know the structure of your documents, structural chunking is the most intuitive and the best thing you can do when you receive any problem. If the problem has no structure, if it's messy like what we saw earlier, then of course it will not work. But if you have hospital records, or stock price information in a specific tabular or otherwise structured format, you can always leverage that structure. The more you leverage the structure in your documents, the more grounded your retrieval augmented generation system is going to be, hands down, at all times. So the first strategy that comes naturally to me is to go to structure-level chunks.
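As a quick sketch, structural chunking for the shareholder letter could be as simple as splitting on the known section headings. The heading list below is an assumption based on the example we just walked through:

```python
import re

# Section headings we expect in every quarterly shareholder letter
# (illustrative list, matching the example above).
SECTIONS = [
    "Letter to Shareholders", "Introduction", "Company Overview",
    "Financial Statements", "Notes to the Financial Statements",
    "Conclusion and Outlook",
]

def structural_chunks(text):
    """Split the document at the known headings; each section becomes one chunk."""
    pattern = "(" + "|".join(re.escape(s) for s in SECTIONS) + ")"
    parts = re.split(pattern, text)
    # re.split with a capturing group alternates: [preamble, heading, body, heading, body, ...]
    return [
        {"section": parts[i], "text": parts[i + 1].strip()}
        for i in range(1, len(parts), 2)
    ]
```

The point is that the splitting rule comes entirely from the known document layout, not from any notion of meaning.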
But then, what are the issues with structural chunking? Can you think of any? In fact, when many of you saw this document, the first thing that intuitively came to mind was splitting based on sections and subsections; that is exactly structure-based chunking. What are the issues with it?
Yeah, the issue is that one chunk can become very large, because what if in one particular shareholder letter the introduction section is five times longer than in the others? Then the chunk size becomes very large, that chunk gets retrieved and added to the language model's context, and the context window of the language model becomes very large again. So the same problem we set out to solve, we are encountering all over again. The advantage of a structural approach is that it's very good for documents whose data comes in a structured format: sections, subsections, and so on. The drawback is that it can produce huge chunks, which can blow up the context length of the LLM, and that can again lead to more hallucinations.
How many of you actually know what metadata is? Why have I mentioned metadata here under structure-based chunking? Yes, data about data is metadata, essentially. If I know that a chunk belongs to a particular structure, then when I store that chunk I can also store its metadata: if I store an introduction chunk, I also record that it is an introduction chunk, and I might refer to that later. So later, if I want to collect all the introductions, I can use this metadata. Structure-based chunking has this added advantage: since you know which chunk corresponds to which structure, for example that this chunk corresponds to the company overview, you can store that as metadata and access it later downstream in your application if there is a need.
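In code, storing metadata alongside each chunk might look like the sketch below. The field names are an assumed convention here, not a fixed schema:

```python
# A sketch of chunks stored together with their metadata.
chunk_store = [
    {"text": "Dear shareholders, this quarter ...", "section": "Introduction",
     "doc_id": "Q3-letter", "chunk_id": 0},
    {"text": "Revenue grew compared to last year ...", "section": "Financial Statements",
     "doc_id": "Q3-letter", "chunk_id": 1},
]

def chunks_by_section(store, section):
    """Downstream, pull every chunk tagged with a given section name."""
    return [c for c in store if c["section"] == section]

print(chunks_by_section(chunk_store, "Introduction"))
```

Vector databases typically let you attach exactly this kind of metadata to each stored embedding and filter on it at query time.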
Now, Samrat had also mentioned: in semantic chunking, instead of having the level of organization at the sentence level, can we have it at the paragraph level? So one paragraph is added, then the semantic similarity with the next paragraph is compared. If you want to do that, you are essentially combining structural chunking with semantic chunking, because first you use structural chunking to find the paragraphs, and then you use semantic chunking on top of that. That's a combined approach, and if one type of chunking fails, it's very common to combine two chunking methods.
The main disadvantage of structural chunking which we just saw, that some chunks can be too large, is solved by recursive chunking.
Recursive chunking is a great strategy because it's kind of the best of both worlds: it exploits the structure of documents, but it also makes sure that chunk sizes remain consistent. How does it do that? Let's take a practical example. Say you are building a RAG chatbot that analyzes research papers. You know that if you are analyzing research papers belonging to a particular journal, the structure is going to remain the same, right? If I'm looking at Patterns, for example, they don't accept papers whose structure is too different. So I know each paper is going to have some kind of introduction section for sure, a summary section, a results section, then finally a conclusion and discussion section, and towards the end there will be references, and then it ends. You know this is the structure, but in some papers the results section can be much longer than in other papers. So you cannot simply use structural chunking, where each section becomes one chunk.
What recursive chunking does is this: first, I make chunks based on my sections, so the introduction section is one chunk, the results section is one chunk, and so on. Then I look at each chunk's size, having defined a maximum chunk size, say 500 tokens. If one of my chunks is greater than the maximum chunk size, I chunk it again.
How will I chunk it again? I have to define one more level of chunking. If the results section becomes too big, I say I'll chunk it at the paragraph level. So I chunk the results section into paragraphs, and each of those paragraphs becomes a separate chunk. Then I go to the paragraph level and check whether each paragraph's token count is greater than my maximum size, and if some paragraph is still too large, I chunk it further down to the next level, which is the sentence level, and then I again check whether the number of tokens is within the limit. If you think about it, it's like a Russian-doll approach: you take the large, section-level chunking; within that you have paragraph-level chunking, which you only apply when a chunk exceeds the maximum chunk size; and within paragraph-level chunking, if a chunk is still too big, you do your final level, which is sentence-level chunking, and here again you check whether the chunk size exceeds the maximum. Since we are using different levels of chunking nested one below the other, this method is called recursive chunking.
And the reason recursive chunking is the best of both worlds is that it preserves structure for sure, but it also makes sure that none of my chunks are too large, so it won't blow up my context size at all.
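Here is a rough sketch of that recursive idea: section-level chunks first, then anything over the limit is re-chunked at the paragraph level, then at the sentence level. The 500-token limit is the example value from above, and for simplicity the sketch assumes plain text with blank-line separators and counts whitespace-separated words instead of real tokenizer tokens:

```python
LEVELS = ["section", "paragraph", "sentence"]

def split_at(text, level):
    # Assumption: sections and paragraphs are separated by blank lines in plain text.
    if level == "section":
        pieces = text.split("\n\n\n")
    elif level == "paragraph":
        pieces = text.split("\n\n")
    else:
        pieces = text.split(". ")          # crude sentence split
    return [p for p in pieces if p.strip()]

def recursive_chunks(text, max_tokens=500, level=0):
    """Structural chunking first; any chunk over the limit is re-chunked one level down."""
    chunks = []
    for piece in split_at(text, LEVELS[level]):
        n_tokens = len(piece.split())      # word count as a stand-in for tokens
        if n_tokens <= max_tokens or level == len(LEVELS) - 1:
            chunks.append(piece)           # small enough, or nothing left to split
        else:
            chunks.extend(recursive_chunks(piece, max_tokens, level + 1))
    return chunks
```

Libraries such as LangChain ship a RecursiveCharacterTextSplitter in the same spirit, driven by a list of separators, but writing it out yourself makes the maximum-chunk-size logic obvious.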
Let's see... what if we combine structural with semantic? That we already discussed.
Is it possible to apply chunking strategies to images and videos in multimodal models? David, that's a great question. It is definitely possible. Think of images and videos in terms of tokens, just like I'm talking about tokens for text. Images and videos also have tokens; they just have different tokenization schemes, and the tokens are at the image level. So you can use similar strategies there, but they are a bit different from what we are currently covering.
Can you define chunk size? Yeah. So basically, one hyperparameter we have to define here is the maximum chunk size. Before I do recursive chunking I have to define a maximum chunk size myself; say that's going to be 500 tokens. At every stage I compare whether my chunks are greater than this size or not. If I do section-level chunking, I check the number of tokens in each section chunk. If it's greater than 500, I do the second level of recursive chunking, which is the paragraph level. Then again, if a piece is greater than 500, I do sentence-level chunking.
How is this different from fixed-size chunking? It's completely different, because in fixed-size chunking we are nowhere thinking about the structure. In fixed-size chunking I just start from the beginning, and if my fixed size is 50 tokens, I take the first 50 tokens as my first chunk, the next 50 as my second chunk, the next 50 as my third chunk, and so on. Here, what we are doing is structural chunking first: we break the document down into sections, and if no section has too many characters or tokens, our chunking stays at the section level. Only if one section is larger than the token limit do we break it down further.
Did everyone understand how this is different from fixed-size chunking? Recursive chunking is actually completely different from fixed-size chunking. There is really no similarity between them, because in recursive chunking we are not specifying the size we want; we are only specifying the maximum chunk size.
There is a question: how is the semantic link preserved in these chunks? It's not. In structure-based chunking and recursive chunking, the semantic notion is not maintained at all.
Are there libraries for this? Yes, definitely. Both LangChain and LangGraph provide utilities for recursive and structural chunking. But today we are going to implement these from scratch in Google Colab; we are going to implement all of these chunking strategies from scratch. So, Amit, it is not actually maintaining semantics, because nowhere does it know what is mentioned in the section, subsection, paragraph, or sentence.
To the people who asked about fixed-size versus recursive chunking, is it clear how they differ? I think Sankit asked, and Krishna also asked; if that is your main question, it means there is some conceptual gap.
conceptual gap. If there is no link, I might well as
If there is no link, I might well as look at the but there is a the the the
look at the but there is a the the the section is maintained, right?
section is maintained, right? So you understand the benefits of
So you understand the benefits of structural chunking
structural chunking the sections are maintained. So think of
the sections are maintained. So think of recursive chunking as a supererset of uh
recursive chunking as a supererset of uh structural chunking. Which means that if
structural chunking. Which means that if you understand the benefits of
you understand the benefits of structural chunking by default you
structural chunking by default you already understand the benefits of
already understand the benefits of recurs recursive chunking
recurs recursive chunking because it is structural chunking but it
because it is structural chunking but it is a bit more clever because it ensures
is a bit more clever because it ensures that each chunk is not greater than a
that each chunk is not greater than a particular size.
Yeah, maintaining section somehow preserves. Yeah, that way that's
preserves. Yeah, that way that's correct. If you maintain a section it
correct. If you maintain a section it kind of preserves what we are talking
kind of preserves what we are talking about. You can think of it that way.
about. You can think of it that way. Then sank.
So, on fixed-size chunking, let's again go back to this example. This example would never happen in recursive chunking, because in recursive chunking we would have defined that this whole thing should be one chunk; it's only constrained by the maximum chunk size. I can set the maximum chunk size slightly larger, so that on average it takes whole sections and subsections as single chunks.
Is recursive chunking clear to everyone? Out of all the chunking methods, this... yeah, it is structural chunking on steroids, if you think about it. That's clear, right?
That's clear, right? The last chunking strategy which to
The last chunking strategy which to explore is LLM based chunking. And this
explore is LLM based chunking. And this is that strategy where humans kind of
is that strategy where humans kind of gave up I think because
gave up I think because we are like let let me tell you what
we are like let let me tell you what happens in LLM chunking. Take a look at
happens in LLM chunking. Take a look at this conversation.
this conversation. This is a transcript from about a user
This is a transcript from about a user who is interacting with a chatbot about
who is interacting with a chatbot about this car.
it's this car actually. So Mahindra's car and uh the user is
So Mahindra's car and uh the user is kind of interacting with a bot.
kind of interacting with a bot. The context is that the user wants to
The context is that the user wants to book a test drive
book a test drive but the issue is that first the user
but the issue is that first the user asks about booking a test drive for this
asks about booking a test drive for this car. Okay. Uh
car. Okay. Uh then the user asks to compare
then the user asks to compare prices and uh compare the value for
prices and uh compare the value for money between the two brands. Then the
money between the two brands. Then the user says that let's go with the test
user says that let's go with the test drive for 3xo.
drive for 3xo. Then the user again switches to petrol
Then the user again switches to petrol or diesel option. Then the user again
or diesel option. Then the user again switches to where
switches to where uh could you share your specific
uh could you share your specific location. So user gives this location.
location. So user gives this location. So basically what I want to point out
So basically what I want to point out here is that this is a example of a
here is that this is a example of a conversation which is called as context
conversation which is called as context drift
because in one piece of conversation we are talking about first a test drive
are talking about first a test drive then we are talking about price
then we are talking about price comparison then we are talking about
comparison then we are talking about feature comparison then we are talking
feature comparison then we are talking about location for the test drive then
about location for the test drive then we are talking about petrol or diesel in
we are talking about petrol or diesel in one single conversation there are
one single conversation there are multiple context drifts which are
multiple context drifts which are happening. So if I ask the rag chatbot
happening. So if I ask the rag chatbot what happened in this conversation what
what happened in this conversation what did the user actually talk about
did the user actually talk about then you kind of need a system so that a
then you kind of need a system so that a chunk maintains context across drift
chunk maintains context across drift also.
also. So you need a system where definitely
So you need a system where definitely you need semantics to be maintained
you need semantics to be maintained here. You need semantics to be
here. You need semantics to be maintained and there is again no
maintained and there is again no structure in this conversation. So you
structure in this conversation. So you cannot three options are ruled out for
cannot three options are ruled out for you. You cannot do structural chunking,
you. You cannot do structural chunking, you cannot do recursive chunking and you
you cannot do recursive chunking and you cannot do fixed size chunking. You can
cannot do fixed size chunking. You can do semantic chunking for sure. But even
do semantic chunking for sure. But even in semantic chunking sometimes it might
in semantic chunking sometimes it might get difficult to analyze the entire flow
get difficult to analyze the entire flow of the conversation. Note notice where
of the conversation. Note notice where the drift is happening etc.
the drift is happening etc. So then the only way to do this is to
So then the only way to do this is to bring an LLM into the picture and then
bring an LLM into the picture and then you ask the LLM that start looking at
you ask the LLM that start looking at this entire context and give me those
this entire context and give me those points at which drift is happening and
points at which drift is happening and then break those points into chunks.
then break those points into chunks. So you give that in the prompt itself.
So you give that in the prompt itself. So nothing is defined a priority. You
So nothing is defined a priority. You tell the LLM that this is the cont
tell the LLM that this is the cont entire context which I have go through
entire context which I have go through this entire context and break it down
this entire context and break it down into chunks. how to break it down into
into chunks. how to break it down into chunks. Identify those points at which
chunks. Identify those points at which there is a context drift which is
there is a context drift which is happening uh and then do the chunking
happening uh and then do the chunking yourself.
yourself. So the reason I said humans gave up at
So the reason I said humans gave up at this point is because here we do not
this point is because here we do not really take structure into account. We
really take structure into account. We do not even specify too many things to
do not even specify too many things to constrain ourselves. We just tell LLM
constrain ourselves. We just tell LLM that you have to uh do this let's say.
that you have to uh do this let's say. So the advantage of course is high
So the advantage of course is high semantic accuracy and good for documents
semantic accuracy and good for documents with rapid context changes or
with rapid context changes or unstructured text really.
unstructured text really. So if you are handling voice
So if you are handling voice conversations as a chatbot where a
conversations as a chatbot where a person can ask multiple different things
person can ask multiple different things where context is drifting a lot you need
where context is drifting a lot you need to maintain semantic accuracy across
to maintain semantic accuracy across long sentences where LLM based junking
long sentences where LLM based junking can play a useful role. Disadvantage of
can play a useful role. Disadvantage of course computationally expensive context
course computationally expensive context window limitations because what if the
window limitations because what if the LLM determines the whole context is
LLM determines the whole context is important that can again lead to context
important that can again lead to context window limitations and the moment we
window limitations and the moment we have an LLM the output can be stoastic.
have an LLM the output can be stoastic. So if you pass this chatbot to a client
So if you pass this chatbot to a client for the same question you may get
for the same question you may get different outputs
different outputs since an LLM is determining at what
since an LLM is determining at what point you want to split into different
point you want to split into different chunks.
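A sketch of what LLM-based chunking could look like is below. It assumes the OpenAI Python client; the model name, the prompt wording, and the JSON output convention are my own illustrative assumptions, not a prescribed recipe:

```python
# A sketch of LLM-based chunking: ask a model to find context-drift points.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_chunks(conversation: str, model: str = "gpt-4o-mini"):
    prompt = (
        "Read the conversation below and identify the points where the topic "
        "(context) drifts. Return only a JSON array of strings, where each "
        "string is one contiguous piece of the conversation about a single topic.\n\n"
        + conversation
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces, but does not eliminate, the stochasticity discussed above
    )
    # In practice you would add error handling here: the model may not return valid JSON.
    return json.loads(response.choices[0].message.content)
```

Notice that every document or conversation now costs you LLM calls just to be chunked, which is exactly why this is usually reserved for small, high-value collections.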
Okay. So yeah, Sanjiv, you pointed out an issue with the LLM context window, which I covered over here in the disadvantages.
If we are trying to retrieve similar content about a topic from an archive of newspaper articles... so that's the RAG system, right? Basically, Anita, what you're saying is you want to build a RAG system from an archive of newspaper articles. Newspaper articles might not be organized into sections or subsections, so the best way to get started is either fixed-size chunking or semantic chunking, or, if you can do OCR, you can make sections based on font size. Here is what I mean: if you take a look at this newspaper article, you definitely need an OCR tool, but how would that tool know where the sections and subsections are? One way is to tell it: this is a business newspaper, and they usually use this font for headlines, so wherever this font appears, make it a section; wherever a slightly smaller font appears, make it a subsection; and so on. We are going to see that today, because if you take our own document, it's very difficult for a PDF tool to know where the sections and subsections are. So we are going to do a small trick when we code, where we identify where the sections and subsections are. If you can do that trick on your newspaper articles and figure out sections and subsections, you can do structural chunking for your RAG system. If you cannot figure that out, start with fixed-size chunking, and if that does not work, try semantic chunking.
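The font-size trick is roughly the following. This sketch assumes PyMuPDF (the fitz module) and a guessed 14-point cutoff for headings, which you would tune for your own documents:

```python
# A sketch of finding heading candidates by font size with PyMuPDF (pip install pymupdf).
import fitz  # PyMuPDF

def heading_candidates(pdf_path, min_heading_size=14.0):
    doc = fitz.open(pdf_path)
    headings = []
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    # spans rendered in a large font are likely section or subsection titles
                    if span["size"] >= min_heading_size and span["text"].strip():
                        headings.append((page.number, round(span["size"], 1), span["text"].strip()))
    return headings
```

Once you have the heading positions, splitting the extracted text at those positions gives you exactly the structural chunks discussed above.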
What do you mean by stochastic output? By stochastic output I mean that since you're using an LLM, it is by default a probabilistic model. So if you use the same LLM again the next day, or use a different LLM, you can get a different answer. By default, stochasticity is built in whenever you use an LLM.
Jade is asking whether there might be a relationship between embedding size and chunk size. Yeah, sure, that's another trade-off. Right now we are discussing the trade-offs of structure versus no structure, semantics versus no semantics. Another trade-off, which came up when we discussed the disadvantage of structural chunking, is that chunks that are too coarse can blow up the LLM's context window. And another issue is that if your chunks are too large, you will need larger embedding models, because if a chunk is too large you cannot, say, use the sentence-transformers model all-mpnet-base-v2, since its maximum input length is only 384 tokens.
So for larger chunks we have to use a model with, say, a 1k-token input size, and for paragraphs maybe a bigger one, or a VLM. Yeah, sure. Though if you use a VLM it's always going to be extremely computationally expensive; if an OCR tool can be used instead of a VLM, that might be faster. Many people have asked about the stochastic output: I just meant that since it's an LLM, every time you interact with it the output can be different.
How about documents which are related to each other? What chunking strategy would you use? Good question. If the documents are related to each other, then hopefully the chunks will also be related to each other, and when you retrieve, you can simply retrieve more chunks so that chunks from multiple documents come back.
An embedding model like Gecko has a fixed dimension of 768; how do we choose the embedding model? We are going to come to that: there is a separate section on embeddings, with an engineering-choices part there as well, but that will come tomorrow. Today we have to focus on chunking. So let me come to the engineer's choice section over here and share with all of you which chunking strategy to use at what time.
time. Okay. Um
Okay. Um so the five things which we discussed
so the five things which we discussed right the main thing is that if you have
right the main thing is that if you have data which is inorganized and messy and
data which is inorganized and messy and huge amount of data which has no no real
huge amount of data which has no no real structure no real shape go ahead with
structure no real shape go ahead with fixed size chunking and see if things
fixed size chunking and see if things are working. It's usually the simplest
are working. It's usually the simplest and the fastest method.
and the fastest method. Then if you're working with a client
Then if you're working with a client like medical sector, educational sector,
like medical sector, educational sector, healthcare sector who maintain records.
healthcare sector who maintain records. If records are maintained, usually there
If records are maintained, usually there will be some sort of a structure to the
will be some sort of a structure to the records. So you have to by default start
records. So you have to by default start with structure based chunking.
with structure based chunking. If you do structure based chunking and
If you do structure based chunking and if you see that some chunks are too
if you see that some chunks are too large, then you will have to split it
large, then you will have to split it recursively using recursive chunking.
recursively using recursive chunking. Now if you see that if if you are
Now if you see that if if you are transcribing if you are using
transcribing if you are using transcripts from a debate or from an
transcripts from a debate or from an educational video you want to make a
educational video you want to make a rack chatbot based on let's say the
rack chatbot based on let's say the build lm from scratch playlist but I
build lm from scratch playlist but I have not added timestamps over there you
have not added timestamps over there you will still need to make chunks where
will still need to make chunks where semantic integrity is maintained between
semantic integrity is maintained between ideas that's when you have to use
ideas that's when you have to use semantic chunking
semantic chunking and if everything fails if your the
and if everything fails if your the context which you are analyzing has
context which you are analyzing has sharp twists and turns.
sharp twists and turns. If there is drift in the user
If there is drift in the user interactions,
interactions, then you should do LLM chunking. But
then you should do LLM chunking. But this is extremely expensive. So, usually
this is extremely expensive. So, usually no one does it for large documents. You
no one does it for large documents. You can do it for uh a smaller collection of
can do it for uh a smaller collection of documents.
All of these things which I am mentioning right now, I have actually collected in the chunking strategies guidebook which I shared with all of you. Actually, let me add it to the drive folder right now — I shared that drive folder at the beginning of today's class. So I have added this PDF over here.
Yeah, this PDF, right? If you double-click on it, you will see a 20-page PDF which covers all the different types of chunking we have discussed, and it also contains a very detailed section on which chunking strategy to use at what point. Some of the examples shown there are real industry cases which we have implemented.
have implemented. And towards the end I have an additional
And towards the end I have an additional section which is uh I have not studied
section which is uh I have not studied these techniques in too much detail but
these techniques in too much detail but someone asked a question about
someone asked a question about multimodel aware chunking right. This is
multimodel aware chunking right. This is a topic of active research right now.
a topic of active research right now. Another topic of active research is
Another topic of active research is something called query directed chunking
something called query directed chunking or dynamic chunking where chunking is
or dynamic chunking where chunking is done on the fly. So instead of doing
done on the fly. So instead of doing chunking before and so the rack pipeline
chunking before and so the rack pipeline we're discussing is sequential right
we're discussing is sequential right chunking embedding retrieval instead of
chunking embedding retrieval instead of doing chunking like in a predefined way
doing chunking like in a predefined way you can do chunking on the fly.
you can do chunking on the fly. So whenever a question is asked you form
So whenever a question is asked you form a chunk at that time itself that's
a chunk at that time itself that's called query directed chunking. So if
called query directed chunking. So if anyone wants to get into chunking
anyone wants to get into chunking related research these are some very
related research these are some very good topics. Multimodel aware chunking
good topics. Multimodel aware chunking is one topic. Query directed chunking is
is one topic. Query directed chunking is another topic.
Um, there was a section which we can go through quickly, and this will again make sure the ideas are clear. These are some class questions to check which chunking strategy to use. First, say you are in the legal domain: you are building a system to answer questions about a country's laws, and the laws are divided into articles and sections. If you are building a chatbot, what chunking strategy will you use here?
Yeah — those of you answering structural or recursive, that is the correct answer. I would actually go ahead with structural first; if it leads to chunks that are too large, I would then go to recursive. I would encourage everyone to answer in the chat, because this tests your understanding.
Second is the financial domain. You have earnings call transcripts and want to do Q&A on them, and they may not have clear headings for each question. What will you do in this case, where you just have the transcripts of the calls and you want a Q&A chatbot?
Yeah. The correct answer here: I would go ahead with fixed-size chunking initially to see how it works, because it is fast, and if it produces answers which are not coherent, I would move to semantic chunking. So the correct answer to the first question is structural plus recursive, and the correct answer to the second question is semantic.
What about the third? You are processing patient electronic health records, where there are fields like chief complaint, history of present illness, lab results, assessment and plan, and so on. Which chunking strategy would you go with in this case?
Exactly — all of you are getting the correct answer now. You would definitely go ahead with structural. And the last example is again from the build-LLM playlist: if you have a set of lecture transcripts which you want to turn into a chatbot-based system, how will you do it? Semantic — or, if semantic does not work, maybe LLM-based chunking.
Good. So now I hope the chunking strategies are clear for everyone: what the strategies are and why you would use each one. The best way to evaluate a chunking strategy is, of course, based on the responses — only after the whole pipeline is done will you be able to evaluate the chunking strategy. In fact, besides the main production RAG code file which I have shared with you, I have two other code files, one using semantic chunking and one using structural chunking. In the current code file, which we will continue exploring tomorrow, we are going to use fixed-size chunking, but I have run the exact same pipeline with semantic chunking and with structural chunking as well, and when you reach the end of each of them you can clearly compare the differences and decide which chunking to use in your system. So even when you have picked a chunking strategy, only after running the whole RAG pipeline will you be in a position to compare which strategy is best.
But you do have to go through the full RAG pipeline. So it is similar to tuning: you can think of the chunking strategy as a hyperparameter. You should have some RAG evaluation framework and simply check the metrics in that framework. Later you can plot your RAG evaluation metrics against the chunking strategy — strategy number one, strategy number two, strategy number three. It might happen that on one evaluation metric one chunking strategy works better, while on another metric a different strategy does better, and so on. That is how you decide which chunking strategy to deploy in practice.
An ablation study plus intuition, I would say — because I am now going to take you through code where you can actually visualize the chunks. Before running the rest of the RAG pipeline you can look at the chunks, check their sizes, check the variance in those sizes, and then build an intuition for whether this will work for your RAG application or not. That intuition is important to develop, and that is what I am going to show you right now as we go through the code.
So in the code I am about to share, we are going to evaluate all the chunking strategies we have just seen. When I say evaluate, I mean we will look at the chunks that get formed, their sizes, and the size variance across chunks — that is one evaluation you can do even before running the entire RAG pipeline. Everything you have learned so far in theory, we are now going to put into practice by running this chunking strategies notebook. I am just going to share it with all of you.
Just take a look at this code. Before going through it, I am going to take a break of around 2 to 3 minutes, and then this will be the last section we cover today. Tomorrow we will look at embeddings, assembling the whole pipeline, and the remaining aspects; today, this is the last thing. We are already two and a half hours into the workshop. I do not know exactly how long this will take, but I will try to finish it in the next 30 minutes or so. So let me stop for a bit and come back in 3 to 4 minutes, at 10:40 or 10:41 a.m. IST.
Thanks to those of you who are still here — it is great that you are continuing to follow. We only have around 30 to 35 more minutes to go, and after that we will see the rest in tomorrow's lecture. So I will be back shortly.
Okay. So now let us continue with the last part of today's lecture. There is a question in the chat about how to decide the chunk size if we go with the combination of structural plus recursive chunking for a legal corpus. That is going to be based on the document itself. The way I would do it: look at the document and try to form a rough notion of the average section size, or the median section length, and use that as the maximum chunk size. That way, sections which are not too big are retained as one chunk each, while the sections which are outliers in terms of length get split into separate chunks.
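As a minimal sketch of that heuristic — assuming you already have the document split into a list of section strings called `section_texts`, which is not something the workshop notebook defines — it could look like this:

```python
import statistics

# Hypothetical: `section_texts` is a list of section strings obtained from a
# first structural split (e.g. by headings). Use the median section length
# as the maximum chunk size for the recursive splitter.
section_lengths = [len(section) for section in section_texts]
max_chunk_size = int(statistics.median(section_lengths))

print("median section length (chars):", max_chunk_size)
# Sections at or below this size stay whole; the outliers that exceed it
# get split further by the recursive chunker.
```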
Okay. I hope all of you have access to this. Let us run the first code cell, and the second code cell as well. What we are doing here, up to step number two, is what we already saw in the previous code: we load the document and collect a list of dictionaries, where each dictionary corresponds to a page — the page number, the character count on that page, the word count, the sentence count, and the token count. This will again take some time to run, but we are on a T4 GPU at the moment.
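For reference, the per-page statistics step might look roughly like the sketch below. I am assuming PyMuPDF (`fitz`) as the PDF reader and a characters-divided-by-four token heuristic; the notebook's actual loader and filename may differ.

```python
import fitz                      # PyMuPDF -- an assumption; the notebook may use a different loader
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK versions

doc = fitz.open("human_nutrition.pdf")   # hypothetical filename

pages = []
for page_number, page in enumerate(doc):
    text = page.get_text().strip()
    pages.append({
        "page_number": page_number,
        "char_count": len(text),
        "word_count": len(text.split()),
        "sentence_count": len(sent_tokenize(text)),
        "token_count": len(text) // 4,   # rough heuristic: ~4 characters per token
        "text": text,
    })

print(pages[0])
```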
How many of you have finished running step zero, step one, and step two? Can you mention it in the chat? Done — right, okay. I started running it a bit late, so it is taking some time on my side, but meanwhile let us go through step number three. This is the main step: we are testing five chunking strategies on our dataset. What is our dataset? It is this roughly 1,200-page PDF, and we are trying out the different chunking strategies on it.
The first strategy puts 500 characters into each chunk — that is it, that is fixed-size chunking. The simple way to do it is to go through the words on a page and keep adding words until you hit the fixed size; once you would exceed that size, you close the chunk. This is the simplest one to explain and also to execute. Basically, you pass the text of a page into this chunk_text function, and the function just walks through that text and keeps forming groups of roughly 500 characters, one group per chunk. So you can run this right now. You can see that the total number of chunks is 3,321, with each chunk being about 500 characters, for our current PDF.
Now, what I want to do after this point is visualize what each chunk looks like, so I have written a simple function which just prints out a chunk. If you run it, you will see the chunks. In this function I am pulling five chunks sampled at random from throughout the dataset. Take a look at the number of characters in each one: in some places the character count is 290. Can someone tell me why that is happening? It should ideally be 500, right? Why is the number of characters here 290?
The reason this happens is that we are looking at each page separately. So on a page we may form one chunk, then a second chunk, and the last chunk of that page only has whatever text is left. That is why some chunks end up with fewer characters than others. But if you look at these five chunks, you will see that mostly the character count is around 490 to 500, and this is what every chunk will look like.
Now, remember what I told you at the start of the lecture: something will be retrieved and passed to the LLM. That retrieved context — the relevant context, as I called it — is exactly these chunks. Each chunk is a piece of relevant context, and along with the prompt, the LLM will also have access to these chunks. It is similar to an open-book exam: while taking an open-book test, if you think something is important, you highlight it and use that information to answer. Highlighting a piece of text while solving an open-book exam is exactly analogous to these chunks I have shown here.
We have defined that we are looking page-wise, and the place where that is defined is here: the for loop goes over pages, we extract the text of one page, and we pass that page's text to the chunk_text function, so it forms the chunks for that page. Then we move on to the next page, and so on.
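A minimal sketch of that page-wise fixed-size chunker is below. The notebook's own chunk_text implementation may differ in details (for example, how it counts joining spaces); this is only the idea, and it assumes the `pages` list of per-page dictionaries built earlier.

```python
import random

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Greedily pack words into chunks of roughly `chunk_size` characters."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + len(word) + 1 > chunk_size:
            chunks.append(current.strip())
            current = ""
        current += word + " "
    if current.strip():
        chunks.append(current.strip())   # the last chunk on a page can be shorter
    return chunks

# Page-wise usage, mirroring the loop described above.
all_chunks = []
for page in pages:                       # `pages` comes from the step-two stats cell
    all_chunks.extend(chunk_text(page["text"], chunk_size=500))

print("total chunks:", len(all_chunks))

# Spot-check a few random chunks and their sizes before going any further.
for chunk in random.sample(all_chunks, 5):
    print(f"[{len(chunk)} chars] {chunk[:120]}...")
```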
Okay, so that is the first method of chunking. The second method we are going to see is semantic chunking. Here you will need to install the sentence-transformers package, and the sentence transformer we are going to use is all-MiniLM-L6-v2 — let me show you, this is the one. You can see the number of downloads: around 90 million downloads per month, I think. It is a very popular sentence transformer model. And remember what we saw for semantic chunking: you take a sentence, convert it into a vector, and keep adding the following sentences as long as the cosine similarity stays above a certain threshold; once it drops below the threshold, you start a new chunk.
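As a rough sketch of that idea — not necessarily the exact function used in the notebook, which may, for example, compare adjacent sentences instead of the whole running chunk — per-page semantic chunking could look like this:

```python
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk_text(text: str, threshold: float = 0.75) -> list[str]:
    """Keep appending sentences to the current chunk while they stay
    semantically similar to it; start a new chunk when similarity drops
    below `threshold`."""
    sentences = sent_tokenize(text)
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sentence in sentences[1:]:
        similarity = util.cos_sim(
            model.encode(" ".join(current)),
            model.encode(sentence),
        ).item()
        if similarity >= threshold:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks

semantic_chunks = []
for page in pages:                        # same per-page loop as before
    semantic_chunks.extend(semantic_chunk_text(page["text"], threshold=0.75))
print("total semantic chunks:", len(semantic_chunks))
```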
Where is my file? Yeah — so this is my sentence transformer, and in semantic chunking we ultimately call the same kind of per-page function: we look at every page and call semantic_chunk_text on it. Inside semantic_chunk_text we first break the page into sentences — for that we use the sent_tokenize function from the NLTK library — and then we keep appending sentences to the current chunk as long as the similarity score stays above the similarity threshold; if the similarity score falls below it, we break out and start a new chunk. I hope the logic in the code is clear to everyone — it is exactly what we discussed earlier: you maintain the current chunk and keep adding new sentences to it while the similarity stays high.
There is a question about how to know which model to use. We will see tomorrow how to choose embedding models, but as a baseline rule of thumb, if you are new to this field, all-MiniLM-L6-v2 — which I will also share in the chat — and all-mpnet-base-v2, the embedding model we saw earlier (the v2 version, not the plain base), are good starting points.
How about the semantic chunking option from LangChain? Samrat, as you have noted, many frameworks provide semantic chunking options which do exactly what I am showing right now. The reason I am showing it from scratch is that afterwards you will find all of those framework APIs extremely simple and easy to work with — but under the hood they are doing the same thing we are doing here.
So you can run this now, and it will first need to download the model. I am going through this line quickly, but you should appreciate the power of the open-source community here: if these models had not been made open source and uploaded to Hugging Face, it would have been very difficult for us to use them this simply. It looks like just one line of code, but do not take it lightly — a whole open-source revolution happened before we reached the point where we can freely use Hugging Face and run these code files in Google Colab.
I am using a similarity threshold of 0.75 here; feel free to use a different threshold. And here you can see we are now doing semantic chunking for every page.
Oh, I think my audio was cut off for a moment. Yeah, okay. Now, here we can see the total number of semantic chunks that were formed. Can you already see one thing? Why do you think the number of chunks in this case is so much larger than the number we got with fixed-size chunking? Try to think about why there are so many.
Yeah — the reason is that it does not really make sense to use semantic chunking on this particular document. First of all, the chapters are small, and the chapters are not that related to each other. Even within a chapter, different paragraphs cover different concepts, so the semantic notion changes very quickly. That is why lots of very small pieces become individual chunks, which is the issue here. If you print out the chunks from semantic chunking, you will see the wide variance: one chunk is just a line or two, another has around 200 characters, another has 78 characters. That is one disadvantage of semantic chunking — there is a lot of diversity in chunk size — but the semantic coherence within a chunk is maintained. So with semantic chunking we got around 12,000 total chunks.
Why is there no continuity of text in this chunk? I have only printed part of each chunk at random, not the whole thing, but in some cases the reason there is no continuity is that after a given sentence something completely new starts which is not related to the previous sentence at all. Let us actually check this sentence in the document.
So this is that sentence — "observing the connection between beverage and longevity, Dr. Machik began his research on beneficial..." — right? Now, if you pass this sentence and the next one to the sentence transformer model, according to the model the semantic similarity between them is below 0.75. So if you want longer chunks, one thing you can do is reduce the threshold — you can bring it down to 0.6, or even 0.55. But I hope you can now see exactly what we are doing here: we are relying on this sentence transformer model, so whenever the similarity score is lower than the threshold, those sentences do not get grouped into one chunk.
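To make that size-variance point concrete, you can compare the chunk-length statistics of the two strategies side by side — assuming the `all_chunks` and `semantic_chunks` lists from the sketches above:

```python
import statistics

def size_report(name: str, chunks: list[str]) -> None:
    """Print simple chunk-length statistics for a list of chunk strings."""
    lengths = [len(chunk) for chunk in chunks]
    print(
        f"{name}: {len(chunks)} chunks | "
        f"mean {statistics.mean(lengths):.0f} chars | "
        f"stdev {statistics.pstdev(lengths):.0f} | "
        f"min {min(lengths)} | max {max(lengths)}"
    )

size_report("fixed-size (500 chars)", all_chunks)
size_report("semantic (threshold 0.75)", semantic_chunks)
# Expect the fixed-size chunks to cluster tightly around 500 characters,
# and the semantic chunks to show a much wider spread.
```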
Now let me come to the third section, which is recursive chunking. The reason this Colab file has become long is that I have also added detailed examples, so that later, when you refer back to it — even if you forget what was covered in the Miro board notes — it will be easy to revise everything from one notebook. So if you are going for an interview later, just open this one notebook and run it; and for the chunking strategies which are a bit harder to grasp, such as semantic and recursive chunking, I have deliberately added these text sections.
Now, the way we are going to do recursive chunking is very simple. First, as I mentioned, we define a maximum chunk size, which here is 1,000 characters.
Which Google Colab notebook? Do you mean access to this notebook — I did not quite understand the question. How do we access the Miro notes? Oh, this notebook — I have shared the link with everyone in the chat, and it is the same as the chunking strategies notebook I shared with you; the copy I am working in right now is identical except that the API keys are removed.
The Miro notes I will share with all registered people — I will just send out the link. So, the way we do recursive chunking here: first we take a piece of text, and, as I told you, in recursive chunking nothing should be larger than the maximum chunk size, which is 1,000 here. We check whether the piece is larger than 1,000 characters. If it is smaller, it simply becomes a chunk and that is fine. If it is larger, we first split by double newlines; the second level of recursion splits by single newlines; and the final level splits by sentences. If you go back to the recursive example we saw earlier — section, paragraph, sentence — the code file has the same three levels: double newline, single newline, and sentence. So first we chunk at the top level; if a piece is still larger than the maximum size, we split it at the second level; and if it is still too large, we fall back to sentence-level splitting. That is why you will see three levels in the code: splitting on double newlines is the top level, the second recursion splits on single newlines, and the final recursion splits into sentences.
So you can run this now. The number of recursive chunks is 2,434 — definitely fewer than with semantic chunking. And here you can see the individual chunks; the character counts are around 700, 780, 800, and they will always be below 1,000 because that is the maximum chunk size we defined.
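A minimal sketch of such a recursive splitter is below — assuming the per-page `pages` list from earlier and the three separator levels just described. The notebook's version may differ (for instance, a production splitter would usually also merge small neighbouring pieces back together up to the limit).

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

MAX_CHUNK_SIZE = 1000
# Recursion levels: paragraph blocks (double newline) -> lines -> sentences.
SEPARATORS = ["\n\n", "\n", None]          # None means "split into sentences"

def recursive_chunk(text: str, level: int = 0) -> list[str]:
    """Keep a piece whole if it fits; otherwise split it at the current
    level and recurse on any piece that is still too large."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= MAX_CHUNK_SIZE or level >= len(SEPARATORS):
        return [text]
    separator = SEPARATORS[level]
    pieces = sent_tokenize(text) if separator is None else text.split(separator)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_chunk(piece, level + 1))
    return chunks

recursive_chunks = []
for page in pages:
    recursive_chunks.extend(recursive_chunk(page["text"]))
print("total recursive chunks:", len(recursive_chunks))
```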
Oh — actually, Amit, this is a completely different notebook. The LLM RAG notebook we started with is a separate notebook; we have only covered up to this point in that one, and tomorrow we will continue from there. This chunking strategies notebook is a new one, which I think someone has shared in the chat again.
Okay, so now we have done recursive chunking. Next I want to do structure-based chunking, and here we are going to use a small trick. This again comes down to the engineer's decision. I have looked at these chapters, and let me ask you a question. I want to do a simple structural chunking where every chapter becomes one chunk — at the start we asked ChatGPT how many tokens the book has, around 1.4 million, so spread across the chapters each chapter is maybe 20,000 to 30,000 tokens. So I want one chapter to be one chunk. Now, if you are given this problem — one chapter, one chunk — how will you do it? Let me go to the table of contents.
Yeah, this is the table of contents, and I want each chapter here — say this one — to be a single chunk. Then "Food Quality", the next chapter, is the second chunk. That seems like the right thing to do here, because each chapter is fairly small anyway, so why not make one chapter one chunk? So "Lifestyles and Nutrition" will be one chunk, "Achieving a Healthy Diet" will be another. How will you do this? How will you tell, from the PDF, where a chapter starts?
Someone suggests chapter marker headings, someone suggests OCR. Let us say we do not want to use OCR for this, Samrat — and even with OCR, how exactly would you do it? One suggestion many people have given is to exploit the font style, because only the chapter title uses that font, and that way you know which page you are on. Someone else suggested going through the table of contents, but that is a bit tricky. Using the page-number information from the index — that is also interesting. Splitting the PDF into pages — we are already splitting it into pages, but how will you split it by chapters? We do not want one page per chunk; we want one entire chapter per chunk. Using the average number of pages per chapter? Then again, you would lose some information.
So I am going to use a simple trick. I noticed that wherever a new chapter starts, there is a common line of text that always appears — the University of Hawaii human nutrition program header. Let us check whether it really appears at every chapter, right?
Yeah — see, whenever a new chapter starts, this text is always there; you will see it in every chapter. So what I am going to do is a regex method. Regex is simple string matching: I simply match wherever I see that string, take the text just before that header line, and start the chunk from there. That is it. This trick works here — but how do you go about it in general? That is really what I am teaching you: there is no "in general" when it comes to industrial problems, because one industry stores its data one way and another industry stores it another way. You need to develop the ability to use your intuition and figure out what to do in each situation. In class I can teach you five chunking strategies, but on a real problem you may get a document where you need document-specific tricks like this one.
And we have used many such tricks across our industry projects. There is usually something special about each dataset: some data has a tabular structure on certain pages, some has a concluding-points section at the end of every chapter, and you can exploit that chapter-specific piece of text. All the other suggestions students gave in the chat are valid too: going to the table of contents and reading the page numbers from there is definitely doable, and so is using the font style of the titles. In some PDF documents sections are digitally marked as sections, and a tool like Docling can extract those sections and subsections by itself. But if it is an OCR-type situation where nothing is digitally marked as a section or subsection, you may need tricks like these.
Right. So now, for this structure-based chunking step, you will see we do a simple regex: we import re, the regular-expressions module, and we search for every place where this University of Hawaii text appears. Then we take the text that comes just before the header line — which is the chapter title at that point — and we start a full chunk from there. That is it.
This code looks a bit long, but the logic we are implementing is simple: wherever this text appears, that is the start of a new chunk, and wherever it appears again, that is the end of the previous chunk and the start of the next one.
"If the nutrition PDF was huge, how was ChatGPT able to state the token count with exact numbers? Ideally it should not even have processed it." That is a good question — the number of characters here is 1,485,282, so I do not think ChatGPT's figure was exact. It may be based on an approximation, and I doubt those numbers were actually correct.
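Before running it, here is a minimal sketch of that regex-based structural split. The exact marker string is an assumption — inspect your own document and adjust it — and the notebook additionally pulls the chapter title from just before the marker, which this sketch skips for brevity.

```python
import re

# The repeated header line that marks the start of every chapter in this
# particular PDF. Treat the exact string as a placeholder to adapt.
CHAPTER_MARKER = re.escape("University of Hawai")   # matches "Hawaii" / "Hawai'i" variants

full_text = "\n".join(page["text"] for page in pages)

# Every occurrence of the marker starts a new chapter; the text between two
# consecutive markers becomes one structural chunk.
starts = [match.start() for match in re.finditer(CHAPTER_MARKER, full_text)]
structural_chunks = []
for i, start in enumerate(starts):
    end = starts[i + 1] if i + 1 < len(starts) else len(full_text)
    structural_chunks.append(full_text[start:end].strip())

print("total structural chunks:", len(structural_chunks))
```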
Okay, so now let us run this and look at the number of chunks. The number of chunks is 171, because that is how many chapter sections there are, and we can inspect each chunk to check whether we did this correctly. See, here is chunk 42, which should be "Lifestyles and Nutrition". Let us verify that — go to the table of contents. Where is it... "Lifestyles and Nutrition", I think. Yeah — see, this is "Lifestyles and Nutrition", and when I click on it, yes, this is our first chunk.
"In addition to nutrition, health is affected by genetics, the environment..." So we have retrieved one chunk, and it is that chapter. Let's check another one. Another chunk seems to be the chapter named Phytochemicals. Yeah, this is also a chapter, and it starts with "Phytochemicals, or chemicals in plants that may provide some health benefit." Right? So in fact, every single chunk, if you look, is one chapter, which we have now very neatly pulled apart. And one thing you'll immediately notice is that the number of tokens in each chapter is different, because of course the number of pages in each chapter is different.
The number of chunks is also much smaller: we only have 171 chunks here, but the size variance across chunks is very large. Still, with the simple trick we used here, we were able to do structure-based chunking. When you are faced with an industry problem, this is exactly the kind of thing you will need to do for your own data, because your data might be different and the trick we implemented right now is not scalable; your trick might be something else entirely.
'Because you're looking at the start and end of "University of Hawaii", wouldn't you be missing the topic headers?' Yes, we will be missing them. We can add those manually later.
'What about images and tables in this PDF? Are we handling them?' There are some tables in this PDF, but they are treated as images here, and we are only dealing with text for now. I'll show you tomorrow how to deal with multimodal data and how to store it as embeddings.
'This code will not work for other PDFs. How about making things general with multiple PDFs?' With multiple PDFs, Samarat, here is the thing: if you have a PDF, a tool like Docling will automatically identify sections and subsections within it, so you don't need something as complex as what I did just now. The reason I showed you this code is that you can also do custom structural chunking, which is very much needed in industry. For a simple PDF project, Docling can even identify tables, headings, sections, subsections, and so on.
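For reference, a minimal sketch of that Docling route could look like the following; the import path and method names reflect the docling package as I understand it, so treat them as assumptions and check the current documentation, and the file path is a placeholder.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("human-nutrition-text.pdf")  # placeholder path

# Export the parsed document as Markdown; headings, sections, and tables that
# Docling detects become Markdown structure, which you can then split on to get
# structure-based chunks without hand-written regexes.
markdown = result.document.export_to_markdown()
print(markdown[:500])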
'We are only outputting some tokens.' Yes, we are printing just a sample of tokens from each chunk. One chunk is a full chapter in this case, so I'm only showing the starting portion of each chunk.
The last thing we have to do is LLM-based chunking. If you scroll down in the Google Colab notebook I have shared with all of you, you'll see the OpenAI API key, which you will need to enter from your side, because here we are going to ask an LLM to create chunk boundaries for us. If you check the prompt, we are telling the LLM to analyze the following text and identify the best point to split it into two semantically coherent parts. That's it. Here we humans are essentially offloading everything to the LLM and hoping it takes care of it.
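A rough sketch of that call is below. It is not the exact notebook code: the model name is a placeholder, the prompt is paraphrased from what I described above, and it assumes the openai client library is installed and OPENAI_API_KEY is set.

from openai import OpenAI

client = OpenAI()

SPLIT_PROMPT = (
    "Analyze the following text and identify the best point to split it into "
    "two semantically coherent parts. Reply with only the first few words of "
    "the second part.\n\nTEXT:\n{passage}"
)

def llm_split_point(passage: str, model: str = "gpt-4o-mini") -> str:
    # One API call per passage: this is what makes LLM-based chunking slow and costly.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SPLIT_PROMPT.format(passage=passage)}],
    )
    return response.choices[0].message.content.strip()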
I ran this just now, and this portion will take some time because we are processing 128 pages. It is actually going to take quite a while, so I'll pause it for now; it might take around 9 to 10 minutes.
Let's see. Meanwhile, if there are any questions in the chat: yes, we should remove the 'University of Hawaii' line itself, that's a good point, and we should take the title into account. But those things can be done manually.
There is also a question about how we deal with multiple documents. There is something called cross-document chunking, but the simplest thing to do is this: if you have multiple documents related to a topic, just add all of them into the knowledge base and run the same pipeline we are running right now over every document.
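A minimal sketch of that "one knowledge base, same pipeline" idea is below. The parse_pdf and chunk_text arguments stand in for whatever parsing and chunking steps you already have; they are placeholders, not functions from a specific library.

def build_knowledge_base(pdf_paths: list[str], parse_pdf, chunk_text) -> list[dict]:
    knowledge_base = []
    for path in pdf_paths:
        text = parse_pdf(path)            # your existing PDF-to-text step
        for chunk in chunk_text(text):    # your existing chunking step
            # Keep the source file as metadata so answers can cite the right document.
            knowledge_base.append({"source": path, "text": chunk})
    return knowledge_base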
'Do we have open-source LLMs for this?' We do. In fact, tomorrow, if you look at the first code pipeline I shared with you, we are going to run it end to end, and there we are going to use an open-source LLM running on our own GPU. Overall I want to show you closed-source as well as open-source LLMs. For a closed-source LLM, the OpenAI approach we are using right now is fine; for an open-source LLM things get slightly more difficult, but you end up with a local RAG pipeline. I hope all of you can see why a local RAG pipeline is useful: some companies don't want their data sent out through an API call at all; they want their data stored privately.
'Does LLM chunking help reconcile logical differences when multiple sources are used?' No, not really. If you do semantic chunking, some coherency is maintained across different chunks, but structural chunking and fixed-size chunking don't take semantic context into account at all.
Just a quick note: the API key can also be stored in the Colab secrets, which is much better practice, but for the sake of simplicity I have just added it here.
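For completeness, here is a small sketch of the secrets approach in Google Colab; it assumes you have saved the key under the Secrets tab with the name OPENAI_API_KEY, which is just a convention, not something the notebook requires.

import os
from google.colab import userdata

# Read the key from Colab's secret storage instead of pasting it into the notebook.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")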
'How do we mask PII here?' That again comes under the named-entity-recognition part, right? By PII I think you mean personally identifiable information. If there are documents where you need to mask PII, it's a good question; someone even asked a named-entity-recognition question at the start of this lecture. The way to do it is again to impose structure on the document in a clever way: you can use another LLM to detect the places where PII appears, store those spans, and then work only with the information relevant to them. One tool that does exactly this is called LangExtract. How many of you are aware of this tool?
This tool was recently released by Google, and it does exactly what you're asking: it looks for specific things in a piece of text and extracts only that information using large language models. We are testing it on an industrial application right now. It has some issues, but it's an amazing tool: you can extract structured information even from unstructured text documents, with an LLM doing the work. So if you want PII, you can pass that as the target to LangExtract, retrieve only the relevant strings, and then mask them very easily. Once you have retrieved that information, masking does not take much time.
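The sketch below shows the same detect-then-mask idea, but deliberately without LangExtract's own API (check its documentation for that); it reuses the OpenAI client we already have in this notebook, and the model name and prompt are placeholders.

import json
from openai import OpenAI

client = OpenAI()

PII_PROMPT = (
    "Extract every piece of personally identifiable information (names, "
    "addresses, dates of birth, phone numbers, ID numbers) from the text "
    "below. Return a JSON list of the exact strings found, nothing else.\n\n{text}"
)

def mask_pii(text: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PII_PROMPT.format(text=text)}],
    )
    try:
        pii_strings = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return text  # if the model ignores the format, return the text unmasked
    masked = text
    for value in pii_strings:
        masked = masked.replace(value, "[MASKED]")
    return masked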
And this is not just about legal files. For any application, say you want to build an Aadhaar RAG system: you need the important information about a person, their name, date of birth, address, and that can be scattered throughout the document. How do you retrieve only the portion that matters? There are natural language processing models developed specifically for named entity recognition; you can certainly use those, but now you can also use generative models. People who have been attending the live lectures from day one know the trade-off with generative models: you cannot use them indiscriminately for every task. In the LLM evolutionary tree, the grey branch is the generative-model side, usually more expensive, while the red branch is the representation models, which can handle many of these tasks at a much lower cost and with far fewer parameters. So always keep this trade-off in mind and don't reach for a large language model by default.
For named entity recognition you can even do something much simpler. I'm sure Hugging Face has a huge number of models for this; for example, this one is a BERT-based named entity recognition model, fully open source, with around 2 million downloads per month, and it looks highly robust.
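As a sketch of that route: the snippet below uses the transformers pipeline API with a popular open-source BERT NER checkpoint. The model id dslim/bert-base-NER is my assumption about which model is being shown on screen; any token-classification NER model on the Hub works the same way.

from transformers import pipeline

# Aggregate sub-word tokens into whole entities (PER, LOC, ORG, MISC).
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "John Smith lives in Honolulu and studies at the University of Hawaii."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))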
'Which is the best tool to extract information from handwritten purchase orders?' I can't recommend one directly, Amit, because as we have discussed, it's the engineer's choice, but hopefully I've given you enough tools to make that choice yourself. For handwritten purchase orders you will of course need an OCR tool; if the data is complex you can use Docling, though we would need to look at the data; and you can even use LangExtract, which I just showed you.
Okay, let's see how many of you are running this LLM-based chunking and how many are not.
'Can you give a brief meaning of NER?' NER is named entity recognition. A named entity can be a person's name, a person's address, and so on; if you need to identify such entities in a document, that task is called named entity recognition. It is for when you want to extract specific entities, which could be personally identifiable information or anything else.
Okay, I can see that many people have finished running this; I'm also almost done here. You can see that in this document, whenever I mention LLM-based chunking, I mention the trade-off of computational expense. You must already have noticed that all the previous chunking methods ran in a fraction of a second, but the moment we use a generative model with trillions of parameters, it takes a huge amount of time. Right now, since we are doing it on Google Colab, that's fine, but for an industrial project this time can be extremely prohibitive for the client in terms of cost.
Remember, every call we make is charged. So now we have 2,360 chunks here, and we can print out these chunks to see what they look like. Again, since it's LLM-based chunking, we are at the mercy of the language model to decide where to make the splits. But this seems to be much better than semantic chunking. Why? Because in semantic chunking we used a very simple model, one with roughly a thousand to a hundred thousand times fewer parameters than the GPT model we used for LLM-based chunking. So of course, being a much larger model, it does a better job: it produced 2,360 chunks, which is much better than the 12,000 chunks we saw with semantic chunking.
Someone asked about the LLM evolutionary tree: if you search for it, there is a GitHub repository that was maintained until a certain point in time, and that's where you'll find the graph.
Anyway, the five chunking strategies are now done, and here is where we can actually evaluate them. This is going to be a simple statistical evaluation, but you can see the average chunk size (here I'm counting the number of words in a chunk), the number of chunks, and the size variance across chunks. For example, one strategy here has an average chunk size of 62 words and 3,321 chunks.
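A minimal sketch of how those summary numbers can be computed is below, assuming each strategy's output is simply a list of chunk strings; the variable names in the commented usage are placeholders for the lists built earlier in the notebook.

import statistics

def chunk_stats(chunks: list[str]) -> dict:
    sizes = [len(chunk.split()) for chunk in chunks]  # chunk size measured in words
    return {
        "num_chunks": len(chunks),
        "avg_chunk_size": statistics.mean(sizes),
        "size_variance": statistics.pvariance(sizes),
    }

# for name, chunks in {"fixed": fixed_chunks, "recursive": recursive_chunks,
#                      "semantic": semantic_chunks, "structural": structural_chunks,
#                      "llm": llm_chunks}.items():
#     print(name, chunk_stats(chunks))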
So just take a look at the different numbers here. The first thing you clearly observe is that the number of chunks in semantic chunking is huge, much higher than in all the other types, while the number of chunks in structural chunking is much smaller, because we are taking each chapter as one chunk, if you remember; every chapter is one chunk, which is why there are only 171 of them. In semantic chunking we used a transformer model, the open-source one from Hugging Face, so it made very granular chunks. Then look at the size variance: fixed chunking is extremely compact, and semantic chunking is also quite compact, because there is no single long stretch of text over which the semantics is retained. Structure-based chunking has a huge size variance, because there is a lot of difference between chapter lengths. LLM-based chunking, if you look, always gives a moderate, balanced result, which might suggest it's a good option to go with, but again, it's extremely expensive.
So in this particular case, say you are building a nutrition chatbot: we have not implemented the RAG pipeline yet, but we already know which methods not to use, right? We should probably not use the LLM-based method because it takes a lot of time; we should probably not use fixed chunking because the ideal thing to do here is to break by chapters; and semantic chunking is also not very good because the number of chunks it gives us is very large. So we might do structural chunking, or fixed chunking with, say, more sentences grouped together. These are the kinds of insights you can get from chunking evaluation at this stage. And yes, we should definitely benchmark time as well.
Recursive chunking, again, is a good trade-off across all of these metrics if you check, and that's always the case: its run time is good, its size variance is low, the number of chunks is moderate, and the average chunk size is also moderate. As I said, it's the best of both worlds: it does not produce overly large chunks, and at the same time it respects the structure.
Finally, you can visualize what we just saw. For chunk size, structure-based chunking of course has the largest chunks and semantic chunking the smallest. For the number of chunks it reverses: semantic chunking has the largest number of chunks, structure-based chunking the smallest, and recursive and LLM-based are in between. And the chunk size variance for structure-based chunking is way higher than for the rest. You can also do a box plot of the chunk sizes; for structural chunking you can really see the variance.
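A sketch of that box plot is below; it assumes you collect each strategy's chunks (as lists of strings) into one dictionary, and the variable names in the commented usage are placeholders.

import matplotlib.pyplot as plt

def plot_chunk_sizes(strategy_chunks: dict[str, list[str]]) -> None:
    # Per-chunk word counts for each chunking strategy, drawn as one box per strategy.
    sizes = {name: [len(c.split()) for c in chunks]
             for name, chunks in strategy_chunks.items()}
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.boxplot(list(sizes.values()), labels=list(sizes.keys()))
    ax.set_ylabel("Chunk size (words)")
    ax.set_title("Chunk size distribution per chunking strategy")
    plt.show()

# plot_chunk_sizes({"fixed": fixed_chunks, "recursive": recursive_chunks,
#                   "semantic": semantic_chunks, "structural": structural_chunks,
#                   "llm": llm_chunks})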
So the final takeaways: structure-based chunking produces the largest chunks, fewer in number but with very high variance; it is best for capturing entire sections, but less balanced for downstream models because of that variance. Semantic chunking produces very small chunks and the highest number of chunks; it preserves fine-grained context but risks over-fragmentation, meaning the chunks get so small that no single chunk captures anything meaningful. Fixed-size chunking produces consistent, moderate chunks with low variance, but it ignores semantic boundaries. Recursive and LLM-based chunking, as always, are the balanced approaches. But I actually made one mistake here: I should have benchmarked time as well. If I had, LLM-based chunking would definitely be out of the picture; it is two orders of magnitude slower than the other chunking methods.
Yeah. So this brings us to the end of day one of this workshop. Let me do a quick summary of what we have covered. I again want to thank all of you for staying until the end; I had originally planned this to be a one-and-a-half-hour session, but it actually became a three-and-a-half-hour session. We have covered file parsing and chunking so far, and we still have to cover embeddings, which we'll do tomorrow, then evaluation, and then I want to show you how to build the production-level RAG pipeline. Thanks a lot, everyone, for staying for three and a half hours; I think many of you actually did. If you liked today's lecture, definitely attend tomorrow, because we'll pick up directly from where we left off: the chunking part is finished, and we'll start with embeddings. We still have embeddings, retrieval, generation, and finally the production-level system remaining. Thanks, everyone. I look forward to seeing you tomorrow. All right.