This workshop introduces Retrieval Augmented Generation (RAG) by building a practical, from-scratch pipeline. It emphasizes understanding the underlying engineering trade-offs and complexities beyond introductory tutorials, aiming to equip participants with the knowledge to make informed decisions in real-world RAG implementations.
Okay, so let's get started with today's workshop. I'm really excited for this because I have planned this workshop for quite some time, and this is the first time I'm actually conducting it live. We have two sets of participants today: one group is already attending our live classes, and the other is attending just this workshop specifically.
I'm still admitting a few participants.
Yeah. And regarding the lecture recordings: I'm recording both today's lecture and tomorrow's, and we'll share them with all of you as soon as the recordings are done, so no need to worry about receiving the recording after the lecture. For the people enrolled in the live classes, I'll upload it to the dashboard as always, and for the others, I'll send it through email.
So the reason I thought of conducting this workshop is that the way I think about retrieval augmented generation, or RAG, has changed a lot over the last two years.
There is a question: do we need to code along with you? Yes, definitely you'll need to code. So I would highly recommend not attending this workshop on a phone, because it's just not a good experience. We'll be coding everything from scratch, and we'll be doing everything on Google Colab.
So let me get started with the lecture objectives. And when I say lecture, we'll actually have two lectures within this workshop. I'll first tell you what we will try to accomplish, and then we'll start going through every single thing in detail. So I'll tell you my RAG journey. RAG stands for retrieval augmented generation. No need to be scared by this name; it looks a bit complex, but we'll see what all of these words actually mean. Before that, I'll tell you about
my experience with RAG. There are some questions in the chat: what do you mean by students attending live classes? There is a live batch going on in the hands-on LLM series, and this workshop falls in the middle of that course. I will cover agentic RAG, but not in this workshop; that comes in subsequent lectures of the live classes.
Okay. So if you take a look at RAG tutorials, you'll see that a number of short tutorials pop up. There are some tutorials which are 10 to 15 minutes long, and some which are just 5 minutes; there are actually RAG tutorials which teach you how to build a chatbot in 5 minutes. Then there are these 20-minute tutorials, 25-minute tutorials. And when you watch them, you feel that, okay, this is simple, I have understood what retrieval augmented generation is. But that's actually not the case. Only when I started solving industrial problems did I realize that the whole pipeline is far more complicated than what is shown in these introductory videos. There are several things which no one ever talks about. For example, chunking. Chunking is very briefly mentioned in introductory videos, but no one codes through chunking and actually teaches engineers which chunking strategy to use at what time. If you don't know these terminologies, don't worry; I'm going to cover every single aspect in detail.
Then second is file parsing. In most of these tutorials, it's already assumed that you have the file, but in fact that's one of the most important steps, and frankly quite challenging. Another neglected aspect is evaluation, which I'm calling evals, and which in industrial settings has become critical: okay, you build a RAG pipeline and you submit it to the client or plug it into your internal workflow, but is it working or not? How are you continuously monitoring whether your RAG pipeline is delivering good results? And more importantly, there is the question of embeddings. In all of these short tutorials, they use vector databases or vector stores without ever asking why we need vector stores at all. Can we just do embeddings in PyTorch? And what are vector stores?
We are going to see all of this today. In fact, at several points in this tutorial, I'm going to have a section called engineer's choice, which I specifically curated based on my own industrial experience. Unlike all of these tutorials, I don't want to tell you "go ahead and use this, go ahead and use that." Instead, I'll make you aware of the trade-offs, and when I say trade-offs, I mean how you should select the tool for your particular use case. My goal is that after this workshop, when you face these trade-offs in industry or wherever you implement this, you are in a position to decide what the best tool is for your case. I'm going to show you the different trade-offs I encounter in our industrial problems.
And then we are going to assemble a whole RAG pipeline from scratch. When I say from scratch, I mean we are not going to use a library like LangChain or LangGraph today or tomorrow, because all of that will seem very simple to you after going through this workshop; we are going to code everything from the ground up. While doing that, we'll see the different engineering choices you need to make, and I'll also show you the different packages and libraries which are emerging and which are useful in industrial settings. So look at this workshop not as a toy series but as an industrial-level workshop. When you go to industry, things are not black and white; they are mostly gray. There is usually no single right solution, but the engineer who stands out is the one who can figure out the best solution for the given problem. That's what I want to teach you. So if you have questions on any aspect, ask me. My goal is for you to understand the nuts and bolts of RAG in detail, so that it is not just a terminology where you think, okay, RAG is easy, I can cover it in 10 minutes. After this, all of you will be able to build chatbots, and hopefully you will be able to understand the trade-offs when we build pipelines.
So what's our end goal? Our end goal after this workshop is to build an application such as this: a RAG-based nutritional chatbot built entirely from scratch. There are also two types of RAG systems: one which directly provides the answer, and one which, along with the answer, also provides citations. So we are also going to look at how to provide references and citations, and what it means when it says a 56% match or a 55% match. We will not spend too much time on the back-end and front-end coding; we are going to do that through Lovable. So all of us might end up with different-looking websites at the end of this workshop, but that will be the fun of it, right? We'll share the websites which all of us have obtained.
Yeah, and the lecture will be such that I will explain many aspects through a whiteboard, and then there are several code files which I have designed. All of these code files I'll share with you at specific intervals within this workshop, and all of them will be on Google Colab. Tomorrow we are going to use some external tools; the only ones we'll need are Supabase and Lovable. How many of you have heard about Supabase, by the way, or used it before? It's fine if you have not heard of this tool; I'm going to show you what it is, because it's used a lot in production-level settings these days. So Supabase is one tool we'll need, and the second is Lovable. For everything else, I believe that even with just the T4 GPU which is provided for free through Google Colab, you'll be able to follow along in this workshop.
As the guiding principle of most of this lecture, the prompt engineering rules which we saw in some of our previous lectures are going to be important. So let me just introduce the seven key elements of writing an effective prompt. The rule is called PICFATD, which basically means that in a prompt you have to define many things instead of just writing something quick. The first is the persona, that is, the identity the model should take on; then you have the instruction; then the context; then the format; then the audience; the tone; and finally the data. These are the seven key elements of an ideal prompt, and we are going to use them when designing RAG pipelines. In fact, the base of everything which is to follow, such as RAG, and later agentic workflows and MCP, is a good prompt. So please keep these seven things in mind when writing an effective prompt.
Don't just write something quick. And I'll stress that today when we are building this chatbot project: it is extremely important that you spend time writing the prompt. Think about five years into the future: if English is going to be the new programming language, then prompt engineering is going to matter more than ever.
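To make this concrete, here is a minimal sketch of how the seven PICFATD elements might be assembled into one prompt string. The helper name, field wording, and example values are my own illustration, not a fixed template from the workshop.

```python
# Hypothetical helper that assembles the seven PICFATD elements
# (persona, instruction, context, format, audience, tone, data)
# into a single prompt string.
def build_prompt(persona, instruction, context, fmt, audience, tone, data):
    return "\n".join([
        f"Persona: {persona}",
        f"Instruction: {instruction}",
        f"Context: {context}",
        f"Format: {fmt}",
        f"Audience: {audience}",
        f"Tone: {tone}",
        f"Data: {data}",
    ])

prompt = build_prompt(
    persona="You are a certified nutritionist.",
    instruction="Answer the user's question using only the supplied data.",
    context="The user is chatting with a nutrition chatbot.",
    fmt="Respond in 2-3 short sentences.",
    audience="General public with no medical background.",
    tone="Friendly and factual.",
    data="Relevant excerpt from the human-nutrition PDF goes here.",
)
print(prompt)
```

The point is simply that each of the seven elements gets an explicit slot, rather than everything being mashed into one quick sentence.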
Then I'll take some questions in the chat, but before that, let me tell you the philosophy of this workshop. The way I have designed it, it will cover three aspects. First, foundations, which according to me is the most important: you should know the nuts and bolts of the entire RAG pipeline and be able to make engineering decisions when the time comes. Second, practicals: at various places I'm going to give you practical insights regarding which chunking strategy to use, which embedding strategy to use, and how to deploy the RAG project. And finally, I'm going to leave you with research questions, or research directions: after this workshop, I'm going to show you the open research problems in this area, which you can immediately start working on once these live sessions are done.
Let me take questions in the chat. Prashant has asked: this might be a question for later, but should everyone move from plain vanilla RAG to agentic RAG? In a production setup we are only seeing 55% accuracy with standard RAG. That's a good question, Prashant. I'll tell you my experience from industry. So far we have done around 16 industrial projects; out of those, 10 have been RAG-based, and in those we have been able to satisfy the customer with a pure RAG pipeline. And when I say vanilla RAG, I don't mean just a simple "upload a PDF, query the PDF, give it to the LLM." In the pipeline we designed for the customer, we did not use agents, but we used many modern techniques which ground the responses, and I'm going to share that knowledge today as well. But vanilla RAG works for problems which are not too complex, in my opinion. There are many chatbot requirements in industry, and all of those can be solved with vanilla RAG. And not just chatbots: there are level-two requirements, where a company basically wants to build a code generator based on their docs, and that can also be solved by RAG.
Agentic RAG plays a very crucial role when you want access to external tools or when you want to do something complex. Let's say a company wants to build its own deep research agent; that is a difficult thing, and traditional RAG alone won't be enough there. But at least in my experience over the last year, and this is one of the main reasons I thought of running this workshop, RAG is still very relevant, and vanilla RAG solves many level-one and level-two company problems: chatbot generation, code generation based on what they have, and so on.
Then another question in the chat: when will the lecture notes and recordings be uploaded? The lecture notes and the recordings I'll share after each lecture is done. So after the first lecture, I'll send each participant an email with the link to the whiteboard notes and the link to the recording.
What is agentic RAG? I'll explain that in detail later, but in a RAG pipeline you have access to embeddings, so think of the embedding store as a tool. If you start thinking of the embedding store as a tool, then it suddenly becomes an agentic pipeline, where along with all the other tools, the agent also has access to the embedding store, or the vector store.
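One way to picture this "retrieval as a tool" idea: if the retrieval step is wrapped as just another callable, it sits in the same registry as any external tool. This is a toy sketch with entirely hypothetical names; a real agent would let the LLM choose the tool and would call a real embedding store.

```python
# Sketch: the vector/embedding store exposed as one tool among many.
# Both functions are placeholders standing in for real integrations.
def search_docs(query: str) -> str:
    # stand-in for embedding-store retrieval
    return f"top chunks for: {query}"

def get_weather(city: str) -> str:
    # stand-in for some external tool
    return f"weather in {city}"

TOOLS = {"search_docs": search_docs, "get_weather": get_weather}

# The "agent" (normally an LLM deciding from the user's request)
# picks a tool by name; here we hard-code the choice to show the flow.
chosen, arg = "search_docs", "protein requirements"
result = TOOLS[chosen](arg)
print(result)  # top chunks for: protein requirements
```

Nothing about the retrieval code itself changes; what changes is that the agent decides when to call it, alongside every other tool.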
Would you discuss query transformations? Yes, I will discuss that towards the end of this workshop.
Would plain RAG help in application modernization? Yeah, definitely. One common application which I would like to share with all of you is ITSM tools. How many of you are aware of information technology service management? If you look at India at least, there is a whole middle layer of companies which operate in the ITSM space. Let's say you are on Razorpay and you make a payment through Razorpay: what happens on their back end? How is the payment stored and processed? That's essentially information technology service management. If you book a movie ticket on BookMyShow, what happens on the BookMyShow server? How are they managing the different clients which are booking? That's usually handled through an ITSM company. These companies provide dashboards to players like Zomato and BookMyShow, players which need IT infrastructure. Those dashboards have been based on legacy systems and traditional pipelines for a very long time, and now they want to integrate chatbots within those dashboards. For these types of integrations, RAG systems will still play a very crucial and important role, because they usually have a fixed body of data.
LlamaIndex? Yeah, LlamaIndex and LangChain and LangGraph can all implement RAG pipelines very easily. If I were to make a tutorial on RAG using LlamaIndex, that would probably be a 25-30 minute tutorial. But my main aim here is to build such a strong foundation that after this, any tool will seem very simple to you, whether it's LlamaIndex, LangGraph, or LangChain.
So let's get started now. I've used this terminology, RAG, many times so far, and those of you who don't know it or have not heard of it, do not worry; I'm going to motivate it in a lot of detail.
For the purpose of this workshop, imagine that we are in the nutrition domain. The document we are going to consider is this 1,200-page document on human nutrition, and I'm going to share the Drive link right now with all of you. We will see what these different things are; we don't need them right now. For now, all you need to do, while I'm showing this PDF, is download it from the Drive link which I have just shared in the chat, so that you can refer to the PDF along with me as I go along.
Now I want to ask all of you a question. Imagine that you are working in industry, you are on the engineering team, and you are in a meeting with a client. The client has started a nutrition startup, and they want to spread awareness about nutrition globally; for that, they want to make a chatbot. And they want a chatbot which looks something like this: a customer will come, log in, and ask some questions, and the answer which is generated has to be very specific and very grounded. I'll use this term grounded a lot. What does grounded mean? Whenever someone says grounded, you should ask: grounded with respect to what? This startup wants its answers grounded with respect to its encyclopedia of knowledge, which for now is basically this book. It's a 1,200-page PDF about human nutrition, 2020 edition, and it covers a huge number of topics, from basic concepts in nutrition to the human body to water and electrolytes; it covers every single thing about nutrition, and they want their answers grounded in it.
Now, this same example I'm taking of human nutrition you can translate to other domains as well. If you want to make a chatbot for customer service, there will be a manual of customer questions and what the ideal answers should be.
If you are making a chatbot for ITSM, there will be a manual of tickets which customers usually raise and a sample of the solutions. Now my question to all of you: let's say you are sitting in that meeting as an engineer, and this client comes to you with this request of making a chatbot. Forget about RAG or this terminology of retrieval augmented generation. Let's think from first principles: how exactly will you build this? That's the goal, right? The goal is to build a nutritional chatbot.
But what's the key requirement which I mentioned? It should be grounded: grounded in factual knowledge based on the book which I just shared with all of you. How will you do that? Answers are coming in from the chat: add the document content somehow as a prompt; use the PDF and pass it to the LLM. Okay. So what Madusan has mentioned, that is already the RAG pipeline. I'm asking you to think from first principles: forget all of your knowledge, and let's say the only thing you have is ChatGPT, or access to any LLM for that matter. Let's say you have this, and that's all. More answers: add the PDF in the context of the LLM; instruct it to answer with information found in the PDF; we will load the data into ChatGPT.
Okay, so the simplest thing which many people are suggesting is aligned with the data portion of the prompt. I showed you the seven elements of a prompt, and there is this data portion, which is where you usually feed the data and where we usually ask the question. So many people are saying: okay, this seems like a simple enough task, why not just do that? There is also an answer from Prashant about keyword-based search. Keyword-based search, okay, that can be done, but you want to use a modern approach. So you propose to the client: hey, this seems like an easy thing to do, we just make a front end. And that front end looks something like this: this is the human query, this is the answer; this is the human query, this is the answer. The human query I'm denoting by HQ.
So you start thinking from the front end. Then you think: whenever a human query is asked, you pass it directly to an LLM like ChatGPT, and along with it you also pass the PDF. You make this API call to the LLM, and the answer you then show in the front end. Then the user asks another query; you again make an API call to the LLM, you again pass the entire PDF in the context of the LLM, and you get the answer. That's what would have naturally come to my mind if I were thinking from first principles and did not know anything about retrieval augmented generation. But what are the issues with this approach?
Can you try to think, as the engineer who goes back and tries to implement this, what the issues with this approach will be? Amit is saying high cost; Samarat is saying too many tokens, context length. So let's actually see this, and I encourage all of you to try it in practice: go to ChatGPT. I went to ChatGPT right now, I put in this exact same PDF, and I asked: what is the number of tokens in this document?
What is the number of tokens in this document? Does it fit your context window? What is a context window? The context window is the number of tokens which a language model can look at at one time before producing an answer. Think of it like this: imagine you are being bombarded with information. Someone tells you about one topic, then the lecture goes on for 2 hours, 3 hours, 4 hours, 5 hours. At some point you will start losing information. The context window is the maximum amount of information you can fit in at one time while still producing coherent answers. Whenever an LLM like GPT is designed, the context window is fixed. So if you put in this document and ask whether it fits the context window, the token count comes out far larger than the context window of ChatGPT, and I'm using GPT-5 here; its context window is around 128K. So here we see that the entire document does not fit into memory at once.
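As a rough back-of-the-envelope check you can run yourself, here is a sketch that estimates a document's token count using the common heuristic of roughly four characters per English token. The ratio and the per-page character count are assumptions for illustration; a real tokenizer (for example tiktoken) would give the exact count.

```python
# Rough token estimate: ~4 characters per token is a common
# rule of thumb for English text (an assumption, not exact).
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)

def fits_context(text: str, context_window: int = 128_000) -> bool:
    return estimate_tokens(text) <= context_window

# A 1,200-page book at an assumed ~3,000 characters per page:
book_text = "x" * (1200 * 3000)       # stand-in for the real PDF text
print(estimate_tokens(book_text))     # ~900,000 estimated tokens
print(fits_context(book_text))        # False: far exceeds a 128K window
```

Even with generous rounding, a book of this size lands several times over a 128K-token window, which is exactly the problem described above.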
And what will happen if the entire document does not fit into memory? If the human asks a question related to, say, this chapter on nutritional issues, and only the tokens up to page 700 fit into the context, then the relevant context is lost, the LLM will not be able to answer correctly, and the answers will be wrong. And then what will the LLM do? At this point, the LLM might start to answer from its own pre-trained knowledge. The LLM effectively says: this document does not fit in my context window and I don't see the relevant text in my context, so I'll use my own pre-training data; I don't need to rely on any document. When an LLM becomes overconfident like that and starts answering from its own corpus, that leads to one of the major problems retrieval augmented generation was built to address; RAG did not fully solve it, but it was a good step in that direction. So if you pass the entire PDF at once, it might exceed the context window of the language model, and that might lead to hallucinations.
What is the solution to this problem? The solution came with a paper released in 2021, and the solution is retrieval augmented generation. You can definitely read through the paper, but the idea of retrieval augmented generation is very similar to an example which you all know. Let's say you have been given this text on human nutrition and you have an exam, but it's an open book exam. I hope all of you know what an open book exam is: you can put the book in front of you, and you have access to all of this material, the entire book, during the exam. So you are sitting in that lecture hall and you see a question related to, let's say, proteins. How will you answer this question at that point? Can all of you try to think about it, sitting in that open book exam, having been asked that question?
Yeah: go to the index, find the topic, find it in the chapters. So what all of you will probably do is look at this word, proteins, then go through the PDF from the start. You will maybe look at the index or the table of contents; if it's not there in the table of contents, you will go through all the pages and try to find the page where this particular information shows up. Then you will highlight that information and use that knowledge. The question which was asked might not be completely related to that passage, but you will use that information from the book. Plus, another key component which of course you need is your own mind. Your own mind already has some information, because you might have studied for this exam. On top of that, you get some information exactly based on the book's contents, and then you produce the answer.
Now, this entire pipeline is very similar to what retrieval augmented generation is: the retrieval part is fetching the relevant passage from the book, and the generation part is producing the answer from your own mind. If you were not fetching context from this book and this whole retrieval step were not there, that's just the generation part. But now you have augmented the generation part with some retrieval from the document. That's where the term retrieval augmented generation actually comes from.
There is a question in the chat: do you plan to share your screen? It's visible, right? Okay, I guess it was frozen for some time whenever I went to the prompt engineering book. Yeah, now I'm back to my main screen.
So we retrieved context from the document, and we also generated an answer from our own mind. That's retrieval augmented generation. How does it translate to the startup app we discussed? The mind here is the LLM with its own pre-trained knowledge, and instead of passing the entire document to the LLM, we pass only the context which is relevant; and instead of the word pass, a fancier word is retrieve: we only retrieve the context from the PDF which is relevant. So instead of the earlier pipeline we saw, what if we make a different pipeline, something like this? We still have our front end, with the human question and the answer. When the human asks a question, it will again go to the LLM; that is fine. But the LLM will also somehow get only that piece of context which is relevant. And now that's the retrieval part: this relevant context is passed to the LLM.
Do you see the problem this will solve? We started out with the context problem. Now we don't have to pass the entire PDF into the context; we only pass the relevant bits of information. What are the relevant bits? The same bits which, as a student, we highlighted when doing the open book exam. That relevant bit of information is passed to the LLM, so the context window problem is solved. The natural consequence is that the LLM will now produce answers which are more factual and more grounded in reality, based on the exact document which the client has shared with me. Now I can be sure the answers will be specifically tailored. So when I ask a question here, and you see the answer being printed on the screen, you will also see citations.
Yeah. So these citations refer to the portion of the document the generated answer is based on. This piece comes directly from the document itself, on page 592; this one comes directly from the document on page 53. So you are retrieving relevant pieces from the document from various places; it does not need to be from one place. You are passing them into the context of the LLM, and then you are generating the answer.
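The whole loop, retrieve the most relevant passages and then generate from only those, can be sketched in a few lines of plain Python. This is a deliberately naive illustration that uses keyword overlap as the relevance score; a real pipeline would use embeddings and a vector store, and `call_llm` below is a hypothetical placeholder for whatever model API you use.

```python
# Naive RAG sketch: score each chunk by keyword overlap with the
# query, keep the top ones, and build a grounded prompt from them.
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Proteins are made of amino acids and support muscle repair.",
    "Water and electrolytes regulate hydration in the body.",
    "Vitamin C is found in citrus fruits and aids immunity.",
]
top = retrieve("what are proteins made of", chunks)
prompt = build_grounded_prompt("What are proteins made of?", top)
# prompt now contains only the relevant chunk(s), not the whole book;
# pass it to your LLM of choice, e.g. answer = call_llm(prompt)
```

Note that only the retrieved chunks enter the prompt, which is exactly how the context window problem is avoided: the book can be arbitrarily large, but the prompt stays small.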
Okay, so that's the whole concept of RAG. If you have any questions, please ask; I'll be taking all questions through the chat, since the size of the room is quite large. I just want to make sure the stage is clear before we move to the next part. One of the teaching philosophies I follow is that before explaining anything, you need to understand the context behind it. I know many of you might be wondering about the details here: how do we get the relevant context? Which LLM are we going to use? Are we using an OpenAI API key? The LLM I'll come to; that's again an engineer's choice. We can use an open-source LLM or a closed-source LLM, and I'm going to do both: we are going to deploy a local RAG pipeline with an open-source LLM, and I'm going to use a closed-source LLM as well.
Do Gemini Gems use RAG? Yes; in fact, many of these products, like Perplexity, have a RAG pipeline underneath. There is a question by Sankit: what if the question is something like summarizing the whole document? Wouldn't it have to parse the entire doc? Yes, and for summarization there are multiple other things we can do. For example, go to Gemini, and I encourage all of you to try this. How many of you are aware of Gemini's context window? You must be aware of this, right? What I actually did is, along with ChatGPT, I passed the same document to Gemini, and Gemini says that this document does fall within its context window, because apparently its context window is on the order of millions of tokens. For Gemini, such a thing might actually work, because the context window is very large, and there are many reasons how Gemini has improved its context window. If any of you are interested in that, I think the answer lies in this blog, which is also a book, by the way; just check it if you are interested. Anyway, that was a digression. There are multiple questions in the chat related to retrieval from multiple documents. Whatever I have shown you right now is just one document, but you can retrieve from as many documents as you want; it does not need to be restricted to a single document.
Then there is a question: can we retrieve from a database or other formats, like images? We can; I'm going to come to that when I get to the data ingestion pipeline. Samir has asked whether hallucination is due to large context. Hallucination can happen for multiple reasons; in this case there will definitely be hallucination because of the large context, because the whole PDF will not fit in the context window, so the LLM will have to rely on its own pre-trained knowledge, and the answers it generates won't be grounded in this document. That's why we call it hallucination.
In the case of a RAG application, how important is the quality of the LLM? Extremely important, in fact. But again, there is a trade-off here, Amit. What is the trade-off? It is with respect to what the organization values: if the organization values privacy, you want an open-source LLM on your own server, and we are in fact going to use an open-source LLM on our local GPU. I will come to the trade-offs when we get to the engineer's choice section. How important is quality? That I already answered. What if the data is in tabular form? I'll come to the data part right now; all of you who have questions about the data format, that's the next point I'm coming to.
Does RAG help in improving named entity recognition? 100%, it does. In fact, for named entity recognition you have to do chunking in a very specific manner. We did an industrial project recently which involved named entity recognition; for that, you'll have to do what is called structural chunking.
Okay. So there are many questions which
I will slowly start answering. Many of
these questions will become clearer. But
one thing which I do want to address is
what was rag in 2021 and what is RAG
now. So in 2021 retrieval augmented
generation was this cool new thing which
had come to prevent hallucinations and
it's still relevant
because it still solves industrial
problems. But now just zoom out a bit
and take a look at retrieval augmented
generation in context of something which
is called context engineering. So now
there is this new field which is
emerging which is called context engineering.
engineering.
We talked about context a lot in RAG, and I already mentioned that the context window of LLMs is increasing. For example, what if the context window of all LLMs becomes 5 million tokens? It might happen in the next 2 years. Why does that matter? Because then you could just pass the entire PDF to the LLM. But again, there is a trade-off: even with Gemini I would not do this. Why not? Because Gemini charges you per token of input and per token of output. If you pass, let's say, 100 PDFs, even if the context window is large, you will incur a prohibitive cost. So even if the context windows of LLMs become large, RAG will still be valuable to reduce costs.
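As a back-of-the-envelope illustration of that trade-off (the per-token prices below are hypothetical placeholders, not real Gemini pricing):

```python
# Cost of one request, billed per 1,000 input and output tokens.
# Prices are made-up placeholders for illustration only.
def prompt_cost(input_tokens: int, output_tokens: int,
                usd_per_1k_in: float = 0.001, usd_per_1k_out: float = 0.002) -> float:
    return (input_tokens / 1000) * usd_per_1k_in + (output_tokens / 1000) * usd_per_1k_out

full_pdf_cost = prompt_cost(input_tokens=800_000, output_tokens=500)  # whole PDF in context
rag_cost = prompt_cost(input_tokens=3_000, output_tokens=500)         # only top-k retrieved chunks

print(f"full PDF: ${full_pdf_cost:.3f}/request, RAG: ${rag_cost:.3f}/request")
# full PDF: $0.801/request, RAG: $0.004/request
```

The exact numbers don't matter; the point is that you pay the full-document input cost on every single request, while retrieval keeps the input a couple of orders of magnitude smaller.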
Although the LLM can handle it from a performance point of view, it's still not in your best interest to pass the full document. It's like using an elephant to kill an ant: although you can do it, that does not mean you should do it. It will be costly for every single request. Why would you want to pass all the documents? You'll get charged per token.
But now think of RAG within this umbrella of context engineering. In the last class we discussed prompt engineering, right? That's intimately connected with RAG, and one more thing intimately connected with both of these is memory. Essentially, if you are interacting with this chatbot, the one I showed all of you: say the user logs out and comes back the next day. How does the LLM know what conversation happened yesterday?
imagine that I go to a nutritionist,
right? I go to a nutritionist and I ask
a question or I ask multiple questions.
I have a 1 hour session and I go back
again the next day. The nutritionist
will of course remember the thread of
our previous conversation
Or a therapist: if you go to a therapist, they of course have to remember what has happened in the past.
So when you talk about context engineering, memory plays a very crucial role. Here also there is a trade-off: the more memory you save for an LLM, the more context it has, and again the more cost. As the context size increases, the cost increases.
But when you think about RAG these days, you have to think in these terms: what's the context window of the LLM? Do I really need RAG? If the context window is large enough, like Gemini's, I don't strictly need RAG, but I can still use it to save costs, and then how much cost can I save? Why have I mentioned prompt engineering here? Because the success of your RAG pipeline also depends on how you prompt the LLM.
Sanjiv is asking: can you explain context engineering? Yeah. The best way to explain context engineering is this: if you want to make a production-level app like a RAG chatbot, how are you going to manage the different aspects that show up in the context? What are the different aspects? One is, of course, the information retrieved by RAG. One is the memory. One is your current state. Then second, where are you going to save this context? Are you going to save it in a vector database, or in a normal database like Postgres? Where are you going to save the embeddings?
I will come to most of these issues in this workshop. But context engineering is a much broader field now. In 2025, RAG has evolved over these four years: now we think about RAG in terms of context engineering. The main field is context engineering, and within it we start to think: okay, given this context window of the LLM and this application, what's the best thing I can do? Should I do RAG? Should I just do few-shot prompting by passing the whole PDF? How am I going to save memory? Should I save all the conversations as they are, or should I save a summary of the conversations?
Think about this: if you talk with someone for one hour, what do you remember afterwards? You don't remember exactly what that person said; you remember the summary of key points which your mind automatically forms. So you can use another LLM to summarize.
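A toy sketch of that idea: keep the last few turns verbatim and compress older ones. In a real system the `summarize` method would be an LLM call; here, purely as a stand-in, it just keeps each turn's first sentence.

```python
# Toy summary-based chat memory. `summarize` is a placeholder for an LLM call.
class ChatMemory:
    def __init__(self, max_verbatim_turns: int = 4):
        self.max_verbatim_turns = max_verbatim_turns
        self.turns: list[str] = []      # recent turns, kept verbatim
        self.summary: list[str] = []    # older turns, compressed

    def summarize(self, turn: str) -> str:
        # Placeholder compression: first sentence only. A real system
        # would call a (cheaper) LLM here instead.
        return turn.split(". ")[0]

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_verbatim_turns:
            oldest = self.turns.pop(0)
            self.summary.append(self.summarize(oldest))

    def context(self) -> str:
        # The text we would prepend to the next prompt.
        return ("Summary: " + " | ".join(self.summary) +
                "\nRecent: " + "\n".join(self.turns))
```

This is exactly the trade-off from the lecture in miniature: the summary costs fewer tokens on every future request, at the price of losing detail from old turns.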
So context engineering is best practices around context. And when you say context, it means many things: it means memory, it means retrieved information, and more.

Will RAG be relevant in the long run as LLMs themselves improve? Yeah, this I think I already answered, Amit. Let's say you are JP Morgan, and you want to make a chatbot specific to your data. I think RAG will still be relevant, because passing the entire PDF each time will be computationally and cost-wise prohibitive.
Okay. So let's table the questions for
some time because now what we have to do
is that we have to get started with the
first pipeline.
Many people have asked questions about document pre-processing, and I want to spend some time here.
Uh this is the whole pipeline which we
are going to build in this workshop. By
the way,
let me walk you quickly through the different elements. We are going to start with this nutritional PDF, and I'm going to have a section on that. Then we'll have a section on chunking. Then we have a whole section on embeddings. Then we have a whole other section on LLMs, whether open-source or closed-source.
And then finally we'll put all of this
together and run everything on a local
GPU. After this is done, we will do
production level rag and build this
website. So we do have a number of things to cover. At the pace we are going, I'm not really sure how much time this workshop will take. I'm very happy to answer all the questions, but from your side, please note that it may take more than 3 hours, because we have to do all of these parts. Let's take a call based on how much we cover today and how much we are able to cover tomorrow.
Okay. So the first step is data ingestion, and this is often the most neglected step in tutorials and video sessions everywhere, because it's not very cool. When I say cool: everyone talks about embeddings and LLMs, but the part which many should definitely be talking about is how you are going to collect the data and how you are going to store it.
Um, so let me ask all of you right if
you have this PDF, how will you collect
this PDF so that a Python interpreter
knows what to do with it? How will you
open this PDF and how will you read this
PDF in code?
So we need to ingest the data and store it somewhere. Our LLM is going to look at that data and then answer questions.
But right now it's in PDF format. We
humans can see it, right? But a Python
code needs to understand it.
Someone is saying: PDF to text, the pain of document parsing. Yeah, that point which you have mentioned, I'll come to. We will use a Python library to do document pre-processing, which here essentially means downloading and reading PDFs. Now, in this section I want to talk about three kinds of documents: documents which only have text, documents which contain images, and documents which contain tables.

If you have a simple document in PDF format, you can use packages to load it, and one popular package is PyMuPDF. I'm going to show you several packages and the way we decide which one to use for a given problem. This workflow which I'm giving you right now is exactly what we do internally when a problem comes in. So check this package.
Actually let me show the GitHub version
of this package.
So PyMuPDF is a traditional Python library for data extraction from PDF documents, and a really very robust library. Using this library you can pass in any PDF and open it.
Using this library you can also read different data. When I say read different data: you can read different pages. For example, this entire PDF can be ingested by the library, and then we can save what information is on every page. Now let me ask you this question.
Let's say this image comes up. What do you think the PDF extraction library will do at this point? Some say: it's all text, skip it. I'm looking for a specific answer, so first let me ask whether it will be able to deal with this image at all. Those who are answering no: that is not the correct answer. It will be able to deal with this image, because this is a digital image. There will be an image tag associated with it, and it will be downloaded in an image format. But here is the catch.
Let's say you get an image like this: there is a restaurant bill, and someone takes a photo of it and uploads it somewhere. Will the library I'm showing you deal with that? It will not read this type of image, and that is one key thing to understand. It will save the entire thing as an image, but it will not read the text present on the image, unless the text was typed through a digital form. If the bill is generated through digital software, and every field entered is a digital field, that will be taken into account by a tool like this.
But if you have an image which just has some characters on it, it won't. So what do I mean by digital? By digital I mean: let's say I go to an invoice software tool, I fill in entries there, and I generate a PDF from the tool. That is a digital entry, because every number is digitized.
Then a standard PDF extractor can also read that number and see what is mentioned. If it's digital, we can copy text from the PDF. But if it's not digital, like this one, we cannot copy text from the PDF, or at least normal Python libraries cannot. That is where we need libraries which can deal with something called OCR.
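A common rule of thumb for deciding, page by page, whether OCR is needed (my own framing, not something from the lecture): if the extractor reports images but almost no copyable text, the page is probably a scan. A sketch, with an arbitrary threshold:

```python
def needs_ocr(extracted_text: str, image_count: int, min_chars: int = 25) -> bool:
    """Heuristic: a page with images but almost no copyable text is likely
    a scan or a photo, so it needs an OCR tool such as Tesseract."""
    return image_count > 0 and len(extracted_text.strip()) < min_chars

print(needs_ocr("", image_count=1))                         # photo of a bill -> True
print(needs_ocr("Chapter 1: Macronutrients intro", 2))      # digital text with figures -> False
```

The inputs here are exactly what a plain extractor like PyMuPDF already gives you, so this check is cheap to run before deciding to invoke a heavier OCR step.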
The best open-source OCR library, as you can also see from the number of GitHub stars it has, is Tesseract. Tesseract is one of the most popular OCR libraries. But before introducing it, all of you should know why OCR is needed in the first place. For the current PDF we have, many of you gave the wrong answer: we don't need an OCR library here. This is just a simple image with no text on it; a simple PDF extractor can deal with it. You need OCR libraries only when you have images with, say, handwritten text, or images which have been scanned and uploaded into a document, which might be the case for many clients.
That is the place where you need Tesseract. Tesseract can extract handwritten text, digitally scanned text, and so on; it can extract text from images like this one.
That's the second option I wanted to show you in this data ingestion pipeline. Question from the chat: how would the text extractor know whether the image contains text or not? It would not know, right?
You mean Tesseract? Tesseract knows because the libraries it uses specifically look for text in the image. But if you use PyMuPDF, it will not know; it will just save the entire image as one image. The fruit image, right? It will not know whether there is text or not. So even if this image has text, PyMuPDF will save it as an image, but we will not know what text is written on it. That's the main issue. The image will be saved, that's not an issue, but all the information on that image will be accessed as one whole body; there will be nothing like "there are characters in this image" or "there is text in this image". The granularity will be lost if you don't use OCR.
Then comes the question of tabular data, right? How do you deal with tabular data? For that, I want to introduce a third library which has now become extremely popular, and I would say it's hands down one of the best libraries for language-modeling tasks. How many of you have heard of Docling?
Yeah. So Docling is relatively new; I think it's newer than all the other libraries I showed you. We have already used Docling in our industrial projects, and it's amazing. One reason Docling is amazing is that it is specifically meant for generative AI. What do I mean by that? Whenever Docling encounters a table in a PDF, the table is saved as a real table: rows and columns are preserved.
Docling can even convert a schema into a JSON format directly. And if any of you has used language models in production before, you know it's very important to retain certain elements in JSON format or in markdown format. So if you encounter a table somewhere, or any schematic or schema, that can also be analyzed by Docling and saved as a table. Further, Docling can be externally linked with an OCR tool like Tesseract, so you have OCR capability as well. You can extract tables very easily, and you can extract schemas very easily.
In fact, if any of you is interested, this is the Docling technical report, where they mention exactly how they manage to retain tables during extraction.
What happens if the input document has text, images, tables, and images with text? Exactly: what will you do in that case? If your text document is extremely messy, with images, tables, and scanned copies, then you can use Docling and you can use OCR along with it.
If your document is very simple, like what I have, you can use PyMuPDF. If your document does not have many tables but just has scanned images, you can use Tesseract.
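That three-way rule can be written out as a tiny decision helper. The function and its return strings are my own naming for the lecture's rule, not any official API:

```python
def choose_extractor(has_tables: bool, has_scanned_images: bool) -> str:
    """Pick a document-processing tool following the rule from the lecture."""
    if has_tables and has_scanned_images:
        return "docling + external OCR (e.g. Tesseract)"  # messy docs: tables AND scans
    if has_tables:
        return "docling"     # real tables, rows/columns worth preserving
    if has_scanned_images:
        return "tesseract"   # scans or photos: OCR needed
    return "pymupdf"         # simple digital text: fastest option
```

Writing the choice down like this also makes the trade-off explicit: you reach for the heavier tools only when the document actually demands them.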
So this is the first engineer's choice section. I mentioned at the start that I will have this section for all these parts: data ingestion, chunking, embedding, and open-source versus closed-source LLMs. So this is the first point where we hit the engineer's choice: given a project, what document-processing tool are you going to use? That's the first thing you need to understand,
and that depends on the type of documents you really have. Now, actually, before this there is one more step which I have not discussed, and that is related to scraping. It may happen that some client websites have PDFs which you can download, but in some cases the data is not in PDF format. Then you need to first scrape that entire data set, and then use the processing tools which I just mentioned.
How can we have a hybrid pipeline with all three? Samrat, if you want a hybrid pipeline, the best approach is to use Docling with an external OCR tool. If you go to the Docling documentation itself, they say they can handle diverse formats, which is good; they can export into various formats like markdown, HTML, and JSON; and most importantly, they have extensive OCR support for scanned PDFs and images. So this one library has all of these things if you are dealing with complex PDFs, or rather complex images.
One more thing we explored at Vizuara recently is Mistral OCR. How many of you are aware of this? They have a special model which they released recently, which is apparently supposed to be very good at OCR.
Okay, so one good question has been asked in the chat: what about this Miro board itself, if it's to be retrieved? Let me ask that question to all of you. Take this Miro board which I have: which tool would you use to retrieve text from it? Docling for sure would be good, but I would probably use Tesseract for this. The reason I would use Tesseract is this:
What do I have here? If you think about it, I have some images and some written text, and this text is very messy. So if you take screenshots of it, a normal PyMuPDF of course will not be able to handle it. But I don't have anything too complex: I don't really have any table. Even the table I do have is an image, not a real table, so technically I have no tables. I would probably take images of this and make them into a PDF. So I just have a PDF with images and text which will be scanned. I definitely need an OCR tool here.
Yeah, this Mistral OCR, I want to spend some more time on it because it's new; it just came out, I think three months back. We are trying it at Vizuara right now. I don't know how good it is yet, but it's supposed to be amazing. And there are several such LLMs which are specifically meant for OCR tasks.

Scraping. So let me tell you a bit about scraping now.
So let's say you go to the Mahindra and Mahindra website, and you are doing a project with Mahindra, and what they have told you is: I want to make a chatbot specific to, let's say, Mahindra Rise. But they have not given you any data. What will you do at this stage? How do you collect the data if the client has not given you PDF copies, or really anything about the data? The only thing you can do at this point is called scraping.
Yeah. So what you have to do is go through the different sections and use a scraping tool to scrape this data. I'm going to tell you about two or three scraping tools which can be used. The first is called Firecrawl.
Again, a very good scraping tool; it has around 50,000 GitHub stars. One good thing is that with Firecrawl you probably don't even need a PDF extractor tool, because it takes the entire website and converts it into LLM-ready markdown or structured data. That's one tool. The second, as someone mentioned in the chat, is Beautiful Soup: if you have HTML pages especially, Beautiful Soup parses them and extracts everything. And another tool is called Puppeteer.
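Before moving on to Puppeteer: the Beautiful Soup route just mentioned might look like this, assuming `pip install beautifulsoup4`; the HTML string below is a made-up stand-in for a fetched page:

```python
# Minimal Beautiful Soup sketch; the HTML stands in for a downloaded page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2>About Mahindra Rise</h2>
  <p>Company overview text.</p>
  <h2>Careers</h2>
  <p>Open positions.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```

Selecting only `h2` or `p` tags like this is the same idea as the tag-based filtering discussed next for Puppeteer, just done after the page has already been fetched.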
It's an automation tool, and with Puppeteer you can do some clever things. Someone mentioned named entity recognition, right? So what if you want to go through different sections, but you only want to take the headings or titles from each page? Doing that specific kind of extraction with a normal web scraper is a bit difficult. In Puppeteer, you can automate the scraping by specifying, for example, that only a certain font size or only header tags should be selected, or only paragraph tags, when you scrape. Puppeteer is installed as a JavaScript library.
java javascript library. Yeah. So it's an API to control Chrome
Yeah. So it's an API to control Chrome or Firefox. It can go through. So my
or Firefox. It can go through. So my question to all of you is this. Let's
question to all of you is this. Let's say if you have if a client has 5,000
say if you have if a client has 5,000 links, you cannot manually go and scrape
links, you cannot manually go and scrape each link, right? You you need an
each link, right? You you need an automation tool which goes through this
automation tool which goes through this link. It scrapes whatever is there. Then
link. It scrapes whatever is there. Then it goes through this link, scrap scrapes
it goes through this link, scrap scrapes whatever is there. Puppeteer provides
whatever is there. Puppeteer provides you that advantage. You can automate an
you that advantage. You can automate an entire workflow through puppeteer and
entire workflow through puppeteer and just sit back and get all the files
just sit back and get all the files downloaded but you need to define that
downloaded but you need to define that workflow very nicely.
Selenium is also good. But Jay, I found Puppeteer better, at least for us. We had a client project where we used Puppeteer: they had around 5,000 documents they wanted extracted through scraping, and manual scraping took a long time, so we used Puppeteer at that point.
How effective is Firecrawl when dealing with websites that require authentication? I'm not sure, actually, how it bypasses website authentication; I need to check that.
Yeah, manual scraping takes a huge amount of time. In fact, for the client project I mentioned earlier, we were doing manual scraping, but it was just too expensive in terms of time.

How good is Docling at extracting data in tabular format? Yeah, definitely good, Ashwini. The Docling tool I mentioned can extract data from almost anywhere: from images, from tabular formats, from PDF snippets, basically anything you want. But keep in mind that if any of you is actually working on an industrial project, sometimes clients don't give you data even in PDF format; then you have to do scraping on top of it.
Okay. Now what we are going to do is code the first part which we just saw. I will take the remaining questions in the chat, but first: for this PDF, all of us have identified that we will use PyMuPDF. Did everyone understand why we are using PyMuPDF for the current task? Can you type yes in the chat if you have understood why we are using PyMuPDF for the current project and not any other tool?
Okay, good. So now our coding journey is going to start. I'm going to share this Google Colab code file with all of you, and after the data extraction is done, we are going to take a small break. I know attention spans are a bit short, but no issues.
So this is the Google Colab code file. And someone has asked to share the PDF, right? Yeah, the PDF I actually shared at the start of the lecture itself; not as a document, but in the Drive folder I shared at the start of the lecture.
Oh yeah. So Jay, that's a great point you mentioned, which I definitely want to address; I actually forgot to. You might be wondering why PyMuPDF even exists: because it's extremely fast. PyMuPDF is 10 to 15 times faster than Docling; that's the trade-off here. There are some Reddit threads which actually argue about this; see, this one says Docling is at least 50 times slower than PyMuPDF. So if you have simple text like we do, don't use the powerful libraries unnecessarily; that will just be very slow for you. But that's a good point you bring up. I wanted to touch upon it, but it slipped my mind.
my mind anyway. So all of you have access to
anyway. So all of you have access to this notebook. Now the first thing which
this notebook. Now the first thing which you have to do is you have to go to
you have to do is you have to go to runtime and you have to switch to T4
runtime and you have to switch to T4 GPU.
GPU. We are going to start very slowly and we
We are going to start very slowly and we are going to start with the data
are going to start with the data injection pipeline. Okay. So before that
injection pipeline. Okay. So before that there is some a long text here which you
there is some a long text here which you can even read after this lecture is
can even read after this lecture is done. I have covered this all in the
done. I have covered this all in the initial portion of the class. This
initial portion of the class. This schematic also I have shared on the
schematic also I have shared on the mirro board. Now what we can do is
mirro board. Now what we can do is directly start from here requirements
directly start from here requirements and setup. So if all of you are
and setup. So if all of you are connected to T4 GPU, this notebook
connected to T4 GPU, this notebook should by the way by default already
should by the way by default already connect you to T4. And then just click
connect you to T4. And then just click on this. So the first two cells are
on this. So the first two cells are where we are installing the packages.
where we are installing the packages. These two steps will take some amount of
These two steps will take some amount of time. So I'm going to wait for here till
time. So I'm going to wait for here till all of you are running this. And
all of you are running this. And meanwhile, let me answer some questions
meanwhile, let me answer some questions in the chat which I might not have seen.
in the chat which I might not have seen. Can you share the PDF? I Yeah, I think I
Can you share the PDF? I Yeah, I think I shared it right now.
shared it right now. I am working on a project where I need
I am working on a project where I need to extract release documents from GitHub
to extract release documents from GitHub pages. Is Puppeteer a good choice? Yeah,
pages. Is Puppeteer a good choice? Yeah, definitely.
definitely. First, Spurs, I would encourage you to
First, Spurs, I would encourage you to explore fire crawl
explore fire crawl because Puppeteer is a very low-level
because Puppeteer is a very low-level library. When I say lowle, it directly
library. When I say lowle, it directly operates at JavaScript. So if you want
operates at JavaScript. So if you want to use puppeteer you need to be very
to use puppeteer you need to be very comfortable with JS code.
comfortable with JS code. Fire crawl abstracts many things. So
Fire crawl abstracts many things. So it's easier to use. If you are
it's easier to use. If you are comfortable with JS then I would suggest
comfortable with JS then I would suggest to go ahead with JS. Sure.
After setting up the data pipeline, the biggest challenge I faced was keeping changing data synced with the vector database; any suggestions? Great point, Prashant. I will come to this. I do have a suggestion, and in one word the suggestion is to use pgvector; we are going to use pgvector. Essentially, the best way to keep the database and the vector database synced is to keep everything in one place. The way to do this is to use a Postgres database with pgvector. We'll see that tomorrow.
Where can I get the link to this notebook? The link I have already shared. Oh, I shared the copy link; in this copy file I have removed the Hugging Face access token. Yeah, this is that link.
There is some question in the chat about this paper. This one, actually, I have not seen yet; let me check it. Yeah, it seems to be very highly cited, especially for vision-based document retrieval. One metric I look at to check how popular a tool is, is GitHub stars and how active the repository is. It seems to be quite active; the last commit was made 5 days back. That's a good paper; I'll definitely add it to my reading list.
I was asked in an interview: if we extracted anything using an LLM or RAG, how will we validate that it is correct? Again, a very good question. So, Prem, always remember that there are two types of validation: structural validation and semantic validation. When I say structural validation, it means checking whether the structure of your retrieved items is correct or not. One way to implement structural validation, which we have already seen in one of the previous lectures, is to use Pydantic, where we can check whether the format is correct. But for semantic validation there are two approaches: human as a judge or LLM as a judge. Either you have the ground-truth data and you validate against that, or you use a larger LLM to generate the ground truth and validate your extraction against it.
extraction with that. How do you keep track of good papers and
How do you keep track of good papers and make it a habit? Yeah. So that is a bit
make it a habit? Yeah. So that is a bit challenging. So one thing which has
challenging. So one thing which has honestly worked for me amit bit
honestly worked for me amit bit counterintuitive is LinkedIn. My
counterintuitive is LinkedIn. My LinkedIn feed is extremely well curated
LinkedIn feed is extremely well curated and that is also because I spend a lot
and that is also because I spend a lot of time scrolling through LinkedIn and I
of time scrolling through LinkedIn and I read mostly I'm on LinkedIn so I read
read mostly I'm on LinkedIn so I read things which I like. So algorithm picks
things which I like. So algorithm picks up on that. So everything which I get is
up on that. So everything which I get is from people who talk about new things.
from people who talk about new things. Um so I'm following some key set of
Um so I'm following some key set of people who whenever something new is
people who whenever something new is released they will post it.
released they will post it. So mostly I'm trying to avoid flashy
So mostly I'm trying to avoid flashy things on LinkedIn. There are like two
things on LinkedIn. There are like two camps. One camp is like whenever let's
camps. One camp is like whenever let's say context engineering right whenever
say context engineering right whenever context engineering is a thing then
context engineering is a thing then someone will make a post that five
someone will make a post that five reasons why you should learn context
reasons why you should learn context engineering. I avoid those but on my
engineering. I avoid those but on my feed there are people who write about
feed there are people who write about let's say context engineering what are
let's say context engineering what are the papers you should read then how is
the papers you should read then how is it different from so more informative
it different from so more informative and
and not too much flash it's getting a
not too much flash it's getting a challenge for me but I make it a point
challenge for me but I make it a point to at least read two papers per week
to at least read two papers per week and also implement those
I do have a to-read list; I'll share it with you. I only make it week to week, so I have it for this week. In this week's to-read list I have this Transfusion paper; it is on my to-read list for this week. And one more thing on my to-read list is the link which I already shared with you. It's this. In fact, I already ordered one of these books for our office, because I'm now encouraging all of our people to master GPU programming. I can't believe they made this free. It's amazing, but an extremely complex walkthrough of how LLMs utilize our GPUs. But I like ordering physical books, so I've ordered two copies for our office. This is also on my to-read list; I've finished two chapters. I'm going to make a course on this, because I have literally not found a single good course on GPU programming anywhere.
Uh, okay. So, how many of you have finished running up to these two steps at the moment? How many of you have finished installing the packages? You have, right? Okay, good. Now, the next step is document processing. In this part we are going to download the PDF.
PDF. If it does not exist it's fine. So one
If it does not exist it's fine. So one way is to just add it on the left hand
way is to just add it on the left hand side over here. But if it does not exist
side over here. But if it does not exist in this code we'll just go ahead and
in this code we'll just go ahead and download the PDF. And the next code
download the PDF. And the next code block is where we are actually going to
block is where we are actually going to read this PDF. So let's go through this
read this PDF. So let's go through this code block step by step. First there is
code block step by step. First there is a text formatter. So what it will do is
a text formatter. So what it will do is that it will make sure there are not
that it will make sure there are not empty spaces in any of the text which we
empty spaces in any of the text which we are reading. Then we have this open and
are reading. Then we have this open and read PDF. So this import fits which we
read PDF. So this import fits which we are doing right that's the pyu pdf.
are doing right that's the pyu pdf. This py mu pdfdf github repository when
This py mu pdfdf github repository when we do import fits that loads the
we do import fits that loads the package. Um and the way we open a file
package. Um and the way we open a file through pyu pdf is doing fits.open.
through pyu pdf is doing fits.open. Then what we are going to do is that we
Then we are going to go through every single page in the document and get the text from that page with `page.get_text()`. Then I'm going to format this text to remove empty spaces. And then I'm going to maintain a list: for each page, I'm going to store the page number, the number of characters on that page, the word count, the number of sentences, and the actual text. So what this piece of code is doing is maintaining a list called `pages_and_texts`, and each element of that list is a dictionary. The first element of the list is page one, and page one is a dictionary. Similarly, the second element is page two, and so on. So essentially I'm making a list, page number one, page number two, dot dot dot, right up to page number 1208, and for each page I'm storing these values: the page number, the counts, and of course the main thing, the text itself.
the text also I'm storing. So you can run this now and then what
So you can run this now and then what you can do is that you can
you can do is that you can just randomly print out two dictionaries
just randomly print out two dictionaries from this list. So I have printed out
from this list. So I have printed out the page number text for one page and
the page number text for one page and this is for second page. So you might be
this is for second page. So you might be wondering why is this minus 41 here,
wondering why is this minus 41 here, right? Why am I subtracting minus 41
right? Why am I subtracting minus 41 over here?
over here? The reason is that if you actually take
The reason is that if you actually take a look at our
a look at our uh book right, it really starts from
uh book right, it really starts from page number 41 or 42 here. This is where
page number 41 or 42 here. This is where our book actually starts. Yeah. Here. So
our book actually starts. Yeah. Here. So what is actually page number one
what is actually page number one is page number. So you need to subtract
is page number. So you need to subtract 42 pages actually to get to page number
42 pages actually to get to page number one.
one. So all of the pages which come before
So all of the pages which come before this are marked as negative since we
this are marked as negative since we subtract 41 and then page number one
subtract 41 and then page number one will rightly start from here.
And then we can just get a random sample. Our list is called `pages_and_texts`, and we can get a random element from it. Here we have got page number 1019; the number of characters is 1574 and the number of words is 270. Oh, by the way, we are also maintaining the number of tokens. For this, the simple thing we are doing is taking the number of characters divided by four; that's the number of tokens we are assuming. So each page dictionary will look something like this: the page number, the number of characters on that page, the number of words, the number of sentences, and the actual text. That's it.
And then you can actually get some statistics on the text. Just run this and look at the different statistics. For example, for this page the character count is 29, the word count is 4, the sentence count is 1, and the page token count is 7.25. And then you can get the overall statistics. This is the main thing we want to focus on right now: the mean row. On average, the pages have roughly 198 words, around 10 sentences, and around 287 tokens each.
Why is this important? Why are we looking at the number of tokens on each page? Can someone try to think why we are looking at the number of tokens on each page? There is an error which Krishna has got: `pages_and_texts` is not defined. Krishna, have you run this cell? Because we have defined `pages_and_texts` over here.
Now I'm going to the whiteboard, and the question I'm asking all of you is: we got these statistics, right? We got these statistics that each page has, let's say... Yeah, so eventually, say we want to take a page and convert it into an embedding vector, and say we use this model, all-mpnet-base-v2. The issue is that in very fine print they have mentioned that input text longer than 384 word pieces is truncated. So that is going to be an issue for us. If a page is more than roughly 384 word pieces, we cannot embed the entire page into a vector using this model, because then some information will unfortunately be lost.
So that's why it's just a better idea, whenever you're looking at pages, to check how many words and how many tokens they have on average. Here it seems that each page, on average, is 287 tokens, which is less than 384, right? So it is fine to go ahead: potentially each page can be embedded with this embedding model. Currently we have not decided which embedding model to use; we have not even decided whether one page equals one chunk. But potentially, if we decide that one page is one chunk and we want to embed each page, we can quite safely use all-mpnet-base-v2. That's the reason why we should actually keep track of how many tokens and how many words are on each page. The thing is, when you directly use RAG libraries like LangChain, all of this information is lost to you. They directly give you a parsed PDF, but you should see for yourself how many pages there are, and what the token count, word count, and sentence count on each page are, etc.
We are going to come to chunking right now, so don't worry about it; the next thing we are going to do is chunking. Rahul has asked a question: is RAG plus an SLM a practical combination? Yeah, definitely, because RAG is much better than fine-tuning in many cases. Anyways, we'll come to that after the lecture is done. That is the mean; what about the max? Sure, look at the max, but check the standard deviation also, right? The standard deviation is 140, so even with that, one or two standard deviations land around the 400-token length or so. So we are fine.
Sorry, I did not understand; what do you mean about LangChain? So, when you see RAG tutorials for LangChain or LlamaIndex, those tutorials are 10 to 15 minutes long and they completely skip this part. They already assume that you have a PDF, and everything starts at a much later stage. But in practice, this is what you have to do first. This is the exploratory-data-analysis equivalent: when we do a normal machine learning problem, we do EDA, right? You also need to do some EDA when you do RAG.
There is a question about the lecture recording. I will share the lecture recording and the Google Colab; I have already shared it in the chat.
Okay. So now we are going to take a break for some time, and then we are going to cover chunking. I definitely do want to cover chunking today, because it is one of the most important pieces of the puzzle, and nowhere on the internet, in any YouTube video, have I found a comprehensive explanation of chunking. There are blogs on chunking, and there are good blogs, but blogs can only take you so far, right? In the chunking section, first we are going to understand all the types of chunking in detail, and then we are actually going to code different chunking strategies from scratch and compare them with each other.
But we will take a break. Earlier I had planned one and a half hours for today and one and a half hours for tomorrow, but it looks like today itself will take around two and a half hours. I did not plan a three-hour workshop today, Sanjay, honestly, but it's good that you are asking so many questions. We have many more things left to cover, so it depends on your schedule. If any of you want to catch the recording instead, you can do that. Anyway, I will come back after five minutes to start the chunking part. If you are available, you can stay live to watch the chunking; if not, I'm going to upload the lecture recording anyway.
Uh, yeah, Samrat, when we do chunking, we don't strictly need the EDA later, but it's still good to see the number of tokens we have; it might change our intuition later. Okay, I'll come back after 4 to 5 minutes. It might take one to one and a half more hours today, so today we can finish chunking, and then tomorrow we can do embeddings, the LLM, and the final production part. Yeah, thanks guys. I'll come back around 9:35.
All right everyone, let's begin with the next part of today's lecture, which is going to be chunking. There is a reason why I have allocated a separate section to this: I believe it is one of the most important pieces of the RAG pipeline. Let me explain why chunking is important. Until now, we have processed the PDF; that part is done. Now, finally, this is our LLM. The LLM will get a prompt from the user, of course, but the LLM will also get some retrieved information, which comes from our knowledge base, the PDF.
What we are doing in the chunking section is essentially bridging this gap. We have processed the PDF; how do we go from this PDF to retrieving the bits of information which are important? There are two key steps to this: the first is chunking, and the second is called embedding. We are going to look at embedding tomorrow, but today let's cover chunking. So the way it works is, let's say, let me take a sample.
The first thing I'm going to do is divide this PDF into chunks. When I say chunk, a chunk can be, let's say, these are my chunks. This can be one type of chunking: imagine that in the whole PDF every sentence is one chunk. Or you can even have page-level chunking, where this entire page is one chunk, this entire page is another chunk, and so on. Now, let's say you do some sort of chunking and you have these chunks. Say you have split the document, the knowledge base, into 2,000 chunks.
For the retrieved information, the only portion you are going to select is some of these chunks, the ones most closely related to the prompt. You can select the top chunk, or you might select the top three chunks most closely related to the prompt. You can select one chunk, two chunks, or three; that you have to decide, but normally people select between 1 and 10 chunks. So let's say you select three chunks; these are the three chunks which will be passed as the retrieved information.
information. Now you see the problem here is that or
Now you see the problem here is that or I should not call problem.
I should not call problem. Your the quality of your output is going
Your the quality of your output is going to completely and solely depend on your
to completely and solely depend on your retrieved information and your retrieved
retrieved information and your retrieved information is going to completely
information is going to completely defend depend on what type of chunks you
defend depend on what type of chunks you have. Because if you have granular
have. Because if you have granular chunking like sentences, this will be
chunking like sentences, this will be just one sentence. This will be second
just one sentence. This will be second sentence and this will be third
sentence and this will be third sentence. So you'll just pass three
sentence. So you'll just pass three sentences. But if you have broad level
sentences. But if you have broad level chunking like pages then each chunk will
chunking like pages then each chunk will be one page. So you'll be passing page
be one page. So you'll be passing page one, you'll be passing page two and
one, you'll be passing page two and you'll be passing page three.
you'll be passing page three. So
So imagine this as the brain of the LLM and
imagine this as the brain of the LLM and uh so this is the LLM and this is the
uh so this is the LLM and this is the data.
data. the retrieved information which passes
the retrieved information which passes through the LLM will be from a list of
through the LLM will be from a list of chunks and only a subset of these chunks
chunks and only a subset of these chunks will be passed to the LLM. So from the
will be passed to the LLM. So from the engineer's perspective it becomes
engineer's perspective it becomes extremely important to decide how are we
extremely important to decide how are we exactly going to do the chunking. There
exactly going to do the chunking. There are so many ways right the the sky is
are so many ways right the the sky is completely open that we can do anything.
completely open that we can do anything. So now let me ask all of you. Let's say
So now let me ask all of you. Let's say this is the PDF
this is the PDF U 1,28 pages PDF. How should we go about
U 1,28 pages PDF. How should we go about chunking?
chunking? What will you have as individual chunks?
What will you have as individual chunks? Heading-wise? So Samrat is saying heading-wise, right? Essentially, I think what Samrat is saying is that wherever there are headings, you make that one chunk. So if "Carbohydrates" is a heading, make the carbohydrates section one chunk; if "Lipids" is a heading, make that section one chunk; if "Proteins" is a heading, make that section one chunk. I think that's what Dishant means by sections. Aditya has an interesting suggestion: let me not focus on the structure of the PDF. I will actually write down all of your suggestions over here. The first suggestion is based on the structure, so based on headings. What else? The suggestion by Aditya is with respect to similar topics, or semantics. Then JP says document structure; let me bucket this in the same segment and call it document structure for the moment.
If the sections are a bit big, divide them into paragraphs with a limited word size; the maximum number of tokens the LLM can handle sets the cap. So let me make the third category fixed. When I say fixed, maybe it's 10 sentences as one chunk, or 10 words as one chunk, or one word as one chunk; whatever it is, this is fixed-size chunking.
Intuitively, if this terminology of chunking were not known to me, or if I had not studied the retrieval augmented generation literature, I would have said that one chunk is one section, because when I read a PDF my mind thinks in terms of sections. So if a certain question is asked by the user, ideally you should retrieve a full section and give it as the answer. I don't want to retrieve just a few sentences; I want to retrieve entire sections and pass them on. That's why I think chunking should be done section-wise. That can be one example. There are also people mentioning recursive chunking plus overlap; for some of you this might not be clear, so I'll come to that eventually. Okay, so that's the intuition which comes to my mind.
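To make the section-wise intuition concrete, here is a minimal sketch of heading-based chunking. The heading convention (a line of plain Title Case words with no punctuation) and the sample text are assumptions for illustration only, not part of any real pipeline:

```python
import re

def chunk_by_headings(text: str) -> list[str]:
    """Split a document into chunks, one per section heading.
    Assumes a heading is a line containing only letters and spaces,
    starting with a capital letter (an illustrative convention)."""
    # Split at the start of any line that looks like a heading,
    # using a zero-width lookahead so the heading stays in its chunk.
    parts = re.split(r"(?m)^(?=[A-Z][A-Za-z ]+$)", text)
    return [p.strip() for p in parts if p.strip()]

doc = """Carbohydrates
Carbs are the body's main energy source.

Lipids
Lipids store energy and build membranes.
"""
for chunk in chunk_by_headings(doc):
    print(repr(chunk.splitlines()[0]))  # 'Carbohydrates' then 'Lipids'
```

Each chunk starts at a heading and runs until the next one, which is exactly the "one section = one chunk" idea from the chat.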
Now let's go through the five types of chunking which we are going to see. Towards the end we will also have an engineer's-choice section on which chunking strategy to use, and then we will code the different chunking strategies and actually see their similarities and differences. My hope is that after this section all of you understand the trade-offs. At the start of the lecture I mentioned trade-offs, right? There are a lot of trade-offs between different chunking strategies, there is no one-size-fits-all approach, and different chunking strategies definitely lead to different results.
In fact, within our company we have actually made a PDF of chunking strategies; I'm trying to find that PDF right now, just a minute. Yeah, here it is. I'll share it with all of you. This guide is specifically about the different types of chunking strategies and which chunking strategy to use when. This is one of the most important things for engineers to understand, and my main purpose with this workshop is to show how to make engineering decisions like this. But to make such decisions, first we have to understand what the different chunking strategies are. So let's start. Before evaluating different chunking strategies or coding them, all of you need to understand what exactly is done in each. Some chunking strategies are easy to understand, some are slightly more involved, but each of them serves a specific purpose. First, let's go with fixed-size chunking.
In fixed-size chunking, here is what is actually done. Let's take a PDF, say this legal services agreement. Suppose you are making a RAG system for the legal domain, and you have a PDF which looks like this, with responsibilities of the law firm and the client, and so on. In a fixed-size chunking strategy you specify that every chunk will be of a fixed size, let's say 200 words. All my chunks are going to be 200 words; I'm not going to look at anything else. My chunk one is going to be 200 words, my chunk two is going to be 200 words, that's it. And I can also have a slight overlap between these chunks, so as to make sure that some amount of context is retained.
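As a minimal sketch of that idea (the 200-word window and 20-word overlap are just the illustrative numbers; the function name is mine):

```python
def fixed_size_chunks(text: str, chunk_words: int = 200, overlap_words: int = 20) -> list[str]:
    """Split text into word-based chunks of a fixed size, with overlap
    so that some context carries over between neighboring chunks."""
    words = text.split()
    step = chunk_words - overlap_words  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = fixed_size_chunks(doc)
print(len(chunks))            # 3 chunks for a 500-word document
print(chunks[1].split()[0])   # second chunk starts 180 words in: word180
```

Note there is nothing semantic here at all: the window boundaries fall wherever the word count says, regardless of sentences or sections.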
But can you tell me what the drawback of this approach is? What are the advantages and disadvantages, according to you? Here again, try to think from first principles. Imagine you are making this RAG system as a chatbot: a customer asks something about an agreement and your chatbot should answer. The retrieved information will come in chunks. Why, or why not, should you go ahead with a fixed chunking strategy like this, with each chunk being 200 words?
From the chat: incomplete responses; context lost; sentences cut in between; lacks contextual overlap.
So yeah, let's take a look at this example itself, where the text is being cut. "Responsibilities of law firm and client" should ideally be one full section, and I want this entire thing passed into my retrieved information. But because of the chunking, say one chunk contains the heading "responsibilities of law firm and client". When a user asks the chatbot what the responsibilities of the law firm and client are, that chunk will be retrieved. But the following chunk does not have anything directly tying it to the question; it carries some context because we retain some overlap, but most of it belongs to other sections. So that chunk will not be retrieved, which means we are actually losing out on a chunk's worth of information that is completely relevant to our current section.
That's one major disadvantage of fixed-size chunking: chunks can be made in the middle of important paragraphs, even in the middle of sentences. A good question is asked: won't embeddings create a match for similar text? Embeddings will create a match, but what if your chunk is formed at a place where there is nothing with respect to the question being asked? Say it's just two sentences at the end of a paragraph, where the context of what comes before is lost. If your chunk unluckily starts at a point where the information of the section title is lost, then that paragraph won't be retrieved. And currently I'm just showing a small paragraph; if you have a huge paragraph related to a section and you cut a chunk halfway through it, some of your information can be lost from the retrieved chunks.
Can the chunks be linked? Not in this scheme; when you say chunks are linked, that leads to structural chunking, which will come later. In fixed-size chunking this is the main issue. So then why would anyone do fixed-size chunking? Can you think of an application where people do fixed-size chunking?
So one lesson all of us just learned: if your document has structure like sections, subsections, etc., never go with fixed-size chunking, because it might cut a section halfway. Fixed-size chunking is used where you want fast processing. Say you have millions of documents, or hundreds of thousands of documents, and you want a quick strategy without too much overhead. If you want the processing to be quick, you go ahead with fixed-size chunking, because it will just be very fast. If you are collecting information from Reddit or from Twitter, mostly the information will be disorganized: threads, comments, no clear structure, no clear subheadings, random messy information, but a huge amount of it. If you have random, messy, chaotic information which is huge in volume and you want to process it quickly, you can use fixed-size chunking with some overlap.
So these are the advantages and disadvantages of fixed-size chunking. Quick, fast processing is the advantage. The disadvantage is that it has semantic breaks and the context is lost. The strategy is best used in scenarios where documents are large and numerous and a quick segmentation is needed without requiring deep understanding of the context. For instance, if you are processing millions of web pages for indexing and can tolerate some loss of coherence in chunks, fixed-size chunking is a viable approach. Also remember that as the size of your chunk increases, your embedding model size needs to increase proportionately; keep in mind that's a trade-off with larger chunks.
One other use may be in streaming or sequential processing. Yeah, correct, as it's easy to handle streams of text without worrying about sentence breaks. Agreed. Another use is a book like The Fountainhead. Yeah, sure. Take a look at this book: it's a huge book which has no structure, no headings, no subheadings. For that kind of text it might be a good idea to go ahead with fixed-size chunking, and if you have a thousand such books, then definitely go ahead with fixed-size chunking. So let's say you're doing a project on Project Gutenberg and your task is to transcribe all the books and come up with some sort of a RAG system; it might be better to go ahead with fixed-size chunking.
Okay, that's the first strategy. The second strategy is what someone already mentioned in the chat. Again, I'm taking the same example which I showed you over here.
Now let's say you take a book from here, the same book we saw. The main issue with fixed-size chunking is that although it's fast, it does not retain anything about semantics, anything about meaning: nothing ties the content of one chunk together. Semantic chunking tries to solve this issue.
The way semantic chunking works is that first you have to define a level of organization; by that I mean, for example, the sentence level. If I want sentence-level organization, I take the first sentence. Let's say chunk number one is like a box. I take my first sentence and add it to the box. Then I take my second sentence and compare the embeddings of the two sentences: sentence one is converted into a vector embedding, sentence two is converted into a vector embedding, and I check whether the similarity score between these two vector embeddings is greater than a threshold, let's say 0.8. If it's greater than the threshold, I know that both sentences mean roughly the same thing, so I add the second sentence to the box as well, because it passes my similarity criterion. Then I go to the third sentence, embed it into a vector, and compare its cosine similarity with sentence number one. If it again passes the threshold, I add it to my chunk. I keep doing this for sentences which have good cosine similarity with my original sentence, and the moment I encounter a sentence whose cosine similarity is less than the threshold, I stop this chunk. That's my chunk one, done. Then I move on to chunk number two.
What this ensures is that every chunk has semantically similar information. Let's say the initial section is all about a drama happening within a family; I want this chunk to run until that drama finishes. Then whenever a certain question is asked, I will only retrieve the chunk whose semantic meaning matches. That's where semantic chunking actually has an advantage over fixed-size chunking: it takes the meaning into account, so I know that every chunk will have similarity in meaning.
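A minimal sketch of that loop, assuming a toy bag-of-words stand-in for the embedding model (a real system would call a sentence-embedding model here, and the threshold is just an illustrative value):

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Stand-in embedding: a bag-of-words count vector.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.8) -> list[list[str]]:
    chunks, current = [], []
    anchor = None  # embedding of the first sentence of the current chunk
    for s in sentences:
        e = embed(s)
        if current and cosine(anchor, e) >= threshold:
            current.append(s)           # similar enough: same chunk
        else:
            if current:
                chunks.append(current)  # close the previous chunk
            current, anchor = [s], e    # start a new chunk anchored on this sentence
    if current:
        chunks.append(current)
    return chunks

sents = ["the forest has tall trees",
         "the forest has old trees",
         "we filed the tax return today"]
print(semantic_chunks(sents, threshold=0.5))
```

The first two sentences land in one chunk, and the off-topic third sentence starts a new chunk; every comparison is against the chunk's first sentence, exactly as described above.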
Amit is asking: what chunking strategy is used in NotebookLM? NotebookLM definitely uses, I think, chunking which takes semantics into account, so maybe something similar to the semantic chunking we are looking at right now.
If the sentences have high similarity, isn't it better to drop one of them? Good question, but you will never be sure why the similarity is high, so you might lose information that way. Two sentences can mean something similar yet sit in different contexts, and you still want both. Say you're talking about forests: one sentence can be about trees in the forest and another about taking a trip to the forest, and their vector embeddings may still match. You wouldn't want to discard one in favor of the other, right?
So semantic chunking's main advantage is, of course, that it maintains coherence, and it is used in settings where the integrity of ideas is very important. For example, let's say you are listening to a parliament debate and you have collected the transcripts. You want to make a RAG system where you ask a question and identify what was discussed in the parliamentary debate. Now, I don't know if you have seen them, but parliamentary debates are among the most unstructured; they can get chaotic, they can get messy. But there is a flow of ideas in these debates: someone says something, someone else negates it. Usually we don't know how long that negation goes on, and we don't have a clear split, but the ideas are there, and clearly the ideas belong in buckets. For a RAG system over such transcripts, I would go with semantic chunking, because I would want to preserve the integrity of an idea in one chunk for as long as it is discussed.
This is very similar to educational transcripts. Say you watch a video, this same video even, and make a transcript out of it, and I talk about four or five things in the video, but I have not added timestamps or anything. How will you know the key things discussed in the video? The only way is to maintain the semantic integrity of the chunks. You cannot do fixed-size chunking here; you have to maintain semantic similarity. Then you will know: okay, this section talks about byte pair encoding, this section talks about the size of language models, this section talks about emergent properties, and so on. Otherwise there is no way to know from the transcript. So there are a number of cases where maintaining semantic similarity within one chunk plays to our advantage.
And again, the drawback: there is no free lunch. The main drawback is that this kind of strategy is complex and takes a lot of computational power, because you have to convert every single sentence into an embedding; that's not easy. Another major issue is that you have a hyperparameter here, the threshold. In fixed-size chunking you also have a hyperparameter, the number of tokens in a chunk, but there you at least have an idea of what 200 words means; here you have no clue what the threshold should be, it's completely vague. Another thing is inconsistent chunk sizes: some chunks might be very large, which can be an issue for our LLM context, and so on. Let me see if there are any questions in the chat.
"You took one sentence at a time, and then I lost how semantics is maintained. Do you scan the entire document?" Yeah. Basically, Samrat, it is done sentence by sentence. You take sentence number one and add it to a chunk. You keep adding subsequent sentences to the same chunk as long as their cosine similarity with the first sentence is above a certain value. The moment you encounter a sentence whose cosine similarity is not higher than the threshold, from that moment you start forming the second chunk, then the third chunk, and so on: you sequentially go through your entire text and keep forming chunks.
chunks. Does semantic chunking require
Does semantic chunking require premputing embeddings? Is it done at
premputing embeddings? Is it done at runtime? There are both options actually
runtime? There are both options actually uh
uh nowadays actually people have started
nowadays actually people have started using runtime querying so you can do
using runtime querying so you can do that during runtime but most rag
that during runtime but most rag applications I have seen they maintain
applications I have seen they maintain embeddings
What happens if the idea in chunk one comes up again somewhere later? That's a great question, actually. Yeah, then unfortunately that needs to be a separate chunk. But if the later idea is close to the first one, and you're retrieving four or five chunks, hopefully both of those chunks show up, right? Let's say you make a chunk which carries a certain idea, and that idea comes up again at the end of the document. If both ideas are very similar, both of those chunks will be retrieved in the end.
Wouldn't it be a better strategy to check the cosine similarity against all previous sentences? It would be, I agree, but the time also increases, right? If you want to check the semantic similarity with all the previous sentences, it's a bit time-consuming. You kind of hope that cosine similarity behaves roughly transitively: if the dot product of two vectors a and b is high, they point in similar directions, and if b dot c is also high, then b and c have similar angles too. So you can say that a and c will also be somewhat similar to each other.
Is it based on the assumption that the next line will be semantically similar to the previous one? Yeah, that is also true. That is the same thing which is exploited in, what's the word for it, the idea that neighbors usually carry similar meaning, right? Because you would not usually have random lines placed next to each other.
Samrat has said, "So should we have structure?" Yeah, correct. The level of organization which you mentioned can also be at the paragraph level in semantic chunking. If your sentences are not varying too much in meaning, you can have one big paragraph as one chunk. But then you will have to do structural chunking followed by semantic chunking, which is done; I'll come to that later. So that naturally brings us to, actually, first let me cover structural chunking. Structural chunking, according to me, is the most intuitive form of chunking, and it can be combined with semantic chunking as well.
Structural chunking is essentially like this: let's say you are considering a shareholder letter, right? If you take a look at the shareholder letter, the company is going to release it quarterly with the same kind of sections. Structural chunking takes advantage of that: we are going to split the report exactly at these section boundaries. The first chunk is going to be the letter to shareholders, the second chunk the introduction, the third chunk the company overview, the fourth chunk the financial statements, the fifth chunk the notes to the financial statements, and the sixth chunk the conclusion and outlook. That's it.
It's extremely simple, right? And believe it or not, in industrial problems structural chunking solves many issues, because, whether you are in the financial sector or the medical sector, if you are looking at a very specific RAG application, it is very likely that the document structure stays the same across multiple documents. For example, if you're building a conversational therapist RAG chatbot, the therapist might be making notes after each session in a specific format: an introduction, the key things discussed in the session, key takeaways. So as long as you know the structure of your documents, structural chunking is the most intuitive and the best thing you can do when you receive any problem, as long as the problem is reasonably structured. If it's messy, like what we have seen here, then of course it will not work. But if, let's say, you have hospital records, or stock price information in a specific tabular format or a specific structured format, you can always leverage that structure. The more you leverage the structure in your documents, the more grounded your retrieval augmented generation system is going to be, hands down, at all times. So the first strategy, which also comes naturally to me, is just to go to structure-level chunks, right?
But then, what are the issues with structural chunking? Can you think of any issues with structure-based chunking? In fact, for many of you, when you saw this document, the first thing which intuitively came to mind was structure based on sections and subsections; that's exactly structure-based chunking. What are the issues with this?
Yeah, the issue with this is that one chunk can be very large, because what if in one particular shareholder letter the introduction section is five times longer than in the others? Then the chunk size becomes very large, that chunk will be retrieved and passed to the language model, and it will be added to its context. So the context window of the language model again becomes very large, and we run into the same problem we set out to solve. So the advantage of the structured approach is that it's very good for documents whose data comes in a structured format, with sections, subsections, and so on. But its weakness is that it can produce chunks which are huge, which might increase the context length fed to the LLM, and that might again lead to more hallucinations.
How many of you actually know what metadata is? So why have I mentioned metadata over here, in structure-based chunking? Yeah, data about data is metadata, essentially. If I know that a chunk belongs to a particular structure, then when I store that chunk I also store its metadata: if I store an introduction chunk, I also store the fact that it is an introduction chunk, because I might refer to it later. So later, if I want to collect all the introductions, I can look up this metadata. Structure-based chunking therefore has this added advantage: since you know which chunk corresponds to which structure (for example, that this chunk corresponds to the company overview), you can store that as metadata and access it later downstream in your application if there is a need.
Now, Samrat had also asked whether, in semantic chunking, instead of having sentences as the unit, the level of organization could be at the paragraph level: one paragraph is added, then the semantic similarity with the next paragraph is compared. If you want to take that approach, you are essentially combining structural chunking with semantic chunking, because first you will use structural chunking to find the paragraphs, then you will use semantic chunking on top of that. So that's a combined approach, and normally, if one type of chunking fails, it's very common to combine two chunking methods.
So the main disadvantage of structural chunking which we saw, that some chunks can be too large, is solved by recursive chunking. Recursive chunking is an amazing chunking strategy because it's kind of the best of both worlds: it exploits the structure of documents, but it also makes sure that chunk sizes remain consistent.
How does it do it? Let's take a practical example. Say you are building a RAG chatbot which analyzes research papers. Now, you know that if you are analyzing research papers belonging to a particular journal, the structure is going to remain the same, right? If I'm looking at Patterns, they don't accept papers if the structure is too different. So I know each paper is going to have some kind of an introduction section for sure, a summary section, a results section, and finally a conclusion and discussion section. Then towards the end there will be references, and then it will end. You know this is the structure, but in some papers the results section can be much longer than in other papers. So you cannot just use plain structural chunking, even though the simple thing would be to use structural chunking and make each section one chunk.

What recursive chunking does is this: first, I make chunks based on my sections. So the introduction section will be one chunk, the results section will be one chunk, and so on. Then I look at each chunk's size against a maximum chunk size that I define; say the maximum chunk size is 500 tokens. If one of my chunks is greater than the maximum chunk size, I chunk it again. How will I chunk it again? I have to define one more level of chunking. So if the results section becomes too big, I'll chunk it at the paragraph level, and each of those paragraphs becomes a separate chunk. Then I go to the paragraph level and again check whether each chunk's token count exceeds my maximum size, and if some paragraph is still too large, I chunk it further to another level, which is my sentence level, and then I again check the number of tokens.

If you think about it, it's like a Russian-doll approach, right? You take the largest level of chunking, section-level chunking; within that you have paragraph-level chunking, which you apply only when a chunk exceeds the maximum chunk size; and within paragraph-level chunking, if a chunk is still too large, you do your final level, sentence-level chunking, where again you check whether the chunk size is greater than the maximum. Since we are applying different levels of chunking one below the other, this method is also called recursive chunking.
And the reason recursive chunking is the best of both worlds is that it preserves structure for sure, but it also makes sure that none of my chunks are too large, so it won't blow up my context size at all.
Let's see. What if we combine structural with semantic? This we already discussed. Is it possible to apply chunking strategies to images and videos in multimodal models? David, that's a great question. It is definitely possible to do that. Think of images and videos in terms of tokens, right? Just like I'm talking about tokens for text, images and videos also have tokens; they have different tokenization schemes, and the tokens will be at an image level. So there you can use similar strategies, but the strategies are a bit different from what we are currently covering.
than what we are currently covering. Can you define chunk size? Yeah. Yeah.
Can you define chunk size? Yeah. Yeah. So basically
So basically one hyperparameter we have to define
one hyperparameter we have to define here is that
here is that I will define a maximum chunk size
I will define a maximum chunk size myself.
myself. Before I do recursive chunking I have to
Before I do recursive chunking I have to define a maximum chunk size. Let's say
define a maximum chunk size. Let's say that's going to be 500 tokens.
So at every stage I'm going to compare whether my chunks are greater than this size or not. If I do section-level chunking, for each section chunk I check its number of tokens. If it's greater than 500, I do the second level of recursive chunking, which is the paragraph level. Then again, if a chunk is greater than 500, I do sentence-level chunking.
How is this different from fixed-size chunking? It's completely different, right? Because in fixed-size chunking, nowhere are we thinking about the structure. In fixed-size chunking I just start from the beginning, and if my fixed size is 50 tokens, I take the first 50 tokens as my first chunk, the next 50 as my second chunk, the next 50 as my third chunk. Here, what we are doing is structural chunking first: we break the document down into sections. If no section has too many characters or tokens, then our chunking stays at the section level; only if a section is larger than the token limit do we break it down further. Did everyone understand how this is different from fixed-size chunking? Recursive chunking is completely different from fixed-size chunking. There is no similarity at all between them, because in recursive chunking we are not fixing the exact chunk size we want; we are only specifying the maximum chunk size.
There is a question: how is the semantic link saved in this chunk? It's not. In structure-based chunking and recursive chunking, the semantic notion is not maintained at all.
Are there libraries to do this? Yes, definitely there are libraries. Both LangChain and LangGraph provide utilities for recursive and structural chunking. But today we are going to implement all of these chunking strategies from scratch in Google Colab. So, Amit, it is not actually maintaining semantics, because nowhere does it know what is mentioned in the section, subsection, paragraph, or sentence.
subsection or paragraph or sentence. So to those people who asked the
So to those people who asked the question about fixed size versus
question about fixed size versus recursive chunking is it clear how it is
recursive chunking is it clear how it is different? I think sankit asked and
different? I think sankit asked and Krishna also asked if that is your main
Krishna also asked if that is your main question it means there is some
question it means there is some conceptual gap.
conceptual gap. If there is no link, I might well as
If there is no link, I might well as look at the but there is a the the the
look at the but there is a the the the section is maintained, right?
section is maintained, right? So you understand the benefits of
So you understand the benefits of structural chunking
structural chunking the sections are maintained. So think of
the sections are maintained. So think of recursive chunking as a supererset of uh
recursive chunking as a supererset of uh structural chunking. Which means that if
structural chunking. Which means that if you understand the benefits of
you understand the benefits of structural chunking by default you
structural chunking by default you already understand the benefits of
already understand the benefits of recurs recursive chunking
recurs recursive chunking because it is structural chunking but it
because it is structural chunking but it is a bit more clever because it ensures
is a bit more clever because it ensures that each chunk is not greater than a
that each chunk is not greater than a particular size.