YouTube Transcript: #8 Data Preprocessing In Data Mining - 4 Steps |DM|
Video Transcript
Hello everyone, welcome back to my YouTube channel Trouble Free. In this video I'm going to explain data preprocessing in the subject of data mining: what data preprocessing is, what steps are involved, and the sub-steps inside those steps. This video is going to be a bit longer than usual. I don't want to split it into parts, simply because I have already created the thumbnails and would have to edit the numbering again, so excuse me for that, and let's get into the video.
Data preprocessing is nothing but the process of transforming, or converting, raw data into an understandable format. Suppose you have the data mining marks of 60 students. The names of the students are listed separately as A, B, C and so on, and the marks are listed separately as 19, 91 and so on up to 100, all in some random format. Can you tell which student got how many marks? No. That is what raw data means. An understandable format means you arrange the data as a table, a chart, or a graph, whatever it is, so that the data can be understood. So the process of converting raw data into an understandable format is called data preprocessing.
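As a rough illustration of raw data versus an understandable format (my own sketch, not from the video, using hypothetical names and marks), here is how two separate lists can be turned into a single table with pandas:

```python
import pandas as pd

# Hypothetical raw data: names and marks stored separately, hard to relate.
names = ["A", "B", "C", "D"]
marks = [19, 91, 64, 100]

# Combine them into one table so each mark is tied to a student.
df = pd.DataFrame({"student": names, "marks": marks})
print(df)
```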
In data preprocessing we actually have four steps: data cleaning, data integration, data reduction, and data transformation.
Now let us learn about each step in detail. The first step is data cleaning. Data cleaning is the process of removing incorrect, incomplete, or inaccurate data, and it also replaces missing data. If there is any incorrect, incomplete, inaccurate, or inconsistent data, or any error in the data, it is removed, and the missing values are replaced; that is, if there are any empty spaces, values are filled into those spaces. That is data cleaning. In data cleaning we actually have two parts: handling missing values and handling noisy data. Missing values are nothing but empty spaces; noisy data is the incorrect, incomplete, inaccurate, or erroneous data. So how do we handle missing values and how do we handle noisy data? Let me start with handling missing values.
You can handle missing values in several ways. You can replace a missing value with NA, that is, not applicable. Or you can replace it with the mean value; this is used when the data is normally distributed. The mean here means you take all the remaining data apart from the missing entries, calculate its mean, and fill the missing values with that mean. You can also replace missing values with the median; the median is used when the distribution is not normal. So if the data is normally distributed, replace with the mean; if it is not, replace with the median. Sometimes you can also replace them with the most probable value, that is, the value with the highest chance of occurring. And one thing I should have said at the beginning but forgot: missing values can be filled in two ways, manual and automatic. Manual means you identify the empty spaces yourself and fill in the data; this works fine only for small data sets. Automatic is obviously more efficient than manual and suits large data sets. With that we are done with handling missing values.
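As an illustration (not from the video), here is a minimal pandas sketch of mean and median imputation on a hypothetical marks column; which one you pick would depend on how the data is distributed:

```python
import pandas as pd
import numpy as np

# Hypothetical marks column with missing values (NaN = empty space).
df = pd.DataFrame({"marks": [19, 91, np.nan, 64, 100, np.nan]})

# Mean imputation: suited to roughly normally distributed data.
df["marks_mean_filled"] = df["marks"].fillna(df["marks"].mean())

# Median imputation: more robust when the data is skewed (non-normal).
df["marks_median_filled"] = df["marks"].fillna(df["marks"].median())

print(df)
```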
Now we have handling noisy data. Noisy data is nothing but inconsistent or erroneous data, and we have three methods to handle it: binning, regression, and clustering. Binning is the new one for you; regression and clustering are not. In binning, first you sort the data, error values included. Once the data is sorted, you store it into bins: you create bins and distribute the sorted data among them. Once the data is in the bins, you perform smoothing. Smoothing is nothing but removing or replacing the error values, and it can be done in three ways: by bin mean, by bin median, or by bin boundaries. In smoothing by bin mean, the values present in a bin are replaced by the mean value of that bin. Suppose 2, 3, 4, 5 are in a bin and 4 is the error value. The mean of the bin is (2 + 3 + 4 + 5) / 4 = 14 / 4 = 3.5, so all the values in the bin are replaced with 3.5. In smoothing by bin median, you replace the values with the median of the bin. You know what the median is from mean, median, mode in statistics, but still: when you arrange the data in a particular order, ascending or descending, whichever value sits in the middle of the ordered data set is the median. For example, with the sorted data 1, 2, 3, 4, 5 the median is 3, because 3 is in the middle of the list with two values on either side, so you replace the bin values with 3. Next comes smoothing by bin boundaries: you replace the values with the bin's minimum and maximum values, each value going to whichever boundary is closer. That's simple. So that is binning: first you sort the data, then you store the sorted data into bins, and then you apply any of these smoothing methods.
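Here is a rough sketch of my own (not from the video) of equal-size binning with smoothing by bin means and by bin boundaries, using plain Python and hypothetical values:

```python
# Hypothetical sorted data, including a value that looks noisy in its bin.
data = sorted([4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34, 36])

bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value in a bin becomes that bin's mean.
smoothed_by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the closer of min/max of its bin.
smoothed_by_boundary = [
    [b[0] if abs(v - b[0]) <= abs(v - b[-1]) else b[-1] for v in b]
    for b in bins
]

print(bins)
print(smoothed_by_mean)
print(smoothed_by_boundary)
```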
Next, regression. Regression is nothing but numerical prediction of data. Everything about regression you will learn in the coming videos, so in an exam answer on this data preprocessing question you can just write "numerical prediction of data" and leave it there. Next comes clustering, which I have also already explained: similar data items are grouped into one cluster, and whatever items are dissimilar are thrown out of the cluster. Those dissimilar items are nothing but the error items, so you can easily remove them. That is clustering. So that is data cleaning: it has two parts, handling missing data and handling noisy data, where noisy data means error data. Handling missing data has no sub-categories, but handling noisy data has three methods: binning, regression, and clustering.
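As one way to picture the clustering idea (a hedged sketch of my own, not the video's method), points that end up alone in a tiny cluster can be treated as the dissimilar, noisy items, here using scikit-learn's KMeans on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D data where 200 does not belong to any natural group.
X = np.array([[10], [11], [12], [13], [50], [51], [52], [53], [200]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Treat points that land in very small clusters as the dissimilar (noisy) items.
sizes = np.bincount(labels)
noise_mask = sizes[labels] == 1
print("noisy points:", X[noise_mask].ravel())   # expected: [200]
```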
Next is data integration. I said the video was going to be long; data cleaning alone took six or seven minutes, but I don't think the remaining steps will take as much time, so let's go with the flow. Data integration is nothing but integrating data from multiple heterogeneous sources into a single data set. Heterogeneous sources means different types of sources and different types of data; homogeneous means everything is uniform and the same, whereas heterogeneous means you can combine numbers, words, alphabets, symbols, whatever you want. So data from heterogeneous sources is combined into a single data set; that is data integration. It can be done in two ways: tight coupling and loose coupling. In tight coupling, the data is combined together into one physical location. Suppose you have data source A and data source B; you combine A and B and store the result in a separate physical location, C. After that, if you want to access A or B separately, you cannot; once the data has been integrated in a tightly coupled way, you no longer have separate access to the individual sources. In loose coupling, the data is not actually integrated: only an interface is created, and the data is combined and accessed through that interface, a bit like a cloud service. Because you are not physically combining the data, in loose coupling you can access the combined data and still access the individual sources. It happens dynamically: if you issue a mining query, the data is combined then and there based on your query and the result is returned. That is data integration; the word integration itself says you are combining something.
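To make the two coupling styles concrete, here is a small pandas sketch of my own (hypothetical sources and columns, not from the video): tight coupling materialises one combined data set up front, while loose coupling keeps the sources separate and combines them only when a query arrives:

```python
import pandas as pd

# Two hypothetical heterogeneous sources describing the same students.
source_a = pd.DataFrame({"student": ["A", "B", "C"], "marks": [19, 91, 64]})
source_b = pd.DataFrame({"student": ["A", "B", "C"], "grade": ["F", "A", "B"]})

# Tight coupling: merge once and store the combined data set "C".
combined = source_a.merge(source_b, on="student")

# Loose coupling: keep the sources separate and combine on demand per query.
def query(student_id):
    row_a = source_a[source_a["student"] == student_id]
    row_b = source_b[source_b["student"] == student_id]
    return row_a.merge(row_b, on="student")

print(combined)
print(query("B"))
```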
After data integration we have data reduction. What is data reduction? If you have a very large amount of data, analysing it becomes hard. Is it easier to search among 10 people or among a thousand? Obviously 10. If the volume of data is very high, performance will be low. For that reason, in data reduction the volume of the data is reduced in order to make analysis easier. This can be done in two ways: lossy and lossless. Lossy means some of the data will be lost; lossless means no data is lost, everything stays as it is, but the data is compressed, like the online PDF compressors we use when a website only accepts uploads of at most 1 or 2 MB. In the same way, the volume of the data is reduced to make analysis easier. We have several methods in data reduction, and the first one is dimensionality reduction.
In dimensionality reduction, the number of input variables in the data set is reduced, so the data associated with those input variables is automatically reduced as well and performance increases. If there is a large number of input variables, there will obviously also be more dependencies, meaning one variable depends on another, and more dependencies means more data. Once you reduce the input variables, the dependencies are reduced and the data along with them. That is what dimensionality reduction is.
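One common way to reduce the number of input variables is principal component analysis; the video doesn't name a specific technique, so this scikit-learn sketch with made-up data is only an illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set with 5 input variables per record.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Keep only 2 derived variables that capture most of the variation.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_)
```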
Next is data cube aggregation. In data cube aggregation you combine the raw data, that is, the individual pieces of data, to construct a data cube; I have already explained what a data cube is in the first or second video. So you create a data cube from whatever data you have. How is the data reduced here? Any redundant data, that is duplicate or repeated data, or any noisy data that is present is removed, and a unique data cube is generated. That is data cube aggregation.
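As a loose illustration of rolling individual records up into cube-style cells (hypothetical sales data, my own example), a pandas pivot table sums the detail rows into one value per (year, region) combination, so the detail rows no longer need to be stored:

```python
import pandas as pd

# Hypothetical individual sales records (the raw data).
sales = pd.DataFrame({
    "year":   [2022, 2022, 2022, 2023, 2023, 2023],
    "region": ["East", "East", "West", "East", "West", "West"],
    "amount": [100, 150, 200, 120, 180, 220],
})

# Aggregate into one cell per (year, region): a small slice of a data cube.
cube = sales.pivot_table(index="year", columns="region",
                         values="amount", aggfunc="sum")
print(cube)
```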
Next comes attribute subset selection. Here you have many attributes; attributes are nothing but columns. In a table, a data warehouse, or a data mining system you will have many columns associated with a single table. Only the highly relevant attributes, the ones that are closely related to the data or highly important, should be used; the others should be discarded, that is, deleted and removed from the database. In this way also the data can be reduced. This is what is called attribute subset selection.
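A hedged sketch of keeping only the most relevant columns (hypothetical data; the video doesn't prescribe a particular criterion), using scikit-learn's SelectKBest to retain the two attributes most related to the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data set: 4 attributes (columns), only some relevant to the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # label depends on columns 0 and 2 only

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("kept columns:", selector.get_support(indices=True))   # expected: [0 2]
```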
Next is numerosity reduction, the fourth method within data reduction. Here we store only a model of the data instead of the entire data; instead of keeping everything, we keep only a model, that is, a sample of the data, and run our tests or operations on that. For example, in our college, during lab exams or project submissions they collect 60 to 65 records per class or section, depending on its strength. Will they store all the records? No; they keep only five or six records for reference for the next year or for inspection, not everything. In the same way, instead of storing the entire data, only a sample or model of the data is stored. That is numerosity reduction. With this we have completed data reduction.
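One simple form of numerosity reduction is random sampling; here is a minimal pandas sketch with hypothetical records, keeping a small sample in place of the full data:

```python
import pandas as pd

# Hypothetical full set of 60 student records.
records = pd.DataFrame({"student_id": range(1, 61),
                        "marks": [(i * 7) % 100 for i in range(60)]})

# Keep only a 10% random sample as the stored "model" of the data.
sample = records.sample(frac=0.1, random_state=0)
print(len(records), "records reduced to", len(sample))
```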
Next is data transformation. You already roughly know what data transformation is: you transform the data into an appropriate form that is suitable for the data mining process. You cannot just run data mining operations on raw data in whatever shape it happens to be, for example plain text like "abc" or comma-separated values; it has to be in a suitable format, and that is what the data transformation step does. Here too we have four methods. The first is normalization. Normalization is done in order to scale the data values into a specified range, for example 0 to 1 or -1 to +1. This is not applicable to everything: sometimes you have names, sometimes sections, different kinds of values, so it is not always possible to scale the data. Whenever it is possible and you want the data in a specified range, you can go for normalization.
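As an illustration (hypothetical marks, my own example), min-max scaling is one standard way to normalize values into the 0-to-1 range:

```python
from sklearn.preprocessing import MinMaxScaler

# Hypothetical marks to be scaled into the range [0, 1].
marks = [[19], [91], [64], [100], [45]]

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(marks)
print(scaled.ravel())   # 19 -> 0.0, 100 -> 1.0, others in between
```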
After normalization we have attribute selection: you create new attributes by using the older ones; from the existing attributes you construct new attributes. That is attribute selection, simple.
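A small pandas sketch of my own (hypothetical columns) deriving a new attribute from existing ones:

```python
import pandas as pd

# Hypothetical existing attributes: internal and external marks.
df = pd.DataFrame({"internal": [18, 25, 30], "external": [45, 60, 70]})

# New attribute built from the older ones.
df["total"] = df["internal"] + df["external"]
print(df)
```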
Next comes discretization. In discretization, raw values are replaced by intervals. Suppose you have values like 10, 12, 13, 14, 21, 22, 34, 36. Instead of the raw values 10, 12, 13, 14, you use the interval 10 to 20; the next values fall into 20 to 30, then 30 to 40, and so on. So instead of raw values, you generate intervals for them.
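Here is a minimal pandas sketch (values taken from the example above) that replaces raw values with interval labels using pd.cut:

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 21, 22, 34, 36])

# Replace raw values with the interval each one falls into.
intervals = pd.cut(values, bins=[10, 20, 30, 40], include_lowest=True)
print(intervals)
```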
Next is concept hierarchy generation: you convert attributes from a low level to a high level. For example, take city as an attribute; from city you generate country. City is an attribute and country is also an attribute, but city is a low-level attribute whereas country is a high-level attribute; that is the difference between them. So concept hierarchy generation means converting low-level attributes into high-level attributes. That's all.
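A minimal sketch of my own (hypothetical cities) that rolls the low-level city attribute up to the high-level country attribute:

```python
import pandas as pd

# Hypothetical low-level attribute: city.
df = pd.DataFrame({"city": ["Hyderabad", "Mumbai", "Paris", "Lyon"]})

# Concept hierarchy: map each city up to its country.
city_to_country = {"Hyderabad": "India", "Mumbai": "India",
                   "Paris": "France", "Lyon": "France"}
df["country"] = df["city"].map(city_to_country)
print(df)
```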
That is all about data preprocessing, so I'm done with the video. I know the video is long, and it is hard for you to remember all these side headings, but I have tried to make it as simple as I can. That's all for this video; let's meet in the next video with another topic. Till then, if you still have any doubts, let me know in the comment section and I'll be happy to clear them if I can. All the best for your exams, and thanks for watching the video till the end. I have also started a new channel about study-abroad content; if you're interested, have a look, I'll put the link to the channel in the description. Let's meet in the next video.