0:08 hello everyone welcome back to my
0:09 youtube channel trouble free in this
0:11 video i'm going to explain data pre-processing
0:15 in the subject of data mining. so basically,
0:18 what is data pre-processing what are the
0:20 steps involved in data pre-processing
0:23 and within those steps we also have some sub-steps — everything i am going to explain
0:26 in this video. so this video is going to
0:28 be a bit longer. i don't want to
0:31 divide it into parts, just because
0:32 i have already created the
0:35 thumbnails, so i would have to edit the numbers and the
0:39 thumbnails and all again — i'm being a bit
0:42 lazy there, excuse me for that. so let's
0:44 get into the video now
0:47 so data preprocessing is nothing but
0:50 the process of transforming or you can
0:53 say converting
0:56 raw data into an understandable format
1:00 okay, suppose you have
1:03 the marks of 60 students in the data mining subject.
1:07 then the names of the students will
1:11 be like a, b, c and so on,
1:13 and the marks will be listed separately,
1:16 like 19, 91 and so on up to 100. so the names of
1:18 the students are in one place, the marks of
1:20 the students are in another, all of it
1:22 in some random format. then can you
1:24 tell which person got how many
1:27 marks? no, right? so that is what raw data
1:29 means. an understandable format is
1:31 nothing but: you arrange it as a
1:33 table, or as a chart, or
1:35 as a graph, whatever it is, so
1:37 that the data can be understood. okay,
1:39 that is what an understandable format is. so
1:41 the process of converting the raw data
1:43 into an understandable format is called
1:45 data preprocessing got it in data
1:46 pre-processing we actually have four
1:49 steps okay data cleaning data
1:52 integration data reduction and data
1:53 transformation these are the four steps
1:56 we have in data pre-processing
1:59 got it so let us in detail learn about
2:02 each and every step now okay first step
2:04 is the data cleaning so in data cleaning
2:07 what will happen is — it is the process of
2:12 removing incorrect, incomplete,
2:15 inaccurate data, and it also replaces
2:16 the missing data. if there is any
2:18 incorrect data, or any
2:20 incomplete data, or inaccurate data, or
2:22 inconsistent data, or any error in the
2:24 data, whatever it is,
2:28 that data can be removed, and also it
2:31 will replace the missing values — that is,
2:33 in case of missing values, if there are
2:36 any empty spaces, in those spaces you can
2:39 fill in values. got it, that is about data cleaning.
2:42 okay so in data cleaning we have two
2:45 things actually okay as i already said
2:48 you will have handling missing
2:51 values and handling
2:54 noisy data. so missing values is nothing
2:56 but empty spaces, and noisy data is nothing
2:58 but this incorrect or incomplete or
3:00 inaccurate or error data, whatever it is,
3:03 will come under noisy. so how to handle
3:05 missing values, how to handle noisy data —
3:06 i'll tell you. first,
3:09 handling missing values. right, so in
3:12 case of handling missing values you can
3:14 do it in many ways — like, you can
3:16 replace it with NA, that is, not
3:18 applicable; you can write NA in
3:20 place of the missing value,
3:22 or you can replace it with the mean
3:24 value — in case of a normal
3:26 distribution you can use this; if
3:29 the data is normally distributed,
3:30 in that case you can replace with the
3:32 mean value. okay, the mean value in the
3:35 sense: whatever remaining data is there
3:37 apart from the missing data, over all that data
3:39 you have to calculate the mean, and with
3:42 that mean you can replace. got it. next,
3:44 median values — you can replace with the
3:46 median values as well. when can you
3:48 replace with median values? in case of a
3:51 non-normal distribution: if the data is
3:53 not normally distributed (skewed), in that case you
3:54 can replace with the median; if it is normal,
3:57 you can replace it with the mean. got it. this
3:58 is about handling missing values, and we
4:00 have some more, don't worry. sometimes you
4:01 can also replace them with the most
4:04 probable value, that is, the value
4:07 which is most likely to occur —
4:10 there is a high chance for that value to
4:13 occur. okay. and missing values can
4:15 actually be filled in two ways — i
4:16 should have said it in the beginning but i
4:19 forgot — manual and automatic. manual in the
4:21 sense you can use it only for small data,
4:23 that is, manually you will be
4:25 identifying the empty spaces and
4:26 in those empty spaces you will be filling
4:28 the data, but this will work fine only
4:31 for small data sets. got it. and next is
4:33 automatic — automatic is more
4:34 efficient when compared to manual,
4:37 obviously, and it suits large data
4:40 sets. got it, so with this we are done
4:42 with handling missing values.
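For illustration (not from the video), here is a minimal sketch of mean and median imputation, assuming pandas; the column name "marks" and the sample values are hypothetical:

```python
# a minimal sketch of replacing missing values, assuming pandas;
# the column name "marks" and the values are hypothetical
import pandas as pd
import numpy as np

marks = pd.DataFrame({"marks": [19, 91, np.nan, 45, np.nan, 100]})

# if the data is roughly normally distributed, replace with the mean
mean_filled = marks["marks"].fillna(marks["marks"].mean())

# if the distribution is skewed (non-normal), replace with the median
median_filled = marks["marks"].fillna(marks["marks"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```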
4:44 now we have handling noisy data. in
4:46 handling noisy data —
4:48 noisy data is nothing but inconsistent
4:49 or error data — we have several methods,
4:52 we actually have three methods.
4:54 first is binning —
4:56 binning is a new one for you — and next are
4:58 regression and clustering, they are not
5:00 new for you. okay, so in binning, what you
5:03 will do is, first you will sort the data
5:06 — along with the error values, you will
5:08 be sorting the data.
5:10 once the data is sorted, you will be
5:12 storing that data into bins. okay, bins —
5:14 you will be creating bins, and
5:16 the sorted data
5:17 which is there, you will be
5:19 storing into the bins. once you
5:21 store the data into bins, what you will
5:23 do is,
5:25 you will be doing the smoothing process
5:26 smoothing process is nothing but
5:29 removing the error values or replacing
5:31 the error values got it and this
5:33 smoothing process also can be done in
5:35 three ways mean median boundary okay
5:37 i'll tell you what is mean median
5:39 boundary now don't worry so first in
5:42 case of mean what you will do is
5:44 the values which are present in the bin
5:46 are replaced by the mean value
5:49 of the bin. suppose 2, 3,
5:52 4, 5 are there in the bin, and
5:54 4 is the error value. so what is the
5:57 average of this bin? 2 plus 3
5:59 plus 4 plus 5 —
6:03 that is 14, and 14 divided by 4 will
6:04 give you
6:06 3.5, right? so that
6:08 3.5 will be replaced
6:11 in place of all these values.
6:13 okay, like that you will be replacing
6:15 with the mean in case of
6:17 smoothing by bin mean. in case of the
6:20 smoothing by bin median method, you will
6:22 be replacing with the help of the median.
6:24 you know what the median is, right —
6:26 mean, median, mode
6:27 from statistics, we know — but still i'll
6:29 tell you. when you arrange the data in
6:39 a particular order, in
6:41 ascending order or descending order, then
6:43 whichever value is in the middle of the
6:46 ordered data set, that is called the
6:49 median. like, we have 1, 2, 3, 4,
6:51 5 — the data is sorted — and here
6:53 3 is the median, because 3 is in
6:56 the middle of the list: 1, 2 before it and
6:58 4, 5 after it, two places on each side, it is in
6:59 the middle, right? so that is why you
7:03 replace with 3. okay, that is about
7:05 bin median. next comes bin boundary.
7:07 boundary means the min
7:09 and max values of the bin are its boundaries — you replace each value
7:11 with the closest of the min and max values. that's
7:13 simple. okay, this is about binning: first
7:15 you sort the data, you store
7:16 that sorted data into bins, and then you
7:18 apply smoothing — any of these
7:20 smoothing methods you can apply. got it.
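As a concrete illustration of the binning steps just described — not from the video — here is a minimal sketch in plain Python; the twelve sample values are a textbook-style example chosen only so the bins are easy to see:

```python
# a minimal sketch of binning and smoothing, in plain Python;
# the sample values are made up for illustration
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# step 1: sort the data (it already is), step 2: store it into equal-depth bins
bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# step 3a: smoothing by bin means — every value becomes the mean of its bin
by_mean = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# step 3b: smoothing by bin boundaries — every value becomes the closest of
# the bin's min and max
by_boundary = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)         # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_mean)      # [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.25, ...]]
print(by_boundary)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```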
7:22 next comes regression. regression is nothing but
7:24 numerical prediction of data. so what
7:26 regression is, everything about regression,
7:27 you will be learning in the
7:29 coming videos, don't worry. you can just
7:30 write 'numerical prediction of data' and
7:32 leave it in the exam, for this
7:34 data preprocessing question. next comes
7:36 clustering. clustering also i have
7:37 already explained:
7:39 similar data items, similar things,
7:42 are grouped into one cluster, and
7:43 whatever dissimilar items are there, they
7:45 fall outside the cluster, so
7:47 the dissimilar items are nothing but the
7:49 error items, and you can easily
7:51 remove those error items. got it, this is
7:53 about clustering.
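For illustration only, here is a minimal sketch of spotting error items with clustering, assuming scikit-learn's DBSCAN is available; the one-dimensional points and parameters are made up:

```python
# a minimal sketch of flagging error values with clustering,
# assuming scikit-learn; the points are made-up examples
import numpy as np
from sklearn.cluster import DBSCAN

# most points sit close together; 500 is an obvious error value
values = np.array([[21], [22], [23], [24], [25], [500]])

# points that do not fit into any dense cluster get the label -1 (noise)
labels = DBSCAN(eps=3, min_samples=2).fit(values).labels_

cleaned = values[labels != -1]
print(labels)          # e.g. [ 0  0  0  0  0 -1]
print(cleaned.ravel()) # [21 22 23 24 25]
```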
7:55 okay, so this is about data cleaning. in data cleaning we
7:57 have two things, handling missing data
7:59 and handling noisy data — noisy data is
8:01 nothing but the error data. okay, in
8:02 handling missing data we don't have any
8:04 sub-categories, but in case of handling
8:06 noisy data we again have three things:
8:09 binning, regression, clustering. okay, next
8:10 is data integration. i said the video is
8:12 going to be very long — data cleaning
8:14 itself took about seven or eight
8:16 minutes — but
8:18 for data integration and the next coming ones,
8:22 i don't think they'll take more time.
8:24 okay, that's okay, let's go with the flow.
8:26 next, data integration. right, so data
8:28 integration is nothing but
8:31 you will be integrating the data into a
8:34 single data set from the multiple
8:36 sources. multiple heterogeneous sources
8:38 is nothing but different
8:39 types of sources, different
8:41 types of data you'll take. okay,
8:43 homogeneous means everything is uniform,
8:45 the same, right? heterogeneous means different
8:46 types of data — you can take
8:48 numbers, you can combine numbers,
8:52 words, alphabets, symbols, whatever
8:54 it is you want, you can combine.
8:56 heterogeneous sources of data are
8:58 combined into a single data set got it
9:01 this is data integration okay and in
9:03 data integration also it can be done in
9:04 two ways
9:06 okay that is tight coupling and loose
9:08 coupling so what do you mean by tight
9:10 coupling data is combined together into
9:12 a physical location that is suppose you
9:15 have data source a and data source b now
9:17 what happens is you will be combining
9:19 both a and b and you will be storing it
9:21 into a separate physical location called
9:24 as c. that is, if you then want
9:25 to go back to a or b separately — if
9:28 you want access to a or b — you cannot
9:31 do that in case of tight coupling. okay,
9:32 once the data is integrated, once you
9:34 have combined the data,
9:36 you cannot again separately have access
9:39 to the individual data, in case of tight coupling,
9:41 whereas in case of loose coupling what
9:42 happens is the data is actually not
9:45 integrated okay only an interface will
9:48 be created and data is combined through
9:50 that interface, and the data is also accessed through that
9:52 interface —
9:54 like a cloud kind of thing, you can
9:57 imagine — the data is not actually
9:59 combined so here what you can do in case
10:01 of loose coupling is you can have access
10:03 to the combined data you can have access
10:05 to the individual data as well because
10:07 you are not physically combining the
10:09 data you are combining the data only
10:11 through an interface. got it. so,
10:15 we can say it happens dynamically —
10:16 you know, if you are
10:19 asking some mining query, then
10:20 based on your query it will, then and
10:22 there itself, combine the data and give
10:24 you the result. okay, like that. okay, that
10:26 is about data integration — the word
10:28 integration itself says you're combining
10:30 something.
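As an illustration of the tight versus loose coupling idea — not from the video — here is a minimal sketch assuming pandas; the tables, column names and the query function are hypothetical:

```python
# a minimal sketch of integrating two sources into one data set,
# assuming pandas; the tables and column names are hypothetical
import pandas as pd

# source a holds names, source b holds marks, kept separately
source_a = pd.DataFrame({"roll_no": [1, 2, 3], "name": ["a", "b", "c"]})
source_b = pd.DataFrame({"roll_no": [1, 2, 3], "marks": [19, 91, 100]})

# "tight coupling" style: physically combine them into a new data set c
combined = source_a.merge(source_b, on="roll_no")
print(combined)

# "loose coupling" style: keep the sources separate and combine only when a
# query comes in, for example "marks of student b"
def query_marks(name: str) -> int:
    roll = source_a.loc[source_a["name"] == name, "roll_no"].iloc[0]
    return int(source_b.loc[source_b["roll_no"] == roll, "marks"].iloc[0])

print(query_marks("b"))  # 91
```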
10:32 next, after data integration, we
10:35 have data reduction. so what do you mean
10:36 by data reduction? see, actually, if you
10:37 are having a very large amount of data,
10:40 then the
10:42 analysis of the data will become hard,
10:43 right? so —
10:46 is searching among 10 members easier, or
10:47 is searching among a thousand members easier?
10:50 obviously 10 members, right? so if the
10:52 volume of data is very high, then the
10:54 performance also will be low. so for that reason,
10:57 in data reduction, what you will be doing
11:01 is, the volume of the data is reduced in
11:03 order to make the analysis easier. okay,
11:05 so data is reduced and you can do it in
11:08 two ways you will do the lossy and the
11:10 lossless. lossy means some of the data
11:12 will be lost; lossless means
11:14 no data will be
11:16 lost, everything will be as it is, but the
11:18 data will be compressed — like we use
11:20 online compressors, right? a pdf compressor
11:23 we will be using sometimes, if
11:25 a website where we are uploading something
11:27 only accepts a maximum of 2 mb or 1 mb —
11:30 in that case we'll be using
11:32 compressors, right? so in the same way,
11:34 volume of the data will be reduced in
11:36 order to make the analysis easier okay
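For illustration, here is a minimal sketch contrasting lossless compression with a lossy reduction, using Python's standard zlib module; the data is made up:

```python
# a minimal sketch of lossless vs lossy reduction, in plain Python;
# the values are made up for illustration
import zlib

raw = ",".join(str(v) for v in range(1000)).encode()

# lossless: the data shrinks but can be recovered exactly
packed = zlib.compress(raw)
assert zlib.decompress(packed) == raw
print(len(raw), "->", len(packed))

# lossy: keep only every 10th value — smaller, but some data is gone for good
values = list(range(1000))
lossy = values[::10]
print(len(values), "->", len(lossy))
```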
11:38 and here, in data reduction, we have several
11:40 methods — i'll tell you what are
11:42 those first one is the dimensionality
11:44 reduction in dimensionality reduction
11:46 what happens is,
11:48 it will reduce the number of input variables —
11:52 okay, the number of input variables in
11:55 the data set is reduced, so that
11:57 automatically the data which
11:59 is associated with those input
12:03 variables is also reduced, and
12:05 performance will be increased. if there
12:06 are a large number of input variables,
12:08 obviously the dependencies also will be more,
12:10 right? dependencies in the sense of one
12:12 variable depending on another variable —
12:14 more dependencies means more
12:17 data. so once you reduce the
12:20 input variables, then the dependencies
12:22 will be reduced, and along with that the data
12:25 will also be reduced. that is what
12:28 dimensionality reduction is.
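As one common way to do this — not necessarily what the video has in mind — here is a minimal sketch of dimensionality reduction with PCA, assuming scikit-learn; the random data set is made up:

```python
# a minimal sketch of dimensionality reduction with PCA,
# assuming scikit-learn; the data set is a random made-up example
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 rows, 10 input variables

pca = PCA(n_components=3)               # keep only 3 derived variables
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (100, 10) -> (100, 3)
```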
12:31 next comes data cube aggregation. in data cube
12:33 aggregation, what you will do is, you will
12:36 be combining the raw data that is
12:38 individual pieces of data
12:40 will be combined together to construct a
12:43 data cube. okay, i've already explained
12:44 what a data cube is, in the first or
12:46 second video i guess. so you will be
12:49 creating a data cube —
12:51 whatever data is there, with that data
12:52 only we are creating the data cube, right?
12:55 and then how is data reduced here?
12:57 the redundant data, that is, the duplicate
12:59 data, repeating data, or the noisy data,
13:01 if it is present,
13:03 will be removed from the
13:06 data, and a unique data cube will be
13:08 generated. that is about data cube
13:10 aggregation.
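For illustration (not from the video), a minimal sketch of rolling detailed data up into an aggregate, assuming pandas; the quarterly sales figures are hypothetical:

```python
# a minimal sketch of aggregation as used in a data cube,
# assuming pandas; the quarterly sales figures are made up
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["q1", "q2", "q3", "q4", "q1", "q2", "q3", "q4"],
    "amount":  [200, 250, 180, 300, 220, 270, 210, 330],
})

# roll the quarterly detail up to yearly totals: 8 rows become 2
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)
```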
13:13 next comes attribute subset selection. so here, what happens is,
13:15 you will have so many attributes
13:16 attributes are nothing but the columns
13:18 okay? you'll have so many columns in the
13:20 data,
13:21 in a table, in a data warehouse, or
13:23 in a data mining system — you will have so
13:25 many columns
13:27 associated with a single table, right?
13:30 attributes are nothing but columns.
13:32 okay, so highly relevant attributes
13:34 should be used, others should be
13:35 discarded — that is, others should be
13:38 deleted, removed.
13:40 so whatever attributes are highly relevant, in the
13:42 sense related to the data, or whichever
13:45 are highly important, only that data
13:47 should be used; other data has to be
13:50 removed from the database. got it. so in
13:52 this way also data can be reduced. this
13:54 is what is called attribute subset
13:56 selection.
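A minimal sketch of attribute subset selection, assuming pandas; the table and the choice of which columns count as "relevant" are purely hypothetical:

```python
# a minimal sketch of attribute subset selection, assuming pandas;
# the table and the choice of relevant columns are hypothetical
import pandas as pd

students = pd.DataFrame({
    "roll_no": [1, 2, 3],
    "name": ["a", "b", "c"],
    "marks": [19, 91, 100],
    "favourite_colour": ["red", "blue", "green"],  # irrelevant for mining marks
})

# keep only the highly relevant attributes, discard the rest
relevant = students[["roll_no", "name", "marks"]]
print(relevant.columns.tolist())
```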
13:59 next is numerosity reduction,
14:02 the fourth method in data reduction.
14:03 here we store only a model of the data instead
14:06 of the entire data — instead of storing
14:08 the entire data, only a model, that is, a
14:11 sample of the data, so that we can test on
14:12 this sample or do any operations on
14:14 it. you know, for
14:17 example, in our college, during lab
14:18 exams or during project submissions, what
14:20 do they do? they get so many
14:21 records, 60 to 65 records,
14:23 per class or per section, depending on the strength of
14:24 the section. will they store all the
14:27 records? no, right? they will store only
14:28 five to six records, just for reference
14:30 for the next year or for inspection or
14:32 so — they will not store everything. so
14:34 here also, instead of storing the entire
14:36 data, they will store only a sample or a
14:38 model of the data.
14:41 got it, that is about numerosity reduction.
14:43 so with this we have completed data reduction.
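For illustration, a minimal sketch of numerosity reduction by sampling, assuming pandas; the 60-record table mirrors the class-records example above, but the values are made up:

```python
# a minimal sketch of numerosity reduction by sampling, assuming pandas;
# the 60-record table is made up for illustration
import pandas as pd

records = pd.DataFrame({"roll_no": range(1, 61), "marks": range(40, 100)})

# keep only a small representative sample instead of all 60 records
sample = records.sample(n=6, random_state=1)
print(len(records), "->", len(sample))
```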
14:46 okay next is data transformation so in
14:48 data transformation you already know
14:50 what is data transformation you will be
14:52 transforming the data into appropriate
14:53 form which is suitable for the data
14:54 mining process like you cannot just
14:56 randomly go and do the data mining
14:58 process from a raw data like abc if the
15:01 data is arranged in form of
15:03 comma separated values or uh you know
15:06 you cannot just randomly go and do data
15:08 mining operations on with whatever data
15:10 you want right it has to be suitable
15:13 format so that will be done by the data
15:14 transformation right the data
15:16 transformation step will be transforming
15:18 the data into suitable format and that
15:20 also we have four methods again
15:22 normalization so normalization is done
15:24 in order to scale the data values in a
15:26 specified range. so this is not
15:29 applicable for everything, because
15:30 you cannot
15:32 scale everything into a range of 0 to 1
15:35 or negative 1 to positive 1, right?
15:37 sometimes you'll have names, sometimes
15:39 you'll have sections, sometimes
15:40 different kinds of things — it is not
15:42 always possible to scale the data. so
15:44 whenever possible you can use
15:46 normalization: if you want to arrange the
15:48 data in a specified range, you can go for
15:51 normalization. got it.
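As an illustration of scaling values into a specified range — here min-max normalization into 0 to 1, one common choice — a minimal sketch in plain Python with made-up marks:

```python
# a minimal sketch of min-max normalization into the range 0 to 1,
# in plain Python; the marks are made-up example values
marks = [19, 45, 63, 91, 100]

lo, hi = min(marks), max(marks)
scaled = [(m - lo) / (hi - lo) for m in marks]

print(scaled)  # 19 -> 0.0, 100 -> 1.0, everything else in between
```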
15:53 next, after normalization we have attribute
15:54 selection —
15:56 that is, you will be creating new
16:00 attributes by using the older ones; by
16:01 using the older attributes you will be
16:04 creating new attributes. got it, that
16:06 is attribute selection, simple.
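For illustration (not from the video), a minimal sketch of creating a new attribute from older ones, assuming pandas; the subject columns are hypothetical:

```python
# a minimal sketch of building a new attribute from older ones,
# assuming pandas; the subject columns are hypothetical
import pandas as pd

df = pd.DataFrame({"internal_marks": [18, 22, 25], "external_marks": [55, 61, 70]})

# a new attribute created from the existing attributes
df["total_marks"] = df["internal_marks"] + df["external_marks"]
print(df)
```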
16:09 next comes discretization. so in discretization,
16:11 raw values are replaced by intervals —
16:14 discretization, the word itself says it. raw
16:15 values in the sense:
16:17 suppose
16:20 you have values like 10,
16:26 12, 13, 14, 21, 22, 34, 36, like this. so
16:29 instead of 10, 12, 13, 14, these will be
16:31 replaced by the interval 10 to 20,
16:33 the next ones by 20 to 30,
16:37 then 30 to 40, and so on. instead of raw values
16:38 you'll be
16:40 generating intervals for them. okay.
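A minimal sketch of discretization, assuming pandas; the bin edges 10/20/30/40 follow the example above:

```python
# a minimal sketch of discretization into intervals, assuming pandas;
# the bin edges 10/20/30/40 follow the example above
import pandas as pd

values = pd.Series([10, 12, 13, 14, 21, 22, 34, 36])

# each raw value is replaced by the interval it falls into
intervals = pd.cut(values, bins=[10, 20, 30, 40], include_lowest=True)
print(intervals.tolist())
```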
16:42 next is concept hierarchy generation —
16:44 that is, you are converting the
16:46 attributes from a low level to a high level.
16:48 that is, say city is an attribute — let us
16:51 take city as an attribute. from city you
16:53 are generating country; you are
16:56 converting city into country. city is
16:59 actually an attribute, country is
17:01 also an attribute, but city is a low-level
17:03 attribute whereas country is a
17:05 high-level attribute. got it, that is the
17:07 difference between city and country.
17:10 okay, so concept hierarchy generation means you will
17:12 be converting the low-level attributes
17:14 into high-level attributes.
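For illustration, a minimal sketch of concept hierarchy generation from city to country, in plain Python; the mapping is a made-up example:

```python
# a minimal sketch of concept hierarchy generation (city -> country),
# in plain Python; the mapping is a made-up example
city_to_country = {
    "mumbai": "india",
    "hyderabad": "india",
    "paris": "france",
    "lyon": "france",
}

cities = ["hyderabad", "paris", "mumbai"]

# replace the low-level attribute (city) with the high-level one (country)
countries = [city_to_country[c] for c in cities]
print(countries)  # ['india', 'france', 'india']
```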
17:15 that's all — this is all about data
17:19 pre-processing. so i'm done with the
17:23 video. i know the video is long,
17:25 and it is hard for you to
17:26 remember all these side headings as well,
17:28 i understand, but still i tried to make it
17:31 as simple as i can. so that's
17:32 all for this video. let's meet up in the
17:33 next coming video with another topic
17:35 till then if you're still having any
17:36 doubts just let me know in the comment
17:38 section i'll be very happy to clear your
17:40 doubts if i can and all the best for
17:42 your exams thanks for watching the video
17:44 till the end and i have started a new
17:45 channel about study abroad content if
17:47 you're interested have a look at the
17:48 channel i'll give the link of the
17:49 channel in the description. let's meet
17:50 in the next video.