YouTube Transcript: #8 Data Preprocessing In Data Mining - 4 Steps |DM|
Video Transcript
Hello everyone, welcome back to my YouTube channel Trouble Free. In this video I'm going to explain data preprocessing in the subject of data mining: what data preprocessing is, what steps are involved, and the sub-steps inside those steps. This video is going to be a bit longer than usual. I don't want to split it into parts, simply because I have already created the thumbnails and would have to edit the numbering again, so excuse me for that, and let's get into the video.
Data preprocessing is nothing but the process of transforming, or converting, raw data into an understandable format. Suppose you have the data mining marks of 60 students. The names of the students are listed separately as A, B, C and so on, and the marks are listed separately as 19, 91 and so on up to 100, all in some random format. Can you tell which student got how many marks? No. That is what raw data means. An understandable format means you arrange the data as a table, a chart, or a graph, whatever it is, so that the data can be understood. So the process of converting raw data into an understandable format is called data preprocessing.
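As a rough illustration of raw data versus an understandable format (my own sketch, not from the video, using hypothetical names and marks), here is how two separate lists can be turned into a single table with pandas:

```python
import pandas as pd

# Hypothetical raw data: names and marks stored separately, hard to relate.
names = ["A", "B", "C", "D"]
marks = [19, 91, 64, 100]

# Combine them into one table so each mark is tied to a student.
df = pd.DataFrame({"student": names, "marks": marks})
print(df)
```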
In data preprocessing we actually have four steps: data cleaning, data integration, data reduction, and data transformation.
Now let us learn about each step in detail. The first step is data cleaning. Data cleaning is the process of removing incorrect, incomplete, or inaccurate data, and it also replaces missing data. If there is any incorrect, incomplete, inaccurate, or inconsistent data, or any error in the data, it is removed, and the missing values are replaced; that is, if there are any empty spaces, values are filled into those spaces. That is data cleaning. In data cleaning we actually have two parts: handling missing values and handling noisy data. Missing values are nothing but empty spaces; noisy data is the incorrect, incomplete, inaccurate, or erroneous data. So how do we handle missing values and how do we handle noisy data? Let me start with handling missing values.
You can handle missing values in several ways. You can replace a missing value with NA, that is, not applicable. Or you can replace it with the mean value; this is used when the data is normally distributed. The mean here means you take all the remaining data apart from the missing entries, calculate its mean, and fill the missing values with that mean. You can also replace missing values with the median; the median is used when the distribution is not normal. So if the data is normally distributed, replace with the mean; if it is not, replace with the median. Sometimes you can also replace them with the most probable value, that is, the value with the highest chance of occurring. And one thing I should have said at the beginning but forgot: missing values can be filled in two ways, manual and automatic. Manual means you identify the empty spaces yourself and fill in the data; this works fine only for small data sets. Automatic is obviously more efficient than manual and suits large data sets. With that we are done with handling missing values.
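As an illustration (not from the video), here is a minimal pandas sketch of mean and median imputation on a hypothetical marks column; which one you pick would depend on how the data is distributed:

```python
import pandas as pd
import numpy as np

# Hypothetical marks column with missing values (NaN = empty space).
df = pd.DataFrame({"marks": [19, 91, np.nan, 64, 100, np.nan]})

# Mean imputation: suited to roughly normally distributed data.
df["marks_mean_filled"] = df["marks"].fillna(df["marks"].mean())

# Median imputation: more robust when the data is skewed (non-normal).
df["marks_median_filled"] = df["marks"].fillna(df["marks"].median())

print(df)
```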
Now we have handling noisy data. Noisy data is nothing but inconsistent or erroneous data, and we have three methods to handle it: binning, regression, and clustering. Binning is the new one for you; regression and clustering are not. In binning, first you sort the data, error values included. Once the data is sorted, you store it into bins: you create bins and distribute the sorted data among them. Once the data is in the bins, you perform smoothing. Smoothing is nothing but removing or replacing the error values, and it can be done in three ways: by bin mean, by bin median, or by bin boundaries. In smoothing by bin mean, the values present in a bin are replaced by the mean value of that bin. Suppose 2, 3, 4, 5 are in a bin and 4 is the error value. The mean of the bin is (2 + 3 + 4 + 5) / 4 = 14 / 4 = 3.5, so all the values in the bin are replaced with 3.5. In smoothing by bin median, you replace the values with the median of the bin. You know what the median is from mean, median, mode in statistics, but still: when you arrange the data in a particular order, ascending or descending, whichever value sits in the middle of the ordered data set is the median. For example, with the sorted data 1, 2, 3, 4, 5 the median is 3, because 3 is in the middle of the list with two values on either side, so you replace the bin values with 3. Next comes smoothing by bin boundaries: you replace the values with the bin's minimum and maximum values, each value going to whichever boundary is closer. That's simple. So that is binning: first you sort the data, then you store the sorted data into bins, and then you apply any of these smoothing methods.
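Here is a rough sketch of my own (not from the video) of equal-size binning with smoothing by bin means and by bin boundaries, using plain Python and hypothetical values:

```python
# Hypothetical sorted data, including a value that looks noisy in its bin.
data = sorted([4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34, 36])

bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value in a bin becomes that bin's mean.
smoothed_by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the closer of min/max of its bin.
smoothed_by_boundary = [
    [b[0] if abs(v - b[0]) <= abs(v - b[-1]) else b[-1] for v in b]
    for b in bins
]

print(bins)
print(smoothed_by_mean)
print(smoothed_by_boundary)
```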
Next, regression. Regression is nothing but numerical prediction of data. Everything about regression you will learn in the coming videos, so in an exam answer on this data preprocessing question you can just write "numerical prediction of data" and leave it there. Next comes clustering, which I have also already explained: similar data items are grouped into one cluster, and whatever items are dissimilar are thrown out of the cluster. Those dissimilar items are nothing but the error items, so you can easily remove them. That is clustering. So that is data cleaning: it has two parts, handling missing data and handling noisy data, where noisy data means error data. Handling missing data has no sub-categories, but handling noisy data has three methods: binning, regression, and clustering.
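As one way to picture the clustering idea (a hedged sketch of my own, not the video's method), points that end up alone in a tiny cluster can be treated as the dissimilar, noisy items, here using scikit-learn's KMeans on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D data where 200 does not belong to any natural group.
X = np.array([[10], [11], [12], [13], [50], [51], [52], [53], [200]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Treat points that land in very small clusters as the dissimilar (noisy) items.
sizes = np.bincount(labels)
noise_mask = sizes[labels] == 1
print("noisy points:", X[noise_mask].ravel())   # expected: [200]
```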
Next is data integration. I said the video was going to be long; data cleaning alone took six or seven minutes, but I don't think the remaining steps will take as much time, so let's go with the flow. Data integration is nothing but integrating data from multiple heterogeneous sources into a single data set. Heterogeneous sources means different types of sources and different types of data; homogeneous means everything is uniform and the same, whereas heterogeneous means you can combine numbers, words, alphabets, symbols, whatever you want. So data from heterogeneous sources is combined into a single data set; that is data integration. It can be done in two ways: tight coupling and loose coupling. In tight coupling, the data is combined together into one physical location. Suppose you have data source A and data source B; you combine A and B and store the result in a separate physical location, C. After that, if you want to access A or B separately, you cannot; once the data has been integrated in a tightly coupled way, you no longer have separate access to the individual sources. In loose coupling, the data is not actually integrated: only an interface is created, and the data is combined and accessed through that interface, a bit like a cloud service. Because you are not physically combining the data, in loose coupling you can access the combined data and still access the individual sources. It happens dynamically: if you issue a mining query, the data is combined then and there based on your query and the result is returned. That is data integration; the word integration itself says you are combining something.
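To make the two coupling styles concrete, here is a small pandas sketch of my own (hypothetical sources and columns, not from the video): tight coupling materialises one combined data set up front, while loose coupling keeps the sources separate and combines them only when a query arrives:

```python
import pandas as pd

# Two hypothetical heterogeneous sources describing the same students.
source_a = pd.DataFrame({"student": ["A", "B", "C"], "marks": [19, 91, 64]})
source_b = pd.DataFrame({"student": ["A", "B", "C"], "grade": ["F", "A", "B"]})

# Tight coupling: merge once and store the combined data set "C".
combined = source_a.merge(source_b, on="student")

# Loose coupling: keep the sources separate and combine on demand per query.
def query(student_id):
    row_a = source_a[source_a["student"] == student_id]
    row_b = source_b[source_b["student"] == student_id]
    return row_a.merge(row_b, on="student")

print(combined)
print(query("B"))
```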
After data integration we have data reduction. What is data reduction? If you have a very large amount of data, analysing it becomes hard. Is it easier to search among 10 people or among a thousand? Obviously 10. If the volume of data is very high, performance will be low. For that reason, in data reduction the volume of the data is reduced in order to make analysis easier. This can be done in two ways: lossy and lossless. Lossy means some of the data will be lost; lossless means no data is lost, everything stays as it is, but the data is compressed, like the online PDF compressors we use when a website only accepts uploads of at most 1 or 2 MB. In the same way, the volume of the data is reduced to make analysis easier. We have several methods in data reduction, and the first one is dimensionality reduction.
In dimensionality reduction, the number of input variables in the data set is reduced, so the data associated with those input variables is automatically reduced as well and performance increases. If there is a large number of input variables, there will obviously also be more dependencies, meaning one variable depends on another, and more dependencies means more data. Once you reduce the input variables, the dependencies are reduced and the data along with them. That is what dimensionality reduction is.
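One common way to reduce the number of input variables is principal component analysis; the video doesn't name a specific technique, so this scikit-learn sketch with made-up data is only an illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set with 5 input variables per record.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Keep only 2 derived variables that capture most of the variation.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_)
```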
Next is data cube aggregation. In data cube aggregation you combine the raw data, that is, the individual pieces of data, to construct a data cube; I have already explained what a data cube is in the first or second video. So you create a data cube from whatever data you have. How is the data reduced here? Any redundant data, that is duplicate or repeated data, or any noisy data that is present is removed, and a unique data cube is generated. That is data cube aggregation.
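As a loose illustration of rolling individual records up into cube-style cells (hypothetical sales data, my own example), a pandas pivot table sums the detail rows into one value per (year, region) combination, so the detail rows no longer need to be stored:

```python
import pandas as pd

# Hypothetical individual sales records (the raw data).
sales = pd.DataFrame({
    "year":   [2022, 2022, 2022, 2023, 2023, 2023],
    "region": ["East", "East", "West", "East", "West", "West"],
    "amount": [100, 150, 200, 120, 180, 220],
})

# Aggregate into one cell per (year, region): a small slice of a data cube.
cube = sales.pivot_table(index="year", columns="region",
                         values="amount", aggfunc="sum")
print(cube)
```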
Next comes attribute subset selection. Here you have many attributes; attributes are nothing but columns. In a table, a data warehouse, or a data mining system you will have many columns associated with a single table. Only the highly relevant attributes, the ones that are closely related to the data or highly important, should be used; the others should be discarded, that is, deleted and removed from the database. In this way also the data can be reduced. This is what is called attribute subset selection.
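A hedged sketch of keeping only the most relevant columns (hypothetical data; the video doesn't prescribe a particular criterion), using scikit-learn's SelectKBest to retain the two attributes most related to the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data set: 4 attributes (columns), only some relevant to the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # label depends on columns 0 and 2 only

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("kept columns:", selector.get_support(indices=True))   # expected: [0 2]
```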
Next is numerosity reduction, the fourth method within data reduction. Here we store only a model of the data instead of the entire data; instead of keeping everything, we keep only a model, that is, a sample of the data, and run our tests or operations on that. For example, in our college, during lab exams or project submissions they collect 60 to 65 records per class or section, depending on its strength. Will they store all the records? No; they keep only five or six records for reference for the next year or for inspection, not everything. In the same way, instead of storing the entire data, only a sample or model of the data is stored. That is numerosity reduction. With this we have completed data reduction.
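One simple form of numerosity reduction is random sampling; here is a minimal pandas sketch with hypothetical records, keeping a small sample in place of the full data:

```python
import pandas as pd

# Hypothetical full set of 60 student records.
records = pd.DataFrame({"student_id": range(1, 61),
                        "marks": [(i * 7) % 100 for i in range(60)]})

# Keep only a 10% random sample as the stored "model" of the data.
sample = records.sample(frac=0.1, random_state=0)
print(len(records), "records reduced to", len(sample))
```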
Next is data transformation. You already roughly know what data transformation is: you transform the data into an appropriate form that is suitable for the data mining process. You cannot just run data mining operations on raw data in whatever shape it happens to be, for example plain text like "abc" or comma-separated values; it has to be in a suitable format, and that is what the data transformation step does. Here too we have four methods. The first is normalization. Normalization is done in order to scale the data values into a specified range, for example 0 to 1 or -1 to +1. This is not applicable to everything: sometimes you have names, sometimes sections, different kinds of values, so it is not always possible to scale the data. Whenever it is possible and you want the data in a specified range, you can go for normalization.
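As an illustration (hypothetical marks, my own example), min-max scaling is one standard way to normalize values into the 0-to-1 range:

```python
from sklearn.preprocessing import MinMaxScaler

# Hypothetical marks to be scaled into the range [0, 1].
marks = [[19], [91], [64], [100], [45]]

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(marks)
print(scaled.ravel())   # 19 -> 0.0, 100 -> 1.0, others in between
```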
After normalization we have attribute selection: you create new attributes by using the older ones; from the existing attributes you construct new attributes. That is attribute selection, simple.
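A small pandas sketch of my own (hypothetical columns) deriving a new attribute from existing ones:

```python
import pandas as pd

# Hypothetical existing attributes: internal and external marks.
df = pd.DataFrame({"internal": [18, 25, 30], "external": [45, 60, 70]})

# New attribute built from the older ones.
df["total"] = df["internal"] + df["external"]
print(df)
```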
Next comes discretization. In discretization, raw values are replaced by intervals. Suppose you have values like 10, 12, 13, 14, 21, 22, 34, 36. Instead of the raw values 10, 12, 13, 14, you use the interval 10 to 20; the next values fall into 20 to 30, then 30 to 40, and so on. So instead of raw values, you generate intervals for them.
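Here is a minimal pandas sketch (values taken from the example above) that replaces raw values with interval labels using pd.cut:

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 21, 22, 34, 36])

# Replace raw values with the interval each one falls into.
intervals = pd.cut(values, bins=[10, 20, 30, 40], include_lowest=True)
print(intervals)
```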
Next is concept hierarchy generation: you convert attributes from a low level to a high level. For example, take city as an attribute; from city you generate country. City is an attribute and country is also an attribute, but city is a low-level attribute whereas country is a high-level attribute; that is the difference between them. So concept hierarchy generation means converting low-level attributes into high-level attributes. That's all.
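A minimal sketch of my own (hypothetical cities) that rolls the low-level city attribute up to the high-level country attribute:

```python
import pandas as pd

# Hypothetical low-level attribute: city.
df = pd.DataFrame({"city": ["Hyderabad", "Mumbai", "Paris", "Lyon"]})

# Concept hierarchy: map each city up to its country.
city_to_country = {"Hyderabad": "India", "Mumbai": "India",
                   "Paris": "France", "Lyon": "France"}
df["country"] = df["city"].map(city_to_country)
print(df)
```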
That is all about data preprocessing, so I'm done with the video. I know the video is long, and it is hard for you to remember all these side headings, but I have tried to make it as simple as I can. That's all for this video; let's meet in the next video with another topic. Till then, if you still have any doubts, let me know in the comment section and I'll be happy to clear them if I can. All the best for your exams, and thanks for watching the video till the end. I have also started a new channel about study-abroad content; if you're interested, have a look, I'll put the link to the channel in the description. Let's meet in the next video.