Complete Exploratory Data Analysis And Feature Engineering In 3 Hours | Krish Naik

hello guys today we are going to do a

lot of amazing things with respect to

eda so zomato

data set

exploratory data analysis right

we are going to complete this today

so before we start please make sure that

you download the data set inside my data

set i have so many things i'll just show

you you'll be having files like

country-code.xlsx zomato.csv

file1.json file2.json file3.json

and the remaining json files

today the data set that we are going to

use is zomato dataset uh i have found

out this particular data set from kaggle

so i put this

entire link over here github link so you

can download the data set from here also

you can download it from the pinned comment

so let's go first of all i'm just going

to import

some basic libraries

import pandas as pd

import numpy as np

import matplotlib

dot pyplot as plt

and one more library that i am going to

use is something called as seaborn and

finally we will be using matplotlib inline

so that the images or any visualization

gets displayed here itself i'll keep it

restricted to all these things and

understand the main thing the main thing

is that whenever you are performing eda

that is exploratory data analysis

you really need to think about the data

what the data is basically saying or

telling you right that is most important

so whenever you have a specific data set

even though if you don't have much

domain knowledge

some basic information definitely you'll

be able to capture it so what we are

going to do over here is that till now

we have actually

imported all the libraries now let's go

ahead and first of all let's download

the data set from here so here you can

see that there is a data set called as

zomato.csv

country-code.xlsx and there are multiple

json files

now why this json file was given guys

because this json file is in the form of

this json format okay

this format has been converted already

into zomato.csv now how it has been

converted you really need to write a

python script to convert this we will

not see this right now but in the later

stages i will also show you how we can

convert this json file into

zomato.csv that part we will do it in

the later stages in the upcoming classes

but today we have one xlsx file one csv

file we'll try to combine this file also

we'll try to see what all information

you have in this specific file so let's start

so first of all as usual i'll just write

df my data set and i'll write

pd dot read underscore csv and i will

read the data set which is

zomato.csv now when you are actually

importing this zomato.csv the other

thing that you need to see is over here

is that if i just execute it like this

so here i will be getting some errors it

says that utf-8 codec can't decode byte

0xed in position 7 0 7 4 7 0 4

whenever you get this kind of error

always remember that you have to use

some kind of encoding format now in this

case what encoding you will be using if

you probably go and see in read

underscore csv and press shift tab here

you will be able to see lot of options

so one options i'll see over here

encoding encoding somewhere it will be

or you can see the parameters you can

search the parameters over here with

respect to encoding

what you need to put over there you can

play with three to four different values

but understand you need to have utf-8

encoding so for this what i'm

going to do i'm going to use an encoding

and remember this encoding i did not

use directly in the

first instance but i used after

exploring some of the things over here

with respect to the kind of error so

encoding here i'm going to use

latin-1 you have different different

encodings again utf-8 encoding uh you

can just check out the documentation

over there so i'm just going to use

latin-1 and then i'm going to

basically say df.head now if you go and

see over here these are all my data sets

that are available over here
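The encoding fix described above can be tried without the real file. This is only a small sketch: the one-row CSV, the restaurant name and the city are made up for illustration, and the bytes are built in memory instead of being read from zomato.csv. It shows the default utf-8 read failing on a 0xED byte and the same bytes loading once encoding latin-1 is passed.

```python
import io

import pandas as pd

# Hypothetical tiny CSV encoded as Latin-1. The byte for "í" is 0xED,
# which is not valid UTF-8 when followed by a plain ASCII letter.
raw = "Restaurant Name,City\nRío Grill,Brasilia\n".encode("latin-1")

# Reading with the default UTF-8 codec fails on that byte.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the encoding explicitly decodes the same bytes correctly.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(utf8_failed)
print(df.head())
```

With the real file the call would look like pd.read_csv('zomato.csv', encoding='latin-1'), as said in the video.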

it's a huge data set with respect to the

number of columns

but understand this is how we read the

data set over here

and with respect to this you can check

it out all the features and all but now one

thing that we have done is that we have

imported all the data set over here

inside my df let's go to the next step now

so this is my data set over here that is

present now the next step what i'm

actually going to do i'm just going to

see what all columns i have inside my

data set so now this is in the basic eda

part so over here you have restaurant id

restaurant name country code city

address locality

locality verbose longitude latitude

cuisines average cost for two currency

and many more features are actually

present over here just go and search for

pandas documentation

anytime you have any kind of queries

with respect to what encoding you have

to write and all you can just directly

search for it

you can search from here when you search

from here anyhow anywhere you will be

able to see it why encoding is used why

utf-8 is used from here you have to

explore it over here you can see

encoding as none right why it is

basically used

just click on this try to understand

that specific keyword now the next thing

over here let's go ahead and let's see

one more way of understanding about the

data set is like df.info

if you write df.info here you will be

able to see

what all columns are there

whether this column is non-null or null

what is the data type of

this here you can see int64 int64 is

specifically for integer variables

whenever you see objects in pandas in

data frame object basically means

strings it can it can also mean like it

is maybe a categorical variable it may

be a text variable it can be anything

over here so here you can basically see

all those things you also have float

uh you have objects objects wherever

objects is there just consider that it

may be a categorical variable it may be

an integer variable it may be a text

data initially always you do this you

try to find out what are the columns try

to find out what are the

important information about the columns

with respect to the data type now coming

to the next step now let's see what we

can further do what what

actual information we can actually come

up with this there is also an inbuilt

keyword which is called as describe

so this is a basic inbuilt function

which is called a describe which will

actually help you to find out all the

specific information now one key

important information from this is that

here you will be able to see that all

the features that are basically taken

inside this describe function right

these are only numerical features you will

not be able to find out any categorical

features any text features any object

features over there so here definitely

see with respect to any feature that you

see restaurant id if i go and see

restaurant id it is int64 if i go and

see country code it is basically uh

int64 if i go ahead with longitude it is

always float64 see over here

longitude latitude float64 so all these

values that you are actually able to see

over here this is completely based on your

uh numerical variables because whatever

thing you are doing like count mean

standard deviation min you have to

basically find it out in the integer and

numerical variable now i'll just give

you a basic information in data analysis

the first thing that i would like to

find out is that we'll try to find out

missing values

first of all always it is very much

important in our data set do we have

missing values the second thing that we

may probably do

explore

about the numerical variables

third i would like to definitely explore about

categorical variables these are some

basic things because i need to know that

how many categories are there how many

numerical variables are there the fourth

major thing that we probably do is that

finding relationship between features
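The earlier point that describe only reports on numerical columns can be checked on a tiny frame. This is a minimal sketch with made-up rows, the column names are borrowed from the dataset but the values are hypothetical.

```python
import pandas as pd

# Hypothetical mini version of the restaurant table (values are made up).
df = pd.DataFrame({
    "Restaurant ID": [1, 2, 3],
    "Restaurant Name": ["A", "B", "C"],
    "Longitude": [77.1, 72.8, 88.3],
    "Cuisines": ["Indian", None, "Chinese"],
})

# One dtype per column: integers, floats, and string-like columns.
print(df.dtypes)

# describe() summarises only the numeric columns by default.
summary = df.describe()
numeric_cols = list(summary.columns)

# Everything that is not numeric (names, cuisines, other text).
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```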

let's go ahead and try to find out what

are the missing values in order to find

out the missing values you can basically

write df dot isnull

dot sum

so if i go and see over here you will

be able to see that with respect to

every feature it is just saying that how

many values are basically null

here you can see that 0 is there

0 is there 0 is there

what about duplicates i'll talk about

duplicates also so here you can see in

city 0 is there address 0 is there

locality 0 is there locality verbose

zero is there longitude latitude zero is

there but in cuisines you can see that

there are nine missing values remaining

all you have zero missing values so here

in cuisines you can see that there are

nine missing values if you want to do

anything with respect to the missing

values you basically have to work on

this specific feature now can i find out

any relationship with respect to cuisines

with any other

target variables or any other

independent features okay that we will

try to do but right now you have got

this specific information that that many

number of missing values are there now

this is one way in another way i will

just write a simple code which will

actually tell me all the informations

all the features that has missing values

over here so what i can basically do

i'll say that features for features

in df.columns

i want to check which all variables has

missing values so i'm saying that for

every features in df.columns

go and check if df of

features

dot isnull

dot sum

is greater than one

so this is basically a list

comprehension so here what i'm saying is

that features for features

in df.columns that basically means we

are using this temporary variable called

as features which will iterate through

df.columns and then i will say that if

that specific feature dot isnull

dot sum is greater than 1

i should not write greater than 1

instead i should write greater than 0

because greater than 1 will skip a column that has exactly one missing value

so if i go and execute it here

you can see that i am having cuisines so

definitely i am able to get

the feature where

i am able to see the null value
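The two ways of finding missing values described above can be sketched like this. The frame is made up, with exactly one missing value in cuisines, so that the difference between greater than 0 and greater than 1 is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only "Cuisines" has a missing value, and only one.
df = pd.DataFrame({
    "City": ["New Delhi", "Gurgaon", "Noida", "Pune"],
    "Cuisines": ["North Indian", np.nan, "Chinese", "Cafe"],
    "Votes": [10, 25, 3, 8],
})

# Way 1: number of missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Way 2: list comprehension that keeps only the columns with missing values.
features_with_na = [feature for feature in df.columns
                    if df[feature].isnull().sum() > 0]

# Using "> 1" instead would silently drop a column with a single missing value.
too_strict = [feature for feature in df.columns
              if df[feature].isnull().sum() > 1]

print(features_with_na)
print(too_strict)
```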

let's go to the next step with respect to

heat map can we

plot something so for heat map i will

basically be using sns dot heatmap and

here i'll basically put the condition

which says that df dot isnull

and here i will set

my second parameter which is

yticklabels if i go and press shift tab

over here always try to see this feature

and with respect to this particular

feature whatever i am actually using

xticklabels is there yticklabels is

there right now i don't want to show

much things in y so i will just keep it

as false

because i am focusing only on df with

respect to that then i can also use

cbar it is also another feature over

there you can understand by just seeing

the documentation what all things it can do

and then i will use a cmap

and inside this cmap i can use any one

i'll basically search for it over here

you can see

here more options is not visible you can

go to the seaborn documentation page and

basically take out that specific

information so here i'm going to use a

cmap which is called as viridis

so here obviously i'm not able to see

that nine records because it may be somewhere

probably i won't be able to see that

probably in this specific thing i should

have that right

let's see

cuisines has 9 okay the total

number of let's say total number of rows in df

they are around 9551 rows

so because of that it is not getting

visible over here
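The heat map call described above can be sketched as below. The small frame is hypothetical and has a much larger share of missing values than the real data, where 9 missing out of 9551 rows is under 0.1 percent and so is easy to miss in a small figure. The Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical rows with two missing cuisines out of six.
df = pd.DataFrame({
    "City": ["New Delhi", "Gurgaon", "Noida", "Pune", "Agra", "Goa"],
    "Cuisines": ["North Indian", np.nan, "Chinese", np.nan, "Cafe", "Goan"],
    "Votes": [10, 25, 3, 8, 14, 7],
})

# True where a value is missing, False otherwise.
missing = df.isnull()

# Each row of the frame becomes one row of the heat map; the y labels
# and the colour bar are switched off as in the video.
ax = sns.heatmap(missing, yticklabels=False, cbar=False, cmap="viridis")

# Share of missing rows in the column, here 2 out of 6.
share_missing = missing["Cuisines"].mean()
print(share_missing)
plt.close("all")
```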

very small number of nan values so that

is the reason we cannot see it but if

there are many many you can definitely

check it out so we have

understood about the missing values and

we have seen that now i have already

told you that there is another data set

which is called as country code now

let's try to see that what this data set

basically has so i'm going to write df

underscore country and i'm going to

read the country code file and then i'm going to basically write

df underscore country dot head

it is giving me an error let's see what

is the error

okay here also some problems with

respect to invalid continuation byte i

cannot use read underscore csv i have to

use read underscore excel because it is

an excel file otherwise again you have

to use that same encoding things to make

it work how to deal with missing values

uh that i will try to show you in

feature engineering so here you have

this one country-code.xlsx

so what you have over here see

country code country two features if i

go and probably see my df dot columns do

we have country code over here here also

country code are there can we combine

these two data frames so what we will do

in order to combine we will be using

pd.merge

so merge is a function which will

actually help us to combine

in the left i will give one data

set in the right i can give another data

set

so here i'll give df and here i'll give

df underscore country but let's see

another feature

so there will be one feature which will

basically say on

this on basically says that on which

feature you are basically going to

combine that two tables so here i'm just

going to say on is equal to

i'm going to copy this country code so

here i've copied this country code

and here there is also one more keyword

which is called as how

this how will basically specify whether

you have to focus on

left table or right table so here

probably somewhere you will be able to

see this is how whether you want a left

join the right join or inner join but

right now i want to really focus on my

left hand side of table which is df

because this has the entire data set in

the right hand side i just have one

additional column that is country name

so in order to combine it what i'm

actually going to do i'm just going to

focus on left

so here is my left and once you see this

you will be able to see that i will be

able to get all the records

and somewhere you'll also be able to see

country see in the last thing country is

getting added i will just save this in

my final data frame which is called as

final underscore df so this is my final

underscore df and now if you go and

probably see final underscore df dot head

and if you check the first two records

you will be able to find out everything
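The left join described above can be sketched with two small frames. The rows and the code to country mapping here are made up for illustration; in the video df_country comes from pd.read_excel on the country code file.

```python
import pandas as pd

# Hypothetical restaurant rows (left table).
df = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Country Code": [1, 216, 1],
})

# Hypothetical lookup table (right table) with one extra column, Country.
df_country = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# how="left" keeps every row of df and attaches the matching country name.
final_df = pd.merge(df, df_country, on="Country Code", how="left")
print(final_df.head())
```

A left join is the safe choice here because a restaurant whose code is missing from the lookup table is kept, with an empty country, instead of being dropped.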

so finally final underscore df is my

entire data set now let's go ahead

inside the data set and try to explore

what all things we have there is also

another way to check data types

if you want to check data types you just

have to write something like this

final underscore df dot dtypes so there

is also dtypes which will actually help

you to just get the data types information

so just use dot dtypes and there you

will be able to see the entire data type

this on is basically used to match on

which column you are basically going to

combine just like how you do left join

right join

on a specific column if you

have seen my sql videos i have

already uploaded let's go to the next

step now

let's try to do something amazing and

now let's try to explore something from

the data now understand one thing is that

if i go and see this data there are

features like

okay let's let's open this let's open

this final underscore

df dot columns

here you'll be able to see there are

features like country code city address

locality locality verbose longitude

latitude cuisines average cost for two

currency this this this are there let's

pick up something okay let's pick up

probably let's see that i i just want to

find out something okay and mainly

understand whatever things i will do

right now i will make sure that i'll

write observations for those so what i'm

actually going to do over here is that

let's say that i'm going to use

something like this final

underscore df dot country dot

value underscore counts what i'm actually

doing over here i'm just trying to find

out how many different countries are

there and with respect to this

particular countries so in this records

right with respect to a specific

countries how many records are there so

in india you will be able to see 8652 records

in united states you'll be able to see

434 united kingdom 80 then 60 60 60 60 so

from this what kind of observation do

you feel that you can come up with

can you say that zomato is mostly

available in india itself obviously in

usa they just have a website

which they will recommend some kind of restaurants but

just understand one thing over here is

that in india the main base of zomato is

there so maximum number of transactions

that may probably be happening is in

india right i hope everybody is able to

understand right so from this this

information you are able to get

now if i write dot index

now with respect to the dot index you'll

be able to see i'm able to get all the

countries name with respect to that

specific records okay
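The value counts step above can be sketched as follows. The frame is made up, only the Country column matters, and country_names and country_val are the variable names used in the video for the index and the counts.

```python
import pandas as pd

# Hypothetical merged frame; only the Country column is needed here.
final_df = pd.DataFrame({
    "Country": ["India"] * 5 + ["United States"] * 2 + ["United Kingdom"],
})

# Number of records per country, sorted from most to least frequent.
counts = final_df["Country"].value_counts()

country_names = counts.index    # the country labels
country_val = counts.values     # the record counts, in the same order

print(list(country_names))
print(list(country_val))
```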

so let me just save this probably in a

variable which is like country names

i'll tell you why i'm doing it

everything will make sense

completely after this i'm going to plot

some pie chart i'm going to plot some

chart now similarly if i use the same thing

and if i execute it

with dot index you will be able to see

that i'm getting this country names but

with dot value counts i will also be

able to get

the counts with

country dot value underscore counts

dot values

let's see

dot values okay so with respect to

dot values i'm actually getting all the

number of records for that particular

country name now the

reason why i'm doing this here is that

i'm going to create some pie chart

now how do we create a pie chart so

you use plt.pie

and with respect to this you can

actually put in all your variables so

i'm going to press shift tab

if i am actually plotting a pie

chart over here i definitely have to use this

now over here in the x value i will try

to use my names or values whatever

things you want let's say that i want to

use my values so here i will store this

as my country

value so i'm going to put this entirely

over here in the x axis because i want

to see in the pie chart

which country has the maximum

transactions or maximum

online orders or maximum kind of orders

over here so i'm going to use this as my

x value so this is my x parameter in plt.pie

so here if you expand it here you will

be able to see it and then you have

labels this is important okay labels is

basically to give the labels on top of

it so i'm just going to use labels is

equal to i'm going to assign this value

to something like country name okay

country name so these two things are

there now if i execute it here you will

be able to see that i'm getting a plot

now this plot looks really bad because

obviously the percentage of the

information spread towards the different

different countries is very less so it

is like jumbled up complete so what i am

going to do is that i am just going to say

which are the top five countries

top five countries or top three

countries the top three countries that

uses zomato that is based on your

transaction right so what i'm going to

do here i'm just going to use colon 3

here also i'm going to use colon 3

colon 3.

so that basically says from entire all

the values over here i'm just going to

take the top three values and top three

countries and i'm going to just display

now it looks good now which is the top

three countries that is basically using

india united states and united kingdom

right so i hope you are able to

understand over here with respect to the

pie chart like how is my data

distributed and over here definitely

with respect to zomato the

base company is in india so obviously you

can come to a conclusion that maximum

number of transactions will happen in

india now one more thing that i probably

want to add is something called as

percentage because i need to see some

percentage also right that would be

pretty much amazing right

so what i'm actually going to do over

here there will be a parameter which is

called as percentage also and that

parameter is something called

autopct so i will use this autopct

and i'm going to use one property

if i want to see one property over here

what will i assign to this you can

assign one format and that format i can

basically write something like this this

basically says that after this after the

decimal two values will be mentioned

when it is getting converted into

percentage so i'm just going to remove

this double quotes

and this will definitely work then play

with it if i write if i remove this two

what will happen if i remove this

if i remove f what will happen just try

to play with it now if i execute it here

you can see now

94.39 percentage

of the records are from india

4.73 percentage is from united states

0.87 is from

united kingdom so here you need to write your

observation now tell me suggest me what

observation should i write over here

from this diagram what kind of

observation that you can see you just

need to add this particular property

to get the percentage values
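The pie chart with percentages can be sketched as below, using the three record counts read out earlier (8652, 434 and 80). The exact format string typed in the video is not visible in the transcript, so '%1.2f%%' is an assumption that matches the description of two values after the decimal. The Agg backend line is also an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt

# Top three countries and their record counts, as read out in the video.
country_names = ["India", "United States", "United Kingdom"]
country_val = [8652, 434, 80]

# x takes the values, labels the names, autopct the percentage format:
# "%1.2f" prints two digits after the decimal, "%%" prints a literal % sign.
plt.pie(country_val, labels=country_names, autopct="%1.2f%%")

# Collect every text drawn on the chart (labels and percentages).
chart_texts = [t.get_text() for t in plt.gca().texts]
print(chart_texts)
plt.close("all")
```

Changing the 2 in the format to 0 or 1 changes how many decimals are shown, which is the experiment suggested in the video.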

tell me what is the observation

zomato maximum records

are from india

okay

after that usa

you have to write your observation in

your own words here i have just written something

but just try to write so here observation

zomato maximum records or transactions

are from india after that usa and then

united kingdom so this is my first

observation that i have been able to

take from this pie chart

major business is happening in india you

can say and all a lot of things can come

okay everybody is clear with this i hope

it's very simple till here okay now

let's go with respect to the next one

how do we identify how many numerical

variables are there

forget about numerical

variable let's do some exact relationship

numerical variables we can

check it in later stages

but i want to really do more observation

things more relationships things so that

i will be able to see something now if i

go and write final underscore df dot columns

if i execute this here you can see some

amazing features which is called as

aggregate rating because i want to

also see with respect to the rating from

which country more rating is actually

coming and i want to see this data which

is called as rating color rating text

and all okay so what i'm actually going

to do

i'll just write a small query

final underscore df dot

groupby i am going to use a groupby operation

and with respect to a groupby operation

here i am going to use features

which is called as aggregate rating

aggregate rating and then i will also

see this everybody rating color i'm

going very slowly guys very very slowly

i think you can write it down i am

writing each and every line of code

rating color

and then i'm also going to use rating text

so i'm basically going to uh

group by this three main features

and after this i'm also going to do one

thing so if i group by this

and probably execute

i'll be getting an error let's see what

is the error rating text so it should be

rating text with small t

so if i execute here you can see that it

is now a data frame group by object

now if i write dot size

so if you if i execute this dot size

here you will be able to see all the

values like white

not rated this this this this are there

and similarly good good good very good

see over here one thing you can see that

when the rating color is white that

basically means your aggregate rating is 0.0

0.0

if your rating is red then it is

basically showing 1.8

1.9 is also red 2.0 is also red 2.1 is

also red like this 2.4 is also red so

all these are red red basically means it

is poor it is poor so this ratings are

poor with respect to this aggregate

rating you can see that it is poor if i

go with respect to the next one which is

in orange color here you can see that

these are my all average ratings from

2.5 to 3.4 then you have from 3.5 to 3.9

that is another rating over here here

you can also see that these are good

right so it is displayed in yellow color

or the text is written in yellow color

that like the rating colors are there

and then from 4.0 to 4.9 we have very

good and excellent so this information i

know i have actually able to find it out

so i'll also can write my uh

i'll try to write my own observation

over here now what i'm actually going to

do over here is that after i do this

i'll convert this into data frame now in

order to convert this into data frame

what i will do is that i will just write

reset underscore index

and this is an invalid error the reason

it is an invalid error because i have to

continue over here

reset underscore index and then i'm

basically going to just say that rename

or if i just execute this let's see what

i'll get so here you can get see that

i'm getting this particular things and

this is my zero value since i have done

group by

with respect to 0.0 ratings i have 2148

records then with respect to 1.8 i have

one record 1.9 two records 2.0 seven records

over here 0 is coming so instead of this

0 i'll try to rename it with different

column so here i'm just going to use

after reset index dot rename

and here i'm going to basically use columns

is equal to

and i'm going to name it as 0 colon rating count

now let's do one thing

now see what i've done after doing reset

index i'm using rename function

and i'm saying wherever the columns is 0

change it to rating count

so once i execute this

you can see that i'm getting one error

because i have not closed it i will

close it now

so here i've closed it and here

and now here you can see that i'm

actually able to see this everybody you

just write down this code i know many

people will get stuck over here

now we we'll do multiple things with

respect to this so what are the

important information that i'm actually

able to get from here

are they correlated we'll try to find

out don't worry right now i've still not

gone into correlation those are some

inbuilt directly you're using inbuilt i

don't want to go into inbuilt right now

now over here main features everybody

has written this final underscore df dot

group by aggregate rating rating color

rating text dot size dot reset index dot rename

columns you are renaming from 0 to

rating count

so here you can see that aggregate

rating is there rating color is there

rating text is there rating count is

there so all these informations you have

with yourself right all these amazing

information you have over here
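The groupby chain spelled out above can be sketched on a small made-up frame. The column names follow the dataset, the six rows and their colour and text values are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the three rating columns.
final_df = pd.DataFrame({
    "Aggregate rating": [0.0, 0.0, 2.1, 3.2, 3.2, 4.6],
    "Rating color": ["White", "White", "Red", "Orange", "Orange", "Dark Green"],
    "Rating text": ["Not rated", "Not rated", "Poor", "Average", "Average", "Excellent"],
})

ratings = (
    final_df.groupby(["Aggregate rating", "Rating color", "Rating text"])
    .size()                                # rows per group, as a Series
    .reset_index()                         # group keys become columns, counts land in column 0
    .rename(columns={0: "Rating count"})   # give column 0 a readable name
)
print(ratings)
```

Wrapping the chain in parentheses lets each step sit on its own line, which avoids the invalid syntax error seen in the video when the line is broken without a continuation.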

now let's go to the next step

now what i'm actually going to do over

here is that i have my rating count information

reset index basically means it will just

reset this index

by default whatever index is coming you

have to reset that

now i will just save this in a variable

this variable will play a very important

role guys now

so i'm giving you another one minute

please write it down so ratings is equal

to this one

final underscore

df dot groupby the same code

guys please write it down

if anybody has not written it down

then again i am going to share it to you here

please write them down this particular

code because it will be very much important

now i have all these things if you go

and see ratings

so here you have all the values aggregate

rating rating color rating text and

rating count now let's go ahead let's go

ahead and let's plot some amazing

beautiful diagrams now i want to really

find out

this all relationship with respect to

different different countries

with respect to different different

problem statements with respect to this

how see how as a data analyst data

scientist you have to think okay this is

my data set okay probably what what type

of visualization i can draw from this

because i want to do some kind of eda

okay what what kind of things i can do

about this just by seeing the data i can

definitely come up with one conclusion

is that

around 2148 records have zero rating

maximum number of people have actually

given zero ratings that basically means

they have not rated

the app or the entire zomato app itself

right so here what we are focusing on we

are trying to understand okay maximum

number of ratings zero basically means

person has not given any ratings right

so here you can see rating text is not

rated right people who are giving

ratings you can see poor average good

along with that colors are also given so

can we plot this in an amazing way so

that we can understand in a visualized

way also so let's go ahead

from this i can come up with conclusions

again i'll write conclusions

conclusions is very much important

observations i can also say observation

so this is my observation from this data set

the first observation is that

whenever the rating is from 4

to 4.9

or let's say sorry 4.5 to

4.9 so here i'm going to write the

observation when rating

is between

4.5 to 4.9

this indicates what does it indicate it

indicates that it is excellent

probably the food that was delivered was

basically excellent second thing that we

can come up with this observation is that

when ratings

are between

3.5 to 3.9

i hope 3.9 only right

no 4.0

to 4.4

the ratings are very good the third

thing that i can come up with is that if

the rating is between 3.5

to 3.9

here the rating is good

so this is my observation because i can

definitely see from the data right and

remaining all please go ahead and write

it down okay

so another observation from 3.0 to

3.4 it is average

so this is my next observation and fifth

i will go ahead and write

when the rating is between 2.5

and 2.9 how much it is average

again this is also average

and sixth 2.0 to 2.4 is poor right
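The rating bands from these observations can be written down as a small lookup. This is a hypothetical helper, not something from the video: rating_text is a made-up name, and ratings below 2.0 such as 1.8 are treated as poor because the grouped table showed them in red.

```python
def rating_text(aggregate_rating: float) -> str:
    """Map an aggregate rating to the band seen in the grouped table.

    Hypothetical helper; the cut-offs are the ones read off the data
    in the observations (0.0 is treated as 'not rated').
    """
    if aggregate_rating == 0.0:
        return "Not rated"
    if aggregate_rating < 2.5:
        return "Poor"        # red in the table, up to 2.4
    if aggregate_rating < 3.5:
        return "Average"     # 2.5 to 3.4
    if aggregate_rating < 4.0:
        return "Good"        # 3.5 to 3.9
    if aggregate_rating < 4.5:
        return "Very Good"   # 4.0 to 4.4
    return "Excellent"       # 4.5 to 4.9


bands = [rating_text(r) for r in (0.0, 2.1, 3.0, 3.7, 4.2, 4.8)]
print(bands)
```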

so these are some of my observation just

complete down all the observations that

you can find out from this and one more

thing that you can see that zero rating

right so these are all my observations

with respect to this but if i am writing

observation it is better that we also

draw some kind of diagrams now here i'm

going to basically draw a diagram so

this is my ratings so here i'm going to

use aggregate rating let's say that this

is my ratings dot head

so here i have aggregate rating rating

color rating text rating count so i'm

going to use now seaborn bar plot let's

see can we visualize with the help of

bar plot something in this so here i'm

basically going to use

uh in bar plot always understand what

all features you have so here you have x

y hue data order hue order everything is

there but what i am going to do i am

just going to do a simple bar plot

so here in the x axis i am going to

basically use aggregate rating

in the y i'm basically going to use

rating count

let's say that i'm going to see the

relationship between

aggregate

rating and rating count see this is my

aggregate rating and this is my rating

count i want to basically draw a bar

plot and basically check how the graphs

look like okay so the third parameter

here i am going to basically use data is

equal to ratings so once i write this

and execute it here you can basically

check out how beautiful it looks now the

diagram looks smaller so what i'm

actually going to do i'm just going to

put one

simple setting to increase the diagram size

so that you'll be able to see it in a

better way okay and that setting is

basically there in matplotlib

it is

called as

matplotlib dot rc params figure dot

figsize here you can give the

width and height i am now

giving 12 comma 6

so here

matplotlib okay import matplotlib i'm

just going to write it down

so now here you can see the diagram

looks quite bigger
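A minimal sketch of the bar plot and the figure-size setting described above. The `ratings` table is a small made-up stand-in for the grouped frame used in the session, and the column names (`Aggregate rating`, `Rating Count`) are assumed from what is read out. The `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the grouped ratings table (illustrative values only).
ratings = pd.DataFrame({
    "Aggregate rating": [0.0, 2.2, 2.9, 3.4, 3.9, 4.4, 4.9],
    "Rating Count": [2148, 27, 381, 495, 332, 143, 61],
})

# Every figure created after this line is 12 inches wide and 6 inches tall.
matplotlib.rcParams["figure.figsize"] = (12, 6)

# Simple bar plot: one bar per aggregate rating, height = rating count.
ax = sns.barplot(x="Aggregate rating", y="Rating Count", data=ratings)
fig_size = tuple(ax.figure.get_size_inches())
plt.close("all")
```

Setting the size through `rcParams` changes the default for all later plots, which is why the heat map also comes out bigger when it is run again.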

now if i probably go and execute the

heat map over here again

let's see whether it will change or not

so now you can see these values right the

missing values once i made the diagram

a little bit bigger

you can see this so what

is the code that we had missed

with respect to increasing the figure

size just write matplotlib dot rc params

so with respect to any parameter that

you want to change you can basically use

this here i have set it to 12 comma 6

now once you see this diagram from this

diagram you can definitely find out a

lot of information this diagram looks

super cool

zero rating is more than 2000 over here

then you can see 2.2 2.3 2.9

overall it looks like a gaussian curve right

whenever you have a gaussian curve you

get a good feeling yes

now let's do one thing

over here you can see that rating color

is also there so it is always a good way

that we should also color this aggregate

rating with the help of colors that is

given over here

so this is the code everybody write it down

x aggregate rating y rating count

now as i said okay i have this

rating color i have this white red

and all should we use these colors over

here also and probably try to get it in the

form of colors and then try to see it so

that also we will try to do okay so to

get the colors uh what i'm actually

going to do i'm just going to copy the

same thing entirely there is one more

parameter which is called as hue

so if i write hue

is equal to

rating color

if i write this

and execute it

you will be able to see

orange color green color red color and

all but understand whatever color is shown

this is not matching right

white looks like blue so

wherever you can see blue right it is

basically showing you zero rating but

according to the rating color

zero should have white color right

so what i'm actually going to do over

here is that we have to map the colors

also now how to map the colors we will

try to see so mapping the colors let's

see over here so mapping i'm going to

basically use palette

and inside this palette i'm going to

basically use different different colors

so the first color that i want to show

over here is something called as white

the second color that i want to show is red

the third color that i want to show is orange

it should be in a list okay the fourth

color that i want to show is yellow

the fifth color that i want to show is

green

the sixth color also i want to show

in green so here is what i have written

in palette this palette is

a parameter that is present in bar

plot where you can give your own colors

based on your

requirement so once i execute this now

let's see some error is there

okay palette spelling is wrong i guess it

should end with tte

p a l e t t e palette

so once i execute this

and let's see now

so now you can see that i'm getting the

perfect color
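A small sketch of the `hue` and `palette` step, using a made-up `ratings` table with assumed column names. The first palette entry is `blue` rather than `white` because white bars are invisible on a white background. Passing `hue_order` explicitly is an addition here so that each color is paired with a known level.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the grouped ratings table.
ratings = pd.DataFrame({
    "Aggregate rating": [0.0, 2.2, 2.9, 3.4, 3.9, 4.4, 4.9],
    "Rating color": ["White", "Red", "Orange", "Orange", "Yellow", "Green", "Dark Green"],
    "Rating Count": [2148, 27, 381, 495, 332, 143, 61],
})

# Hue levels in the order they first appear, and one color per level.
hue_levels = list(ratings["Rating color"].unique())
palette = ["blue", "red", "orange", "yellow", "green", "green"]

ax = sns.barplot(
    x="Aggregate rating", y="Rating Count",
    hue="Rating color", hue_order=hue_levels,
    data=ratings, palette=palette,
)
plt.close("all")
```

The list must have at least as many colors as there are hue levels, otherwise the colors get reused.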

right white is white then red

then orange then this then this then

this now from this also

what kind of observation you can

basically get

right what kind of observation

maximum number see i'll again write

observation first of all you write down

the code everybody

you'll be able to see that i'm getting

the colors but just go ahead and write

down the code and quickly see what

type of graphs we are able to get over here

white is invisible don't worry it's fine if

you want to make it a different color

then instead of white use blue

now from this what kind of observations

we can actually get

so observations i'll write it down over

here again

first observation that i would like to make

the not rated count which basically means this blue color

is very high

the second observation that you can see is that

maximum number of ratings

are between 2.5

to 3.4

so definitely these two observations you

can basically find out

clear everybody

now just imagine that if you have

some ratings as missing then what do you do

suppose let's say that a person has

rated but you have some missing values now

can't you think that now probably you

can use the values between 2.5 to 3.4 as

an average

right so this is what

type of observations you can basically

have because maximum number

of ratings are between

2.5 to 3.4 so you will try to find out

the average between them and then try to

use it

so i hope you are having fun guys
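One way to read the idea above is as a simple imputation: take the average of the ratings that fall in the most common band and use it for the missing ones. This is only a sketch under that assumption; the `scores` series is made up, and the session itself does not show this code.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing values (NaN).
scores = pd.Series([3.1, np.nan, 2.7, 4.5, np.nan, 3.3, 2.9], name="Aggregate rating")

# Keep only the ratings inside the most common band, 2.5 to 3.4 (inclusive).
in_band = scores.between(2.5, 3.4)
band_mean = scores[in_band].mean()

# Fill the missing ratings with the band average.
filled = scores.fillna(band_mean)
```

Whether this is a sensible fill depends on why the ratings are missing, so treat it as a starting point rather than a rule.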

now the next step

we will also see right now we have just

seen with respect to aggregate rating

and rating count i probably also want to

use with respect to just the coloring

part this rating color i want to plot

this as a count plot so count plot let's

plot it so i'm going to use sns dot countplot

and here i'm basically going to use x is

equal to rating

color okay

count plot

we basically use for plotting with

respect to categorical variables so here

also you basically give an x and y value

and hue value so here i'm giving x value

and then i'm also going to give my data

which is my ratings

and then again i can give my palette

over here with the same list

that i have actually defined above

the colors should be same right

so that is the reason i'm just going to

copy this entirely

and paste it over here

so once i execute it here you will be

able to see

i'm getting an error

every time i write the wrong spelling so

here you can see white

red orange yellow green

dark green

this is not with respect to rating count guys

don't worry okay this is the y axis you

are able to see over here but understand

in ratings what you have

you basically have

something like this right so white is

only one record

red is so many records right this is my

red there are around five records then

orange there are around seven to eight records

right yellow there are this many

records green there are this many records

don't consider that this count is

basically your rating count no this is

the frequency how frequently it appears
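A sketch of the count plot on a made-up `ratings` table. The bar heights are the number of rows per color in `ratings`, which the `value_counts` line reproduces as plain numbers. Passing the same column as `hue` with `dodge=False` is a choice made here because newer seaborn releases warn when `palette` is given without `hue`; the session passes only `x`, `data` and `palette`.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in: one row per distinct aggregate rating.
ratings = pd.DataFrame({
    "Rating color": ["White", "Red", "Orange", "Orange", "Yellow", "Green", "Dark Green"],
})
palette = ["blue", "red", "orange", "yellow", "green", "green"]

# Bar height = how many rows of `ratings` carry each color.
ax = sns.countplot(x="Rating color", hue="Rating color", data=ratings,
                   palette=palette, dodge=False)
plt.close("all")

# The same frequencies as numbers.
freq = ratings["Rating color"].value_counts()
```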

now let's go ahead and do some more

in-depth analysis now you will

get more confused now i'll give you a

question please try it out from your

side okay

find the countries name

that has given

zero rating

now this is one of my queries for you all

try to do it and i'll wait

let's try to do something guys you

should be getting some queries at least

very important interview question as a

data analyst find the

countries name that has given

zero ratings

please do it everybody

i'll be waiting for you

how do you do it

i'll also try till then

final underscore df dot columns

so i need to basically get

all the country names so country name is

obviously there

okay and

those who have given zero ratings

probably i can identify

zero ratings with

aggregate rating or

i can also identify with rating color

okay so two columns i can definitely

find it out with

so what i'm going to do over here is

that i'm just going to say rating color

let's use rating color

if i say the rating color is equal to white

white is capital or small

so if i execute this here i'll be

getting like this false false true true

so i'll just write final underscore df

and pass this condition so here so much information i'm getting now

city so many records are there

but i don't think this is right

because here i may see different ratings also

so here what i will do i'll do group by

and here i will specify my country

so if i execute this this is my group by

object so here again i'll be doing

dot size

dot reset index

if i execute this now i'm able to get it

brazil five zero ratings are

given india two one three nine

zero ratings have been given

united kingdom one united states three
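The filter plus group by described above can be sketched like this. The `final_df` frame is a tiny made-up stand-in, the column names `Country` and `Rating color` are assumed from what is read out, and the name `Zero Rating Count` is only an illustrative label.

```python
import pandas as pd

# Made-up stand-in for the merged Zomato frame.
final_df = pd.DataFrame({
    "Country": ["India", "India", "Brazil", "India", "United States", "Brazil"],
    "Rating color": ["White", "White", "White", "Green", "White", "Orange"],
})

# Boolean mask: True where the rating color is White (the zero rating rows).
is_zero = final_df["Rating color"] == "White"

# Keep those rows, count them per country, and turn the result back into a frame.
zero_rated = (
    final_df[is_zero]
    .groupby("Country")
    .size()
    .reset_index(name="Zero Rating Count")
)
```

Filtering on `Aggregate rating == 0` instead of the color would be the other option mentioned in the session.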

so again what are the observations that you

can basically make

so here write down the observations again

observation maximum

number of zero ratings

are from indian customers right

no no it's not about imbalanced data set

in this case

because if you see the data set right

over here two one three nine

zero ratings see out of the total

zero ratings how much is the total

that we saw

two one four eight

right and from them

two one three nine are from india

this is not getting used for models guys

because we don't know what we need to

predict right now we are just analyzing

the data taking out information from

that data

which currency

so this is my next question to you all

if you probably go and see final

underscore df.head

you will be able to see this specific thing

sorry dot columns i will just write

it as dot columns

so here you have um

let's see where it is currency is there okay

so

just try to do this

find out which currency is used by which

country if you want all the list of

records what you'll do so

what i'm actually going to do now

i'm going to use final underscore df

i want basically country with respect

to currency so what i'm actually going

to write over here i'm going to

basically say country

comma currency

and then i'm going to basically use

group by again

and group by will again be based on

these two columns

that is country and currency

dot size dot reset index

reset index is used in many ways

so here you can see i'm actually getting

australia dollar

brazil brazilian real canada dollar

india indian rupees

um indonesia rupiah new zealand and all

so two things one is group by then dot size

dot reset index that's it

you don't have to do group by on

everything you have to just focus on

two features
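A sketch of the country and currency lookup on made-up rows. The column names `Country` and `Currency` are assumed from the session, and the currency strings are illustrative.

```python
import pandas as pd

# Made-up stand-in for the merged Zomato frame.
final_df = pd.DataFrame({
    "Country": ["India", "India", "Australia", "Brazil", "India"],
    "Currency": ["Indian Rupees(Rs.)", "Indian Rupees(Rs.)", "Dollar($)",
                 "Brazilian Real(R$)", "Indian Rupees(Rs.)"],
})

# Select the two columns, group on both, count rows per pair, flatten the index.
country_currency = (
    final_df[["Country", "Currency"]]
    .groupby(["Country", "Currency"])
    .size()
    .reset_index(name="Rows")
)
```

Each output row is one country and currency pair, with the number of restaurants that carry it.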

now here one more feature is there see

has online delivery or not

so my next question to you all

for those people who have done this

the next question is that which

countries

do have

online delivery option

so india has two four two three uae has

two eight amazing

that's nice

that basically means that the online

delivery is only available in india and

uae but let's say that i want to find out

uh all the countries that have or have not

okay i will just use this code

so reset index that's it so what he has

done is that

he's basically used uh

two features has online delivery and country

group by has online delivery and country

and size dot reset index so here you can

basically see that australia it does not

have any brazil no online delivery

canada no online delivery

why india is getting repeated again

okay in india also probably in some of

the regions online delivery may not be

there perfect

in india in some of the regions you will

not be finding online zomato delivery

available okay so because of that

some records will show no

but here you can see the main two countries

that have online delivery are india and uae
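A sketch of the online delivery breakdown on made-up rows. The column name `Has Online delivery` and its `Yes` and `No` values are assumed from the session.

```python
import pandas as pd

# Made-up stand-in for the merged Zomato frame.
final_df = pd.DataFrame({
    "Country": ["India", "India", "UAE", "Australia", "India", "UAE"],
    "Has Online delivery": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Count restaurants for every (delivery flag, country) pair.
delivery = (
    final_df[["Has Online delivery", "Country"]]
    .groupby(["Has Online delivery", "Country"])
    .size()
    .reset_index(name="Rows")
)

# Countries that have at least one restaurant with online delivery.
online_countries = sorted(
    final_df.loc[final_df["Has Online delivery"] == "Yes", "Country"].unique()
)
```

A country shows up twice in `delivery` when it has both kinds of restaurants, which is why India is repeated in the output.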

so obviously make some observation from

this and try to find out

so here i'm going to

say observations again

what is my observation over here

i will basically say

my first observation is that

online deliveries

are available in

india and

uae done

next question

now the next thing that i am actually

focusing on is that i'll give you one

question like how we did with respect to

the countries

similarly try to create a

pie chart for cities i hope everybody is understanding the question

so here if i write final underscore df.columns

you will be able to see there is also a city

now i want to create a pie chart

again the same thing like how we did it

i'll go up

and i'll copy these two things let's see

so here is one

here instead of writing country i will

write it as city

then this is my values this is my index

so these are my cities

from where the order has happened and i'll

try to draw a plot

pie plot okay so here i'll say plt dot pie

and here i'm going to give two things

one is with respect to

values and then with respect to index

final underscore df country dot values i

hope this works

the second one

i have to basically give as labels okay

or let me make it a little bit easy for you

okay city values

i'm going to save it in this city labels

i'm going to save it in this

and this will basically be using index

so i've executed this so this will go

with city values

and this entirely will go with respect to

city labels so let's say that i want to

get the top five cities

pie chart for top five city distribution

so here i

will just use the first five

so once i execute this here you will be

able to see this

oh it's coming as india why

plt dot pie city values city labels why

i think there is some mistake

final underscore df

oh i have to use city

i had copied right so

don't do copy paste

so new delhi has the maximum number of transactions

gurgaon noida ghaziabad and faridabad why

not bangalore i think in the data set

bangalore is not given

after this i'll also add autopct

percentage

so if i go and see this here you'll be

able to see percentage

so maximum number of transactions is from new delhi
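A sketch of the top five cities pie chart on made-up rows. The column name `City` is assumed from the session, the counts are invented, and the `autopct` format string is one common choice rather than necessarily the one typed in the video.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in: one row per restaurant, with distinct counts per city.
cities = (["New Delhi"] * 6 + ["Gurgaon"] * 5 + ["Noida"] * 4
          + ["Faridabad"] * 3 + ["Ghaziabad"] * 2 + ["Pune"])
final_df = pd.DataFrame({"City": cities})

# value_counts sorts from most to least frequent.
city_counts = final_df["City"].value_counts()
city_values = city_counts.values   # the counts
city_labels = city_counts.index    # the city names

# Slice the first five entries and show each share as a percentage.
plt.pie(city_values[:5], labels=city_labels[:5], autopct="%1.2f%%")
plt.close("all")
```

Using `final_df["Country"]` here by mistake is what produced the India slice in the session, so the column name is the thing to check first.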

so guys overall how was the session everybody

one assignment for you so

assignment

find the

top 10 cuisines

cuisines basically means food okay

food item so this will be for you

just do it one assignment and remaining

all i think i have done it
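One possible starting point for the assignment, following the same `value_counts` pattern as the city example. The column name `Cuisines` is an assumption about the data set, the rows are made up, and counting whole cell values (rather than splitting combined entries) is just one reading of the task.

```python
import pandas as pd

# Made-up stand-in with a hypothetical Cuisines column.
final_df = pd.DataFrame({
    "Cuisines": ["North Indian"] * 3 + ["Chinese"] * 2 + ["Fast Food"],
})

# Most frequent cuisine values first; head(10) keeps at most ten of them.
top_cuisines = final_df["Cuisines"].value_counts().head(10)
```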

now in this data set i had never used

this data set for doing machine learning

modeling i needed this data set to find

out what all information i can capture

from it and finally i was able to do so

many things right i i did not worry

about distribution and all that is the

part when we

actually create a model with respect to

the data set at that point of time we do

so i hope you liked this particular

session it was fun it was comedy it was

can we group the other cities under rest

yes obviously you can do it right

tomorrow

another amazing day another amazing data set

that we will be working on and

definitely you'll be able to learn a lot

as i said right

visit the website guys because here i'm

going to give the entire materials

have you seen my website how do you like

to rate my website guys

so this i created in three to four hours

probably i'll also start showing you how

to create websites

so this entirely i created three to four

hours so everything will get updated in

this article also

see this

this live session is going on right now

all the materials will get uploaded over

here data set materials

so please make sure you do this

and yeah

start exploring it

okay guys so thank you keep on rocking

i'll see you all in tomorrow's video and

yes i will see you in tomorrow's session

tomorrow we'll have more in-depth

session thank you everybody bye bye take

care thank you guys i hope everybody has

downloaded the data set you'll have this

two data set one is test and train right

i'll talk about the problem statement

and today we are also going to do

feature engineering

and both these things right as usual

today we are going to do

black friday

data set i'll talk about the agenda everything

eda and feature engineering we are going

to do both of these and we will keep our

data ready for model training ready

means cleaning

and preparing the data

for model training we are going to do this

today so these are the two things that we

are going to do this is the agenda

so after doing this

you can basically use any kind of model

and start working on it

so quickly what are the basic libraries

that are required start

importing them

write import pandas as pd i'll talk

about the problem statement what exactly

it is

import numpy

as np

import

matplotlib dot pyplot as plt

import

seaborn as

sns and then

matplotlib

inline so this is basically given in

kaggle okay

so in kaggle whenever you get a specific

data set what do you have to do with train

and test

all those steps i'll show you so that you

can also participate in kaggle so let's

go ahead with first

of all importing the data set always

make sure that you write the comment

so importing the data set the data set

is already given to you so let's say i'm

going to name it as df train because i

have two data sets one is train and one

is test data so this df train i'm just

going to write pd dot read csv

and i'm just going to give my data set

name

black friday underscore train

dot csv i have renamed the file guys for

you it will be train dot csv okay

and then if i probably write df

underscore train dot shape

i will be able to

see it

or if i write df underscore train dot head i'll be able to

see it

so i'll talk about the data what this

data is basically about uh so this data

is an e-commerce data

so

people who have bought some kind of

products

and based on that we need to predict

what is the purchase capacity again

understand

i'm just going to basically talk about

the problem statement

here we want to build a model i'm just

going to

put the problem statement over here

everybody read the problem statement anyhow i will be giving you all these

things materials everything

in the github don't worry

so

i'll also put the data set link over

here

so the data set link is this

and this will get saved over here so what is the problem

statement so this is the problem

statement that we are going to focus on

so the problem statement is that a

retail company abc private limited wants

to understand the customer purchase

behavior this is an e-commerce data set the data

set is also very huge so it will be very

good to work on it against various

products of different categories they

have shared purchase summary of various

customers for selected high volume

products from last month

the data set also contains customer

demographics like age gender marital

status city type stay in the current

city product details product id and

product category and total purchase

amount from last month so

over here now

they want to build a model to predict

the purchase amount of customer against

various products that will help them to

create a personalized offer for customers

against different products

so this is the problem statement over

here the problem statement is very

simple you need to create a model to

predict the purchase amount of a

customer against various products right

so suppose if i give this

information like this product with this

product information all these things i

give then we should create a model that

will be able to

predict this purchasing capacity

right so this is the entire information

regarding the problem statement okay

so this is what we are going to do

interesting we'll solve the problem here

only in front of you so i have

read

the training data set the next step that

you have to do is basically start

reading the

test data set now train data set test

data set see whenever you are given

train and test obviously what initially

you have to do you have to combine them

in a kaggle competition always remember

to combine them so that all the data

pre-processing that we do we can

perform on both the data sets so here i

am going to now import

the test data

right so here i am going to say df

underscore test

is equal to pd dot read underscore csv

and here i'm going to basically write black friday dot csv

df underscore test dot head

in the test data you will not be able to

find the output variable so

here you can see

only till product category 3 is there

in train the additional purchase column is there

right

so now if you want to combine the train

and test data how do you do it the next

statement is merge

both

train and test data so how do you merge

both train and test data

can we use pandas.merge or pandas.concat

or pandas append what do you want

to use

let me try it a different way now

here i'm basically going to say

df underscore train

dot append

and df underscore test

so what will append basically do

you can see the definition over here

append rows of other to the end of the

caller returning a new object right so

i'm just going to do this there is also

one more parameter that i see with

respect to sort so sort by default is

false right so i'm just going to execute

this

and then i'm basically going to store

this inside my df

so this is my df dot head now

you can also append it in

different ways i have no problem

okay it is up to you

so this is the first step that we have

actually done merge also you can do

but again understand we have to append

it at the bottom right we are not

merging it side by side

so if you want to do it with merge and it is

possible with merge try to do it

instead of writing merge here i should

have written append

merge also you can do it okay
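A sketch of reading the two files and stacking them, using in-memory text in place of the real CSV files so it runs anywhere. The column names (`User_ID`, `Product_ID`, `Gender`, `Age`, `Purchase`) and the rows are assumptions for illustration. `pd.concat` is used instead of `DataFrame.append` because `append` was removed in pandas 2.0; on older pandas, `df_train.append(df_test)` gives the same stacked result.

```python
import io
import pandas as pd

# Stand-ins for the train and test CSV files (made-up rows).
train_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Purchase\n"
    "1000001,P00069042,F,0-17,8370\n"
    "1000002,P00285442,M,55+,7969\n"
)
test_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age\n"
    "1000004,P00128942,M,46-50\n"
)

df_train = pd.read_csv(train_csv)
df_test = pd.read_csv(test_csv)

# Stack test rows under train rows; columns missing in test become NaN.
df = pd.concat([df_train, df_test], ignore_index=True, sort=False)
```

Rows that came from the test file can be recognised later by their missing `Purchase` value, which is useful when the combined frame has to be split again. `ignore_index=True` is a choice made here to get a clean 0..n-1 index.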

so this was the next step now let's go

to the next step everybody

so the basic

code that we have seen already right one

is df.info

we can check out this one here we can

understand how many different types

of features are here

so obviously int is there object is

there object is there object is there int

is there object int int float float float

is there so definitely when you see

product id it will be a combination of

both integers and

other characters so it is basically an

object then you have gender obviously it

has male and female so categories that

is an object age is basically an object

why age is an object because here you

will be able to see age is given in some

range 0 to 17 55 plus so this i

can consider as a categorical variable

i'll also show you how to solve that

particular problem but i hope

everybody has got the understanding till

here the next statement that we are

going to basically use is something

called

df.describe just to find out like what

are the percentile values and all so here

is just the basic information

that we are going to get
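A sketch of the two inspection calls on a made-up frame. The column names are assumptions modelled on the ones mentioned in the session. `info` lists each column with its type and non-null count, and `describe` summarises only the numeric columns by default.

```python
import io
import pandas as pd

# Made-up stand-in for the combined train and test frame.
df = pd.DataFrame({
    "Product_ID": ["P00069042", "P00285442", "P00128942"],
    "Gender": ["F", "M", "M"],
    "Age": ["0-17", "55+", "46-50"],          # ranges, so stored as text
    "Product_Category_2": [None, 6.0, 14.0],  # float column with a missing value
    "Purchase": [8370.0, 7969.0, None],
})

# info() prints to the screen by default; here it is captured as text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# count, mean, std, min, percentiles and max for the numeric columns.
summary = df.describe()
print(summary)
```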

so which column do you think out of

this is just waste you can directly

blindly delete it

see over here there is a column which is

called as user id user id

is just a unique id over here

so you can definitely go ahead and

delete it okay user id will be of no use

product category and everything else will

be getting used don't worry about that

but user id is definitely not useful so

i am going to delete it so what i am

actually going to do i am going to

basically write df.drop

this is a method which we basically use to drop a feature and

here i can give any number of features

with respect to my feature name so

feature name is nothing but user

underscore id

so i'm just going to copy this paste it

over here user underscore id and here

one very important parameter if i

see in df.drop is axis

axis is equal to 0 basically means

row wise axis is

equal to 1 basically means

column wise so we really need to

drop it column wise so here i'm going to

basically say axis is equal to

1 and here i'm going to specify in place

is equal to true

in place is equal to true what it

will do is that it will remove that user

id and it will update automatically into

the df value so if i go and probably

execute it and now if i go ahead and see

df.head you will be able to see that

i'm actually able to see my product id

gender

all the other information perfect

all the other information perfect so here we have basically done this we

so here we have basically done this we have dropped the user id we have df.head

have dropped the user id we have df.head we have everything ready now let's go
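a minimal sketch of this drop step, assuming a small made up data frame. the column names User_ID, Product_ID and Gender are assumptions for illustration, since the names are only spoken in the session.

```python
import pandas as pd

# hypothetical sample rows; the real data set has many more rows and columns
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003],
    "Product_ID": ["P001", "P002", "P003"],
    "Gender": ["F", "M", "M"],
})

# axis=1 says the label is a column name; inplace=True updates df itself
df.drop(["User_ID"], axis=1, inplace=True)

print(df.head())
```

without inplace=True the call returns a new data frame and leaves df unchanged, so the result would have to be assigned back.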

now let's go ahead towards the data preprocessing side

now tell me how many categorical variables are there just by seeing this

one you have gender one you have age and then you have occupation city category stay in current city and so on

but before that i also need to check how many missing values are there

for the missing values i may do something which i will show you in the later stages but let's focus on fixing the categorical features right now

so how many categorical features are there you see over here gender is there age is there city category is also there

so we will try to fix these categorical features because our model will definitely not be able to understand the categorical values as they are

marital status is already numbers but let's see what all things will be there

so let us go ahead and take up age and try to convert this categorical feature into a numerical one okay

so first of all let's focus on this and let's go ahead

now tell me with respect to gender i have male and female right

now what kind of encoding can i use for male and female

so if i write pd dot get underscore dummies and give it df of gender and execute it here i get the columns f and m

here i am getting ones and zeros right in the f column one is given to female and zero is given to male

okay so you can do it this way but again see what the problem is here

if i convert in this way then i have to create another data frame then add that data frame over here and then delete the gender column

can i do something within the data set itself where wherever the gender is f i convert it into 0 and wherever the gender is m i convert it into 1

so how are we going to do that guys

yes i can definitely use drop underscore first is equal to true over here

but understand i don't want to do it this way because i have to save this somewhere and then add a column over here

i want to find a way to do it directly in this particular data frame itself so how do i do it
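a rough sketch of the get dummies route and the extra steps it needs, assuming a made up Gender column with the values F and M. the variable names dummies, dummies_m and df_encoded are hypothetical.

```python
import pandas as pd

# hypothetical sample column
df = pd.DataFrame({"Gender": ["F", "M", "M", "F"]})

# one indicator column per category: F and M
dummies = pd.get_dummies(df["Gender"])

# drop_first=True keeps only M, because F is implied whenever M is 0
dummies_m = pd.get_dummies(df["Gender"], drop_first=True)

# the extra work: attach the new column, then drop the original one
df_encoded = pd.concat([df, dummies_m], axis=1).drop("Gender", axis=1)

print(df_encoded)
```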

so for this i will be using a simple code so i'll write df of gender and here i will say df of gender dot map

see what does the map method do

the map method will map the values with respect to the conditions that i am giving over here

so here my first condition is that wherever i get f for female i'm going to convert it into 0 and wherever i get m for male i'm going to convert it into 1

many people ask me when i'm teaching what the map functionality in python is so here you can easily see it on this particular data set

now if i write df dot head and see this you will be able to see that gender is now zeros and ones so everybody write down this code okay

one more way is that i directly assign this back to df of gender right so this way also you can do it

so both the ways whichever way you feel you want to do it just do it both the ways will work
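the map approach can be sketched like this, assuming the Gender column holds the strings F and M. the sample rows are made up.

```python
import pandas as pd

# hypothetical sample column
df = pd.DataFrame({"Gender": ["F", "M", "M", "F"]})

# map replaces each value using the dictionary;
# any value missing from the dictionary would become NaN
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})

print(df.head())
```

this keeps the encoding inside the same data frame, so no extra data frame has to be created and joined back.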

this is not ranking guys zeros and ones are not ranking one two three four five six is basically ranking

zahida sen asks do we have to apply feature engineering on the training set only or on the test data also

no you have to apply it on both and i'll show you how you have to apply it on both

okay perfect so everybody has done this right

so this is with respect to handling the categorical feature gender so this is done

now let's go to the next step gender is done now we also need to handle the categorical feature age

now why am i specifically saying age because if you go and see age is also a categorical feature see 0 to 17 55 plus

so the first thing i will execute is df of age dot unique

this will give me the unique values that are there in age like 0 to 17 55 plus 26 to 35 46 to 50 51 to 55 36 to 45 and 18 to 25

now with these unique values tell me how should i convert this categorical feature into a numerical feature

here also i can do encoding and many people will again get confused over the type of encoding i will be doing and ask why are you doing it like this so i'll just tell you

here also i'm going to use df of age right

two things i can definitely do one is to directly use get dummies

see this if i write pd dot get underscore dummies and give it df of age i'll be able to get columns like this

and if i set drop underscore first is equal to true then i will get it like this

then i can save this with these column names and put it inside my data frame i can do this okay

but just imagine something guys here domain knowledge will definitely come in and this is one very important thing

do you think shopping by 0 to 17 year olds will be very less right on an e-commerce website it will be very very less

whereas for 26 to 35 it may be more for 18 to 25 it may be more and for 51 to 55 it may be more

for 55 plus it may be very less right or for 46 to 50 it may also be very less

so here we will not convert this into dummies let's do ordinal encoding only

let's give some rank to it let's directly give some values like 0 1 2 3 4 5

why am i saying to give 0 1 2 3 4 5 because when i'm training the model my model's maths will definitely be able to understand the values that we have given along with the other features

this is also called target guided ordinal encoding so we will do something like this

okay but this get dummies approach will definitely not work here and it is not a very good practice either so i'm just going to comment it out

instead what i will do is apply the same map function which i had applied over here

so here i'm going to write the map function and put the mapping inside it and here i'll definitely say age

the first mapping i will do is for 0 to 17 and let's say i am giving it the number 1 because at least some value should be there

then in the sorted order 18 to 25 is my second one and here i will give it as 2

then the third one again in the sorted order is 26 to 35 and i will give it as 3

because see when i'm training my model it will be able to understand this is called target guided ordinal encoding

then what do we have 36 to 45 i hope i'm doing it right colon and here i'm going to give it as 4

for 46 to 50 i have 5

then i will write 51 to 55 and i will give it as 6

and then 55 plus i'll give it as 7
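a minimal sketch of the age mapping, assuming the age groups are stored as strings such as 0-17 and 55+. the exact label spelling is an assumption. note that this mapping ranks the groups by age order only. the session calls it target guided ordinal encoding, but nothing here is computed from the target column.

```python
import pandas as pd

# hypothetical sample column covering the seven age groups
df = pd.DataFrame({"Age": ["0-17", "55+", "26-35", "46-50", "51-55", "36-45", "18-25"]})

# list the distinct groups before encoding
print(df["Age"].unique())

# ranks follow the sorted order of the age groups, starting at 1
age_order = {
    "0-17": 1,
    "18-25": 2,
    "26-35": 3,
    "36-45": 4,
    "46-50": 5,
    "51-55": 6,
    "55+": 7,
}
df["Age"] = df["Age"].map(age_order)

print(df.head(7))
```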

label encoding can also be done label encoding will also work perfectly

but again understand for label encoding you have to import a library and then perform it

here you can also do it this way and you will become good at the maths

don't put zero guys see as i'm saying there will be some mathematical equations happening

so if you want to do label encoding how will you do it

label encoding in python let's see some articles i have an article from geeksforgeeks

let's see so i would have to basically use this entire code with preprocessing dot label encoder and all

but i don't want to do it because as i get a new data set i should be able to apply all these things over there also

so here i'll just copy this from sklearn you can see it over here right

and then you can do it with respect to df of age assigning the result back to df of age

so you can execute this and automatically it will work do not hesitate to google it is up to you right

so you can also do it this way this is the second technique

so i have already done the mapping now if i go and see my df dot head okay i have not executed that

so i have executed it now and if i go and see my df.head you will be able to see one two three four like that in age

see there will be hundreds of ways

with label encoder you do fit transform and for the test data you have to do only transform but here i've actually combined the data so that is not a good practice for this case

suppose i'm doing it separately for train data and test data then on the test data i will just transform it

no need to give any weightage guys arvind see our machine learning model will automatically understand
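a sketch of the label encoder route with separate train and test frames, assuming scikit-learn is installed. the frames train_df and test_df and their rows are made up. LabelEncoder numbers the sorted classes from 0, so the codes differ from the manual mapping that starts at 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical train and test frames
train_df = pd.DataFrame({"Age": ["0-17", "55+", "26-35", "18-25"]})
test_df = pd.DataFrame({"Age": ["26-35", "0-17"]})

le = LabelEncoder()

# learn the classes on the train data and encode it
train_df["Age"] = le.fit_transform(train_df["Age"])

# reuse the learned classes on the test data, no refitting
test_df["Age"] = le.transform(test_df["Age"])

print(le.classes_)
print(train_df)
print(test_df)
```

transform raises an error if the test data contains a label that was not seen during fit, which is one reason to check the unique values first.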

so one more categorical feature that we have seen is city category see oh yes city category is also there

for this i will just use pd dot get dummies and then you can combine the result with the data frame

so here i can say pd dot get underscore dummies and give it df of city underscore category which is the column name and here i'm going to say drop underscore first is equal to true

so fixing the categorical feature city underscore category

so here i have all my values and i'm just going to save this in one variable let's say df underscore city

so df underscore city dot head is this one

now i have to combine these city columns with the df which i have shown you before

i hope everybody has done till here so these two features will now get combined into this data set

in order to combine them into this data set i will write pd.concat and here i'm going to give df and df underscore city

and when i'm doing concatenation i also have to give my axis value as 1

i will save this in my df and check df.head

so if i go to the end you will be able to see b and c

now i don't require city category anymore so i can drop it but i hope everybody is able to understand
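the city category steps can be sketched like this, assuming a column named City_Category with the values A, B and C. the sample rows and the Purchase numbers are made up.

```python
import pandas as pd

# hypothetical sample rows
df = pd.DataFrame({
    "City_Category": ["A", "C", "B", "A"],
    "Purchase": [8370, 15200, 1422, 1057],
})

# indicator columns for B and C; A is dropped by drop_first=True
df_city = pd.get_dummies(df["City_Category"], drop_first=True)

# axis=1 places the new columns side by side with the existing ones
df = pd.concat([df, df_city], axis=1)

print(df.head())
```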

so why drop underscore first is equal to true because always understand if i have three categories two columns are sufficient to represent all the three categories
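a small check of this point, assuming the categories A, B and C. the helper decode is hypothetical and only shows that the original category can be recovered from the two remaining columns.

```python
import pandas as pd

# hypothetical sample values
city = pd.Series(["A", "B", "C", "A"], name="City_Category")

# keep only the B and C indicator columns
dummies = pd.get_dummies(city, drop_first=True).astype(int)

def decode(row):
    """Recover the category from the B and C indicators."""
    if row["B"] == 1:
        return "B"
    if row["C"] == 1:
        return "C"
    return "A"  # both indicators are 0, so it must be the dropped category

recovered = dummies.apply(decode, axis=1)

print(recovered.tolist())
```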

now let me go to the next step and quickly drop city category because i don't require this feature now right

so i'm just going to do df.drop and here i'm going to give my column name which is city category

but again understand here your axis will be 1

so what is the error not found in axis why okay the name is city underscore category

guys understand why we are doing this because when any new data comes we have to follow this entire thing again

okay these are the entire steps you have to follow whatever encoding and everything we have done here will be done there also

so here you can see with df dot drop city underscore category axis is equal to 1 that particular feature is gone

so to make this operation permanent i'm going to use inplace is equal to true

so if i now go and check df.head here it is b and c are there and everything else is there

so we have fixed all these things till here we have done good work till here
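a sketch of the drop step including the misspelled column name that produces the not found in axis error, assuming the column is named City_Category. the sample rows are made up.

```python
import pandas as pd

# hypothetical frame after the dummy columns B and C were added
df = pd.DataFrame({"City_Category": ["A", "C"], "B": [0, 0], "C": [0, 1]})

raised_key_error = False
try:
    # wrong label: a space instead of an underscore
    df.drop("City Category", axis=1, inplace=True)
except KeyError as err:
    raised_key_error = True
    print("drop failed:", err)

# correct label; inplace=True makes the change permanent on df
df.drop("City_Category", axis=1, inplace=True)

print(df.head())
```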

now let's go and check the missing values

city category is one categorical feature another is age and one is gender so three categorical features we have fixed up

in concat axis is equal to 1 basically means we are appending that specific data frame column wise and in drop axis is equal to 1 basically means we are deleting the column

guys again i have told you for eda basics the prerequisite is that you need to know python you need to know some basic things if you are not knowing it it will be difficult

now with respect to missing values what i'm going to do is df dot isnull and then dot sum

df dot isnull dot sum this is also a function

now here you can see product category 2 has so many null values product category 3 has so many null values and purchase also has so many null values amazing
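a minimal sketch of the missing value count, assuming made up rows. the column names Product_Category_1, Product_Category_2, Product_Category_3 and Purchase are assumptions based on the names spoken in the session.

```python
import numpy as np
import pandas as pd

# hypothetical sample rows with some missing entries
df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 8.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
    "Purchase": [8370.0, 15200.0, np.nan, np.nan],
})

# isnull marks each missing cell as True; sum counts them per column
missing = df.isnull().sum()

print(missing)
```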

now whenever null values are there everybody will get shocked about what to do now

okay product categories are there should we replace the missing categories with something just tell me

why are nulls there in purchase because the rows where purchase is null are the test data so those should be null only

but these two product category features we should definitely fix up

so what we are going to do is focus on replacing the missing values

now when i focus on replacing the missing values i'm going to replace them for these two features so we have to do some data exploration for these two features

so i'm going to write df of product underscore category underscore 2 dot unique

now tell me guys what kind of feature is this

will this be a discrete feature or will this be a continuous feature

this is a discrete feature guys see discrete because the same values are only getting repeated

the people who have attended my stats session will definitely know this entire thing

okay so over here you can specifically see that this is entirely a discrete feature

now in a discrete feature if i have nan values what is the best way to replace the missing values tell me quickly

there should be a lot of discussion on this

so tell me what should be a better way to replace the missing values

i'll also make your work a little bit easy i will write product category 2 and say value underscore counts

value counts will give me the count of each of the values that are present in this feature okay

so here you can see 8 is having this many records 4 is having this many records 6 is having this many records

what do you think if i want to replace the nan values what is the best way to replace them in this feature

so here with respect to any categorical feature or discrete feature the best way is to replace the missing values with the mode

okay don't use mean guys because mean will create a new category altogether

replacing with the mode is very simple and how do you do it just let me know how do you replace that with mode tell me guys

First of all, I'll write some simple code for you. I will start with df['Product_Category_2']. Please think it over and try to write the code yourselves, guys.

Here I will use the fillna() function, which I have explained in a lot of my lectures. So I'll write .fillna(), and inside it I am going to pass df['Product_Category_2'].mode(). See, if I copy just this part and run df['Product_Category_2'].mode() on its own, what output will I get? I get two things: an index 0 and the value 8.0. That is because mode() returns a Series. To pick out the value itself I can use indexing: with df['Product_Category_2'].mode()[0] I get 8.0.

So I copy that expression into fillna() and assign the result back to df['Product_Category_2'], so that the change is reflected in the DataFrame. Now if I write df['Product_Category_2'].isnull().sum(), you can see that the count is 0, which means the replacement has happened. Clear?
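The mode replacement described above can be sketched on a toy column. This is a minimal illustration rather than the notebook's exact code: the column name Product_Category_2 and the sample values are assumptions chosen to mirror what is said in the session.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the column name and values are assumptions.
df = pd.DataFrame({"Product_Category_2": [8.0, 4.0, np.nan, 8.0, 6.0, np.nan, 8.0]})

# Count how often each category code occurs (NaN is excluded by default).
counts = df["Product_Category_2"].value_counts()

# The mean is a fractional value that is not an existing category code.
mean_value = df["Product_Category_2"].mean()

# mode() returns a Series (there can be ties), so take the first entry.
mode_value = df["Product_Category_2"].mode()[0]

# fillna() returns a new Series by default, so assign the result back.
df["Product_Category_2"] = df["Product_Category_2"].fillna(mode_value)

missing_after = df["Product_Category_2"].isnull().sum()
print(counts)
print(mean_value, mode_value, missing_after)
```

On this toy column the mean is not one of the codes 4, 6 or 8, which is the "new category" problem mentioned above, while the mode is an existing code.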

It's an interesting problem, and the data set is also quite huge. Now I'll do the same for Product_Category_3, because it also has a large number of missing values. So let me add a comment: Product_Category_3, replace missing values.

Again I'm going to paste the same code here and change the column to Product_Category_3, followed by .unique(). Here I also see values such as 14 and 17. If I also want to see the counts, I can write .value_counts(), and here are all the values with their counts. So let's go ahead and replace the missing values with the mode. Again I'm going to copy the code that replaced the missing values; here it is.

Okay, I'm just going to use it with Product_Category_3. Now if I execute it and look at df.head(), here it is: Product_Category_3 is fixed as well.

Why shouldn't we just remove these product categories? Because if I look at df.shape, you have around 7 lakh 83 thousand records, and if you look at the missing counts, around 5 lakh and around 2 lakh values are missing in these columns. You cannot just drop that; it may well be important information.

Someone asks: you said that missing values in the Purchase column are fine because those rows are the test data, but isn't the train test split random? No, we don't just do one random split; we do cross validation.

Let's go to the next step. Is anything left? One more categorical column is this one, Stay_In_Current_City_Years. What do you think we should do with it? Let me add a comment for Stay_In_Current_City_Years. If I write df['Stay_In_Current_City_Years'].unique(), I get 2, 4+, 3, 1 and 0. We can simply treat 4+ as 4: the values are ordered, so even if the real number of years is higher, treating it as 4 is fine. So we can replace 4+ with 4. Now tell me how to do it.

I will write df['Stay_In_Current_City_Years'].str.replace(), and I'm going to replace '+' with a blank string. If I run this, I can see all the cleaned values, so I save the whole expression back into df['Stay_In_Current_City_Years']. When I execute it, it's done. There is some warning, but it's okay. So here you can see that I no longer have 4+; I've fixed it.
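The replacement of "4+" can be sketched as follows. The column name is an assumption, and so is the guess about the warning: older pandas versions warned when str.replace was called with a single-character pattern and no explicit regex argument, which may be what appeared in the session. Passing regex=False makes the intent explicit.

```python
import pandas as pd

# Toy column with the values listed in the session; the name is an assumption.
df = pd.DataFrame({"Stay_In_Current_City_Years": ["2", "4+", "3", "1", "0", "4+"]})

# regex=False treats "+" as a literal character rather than a regex quantifier.
df["Stay_In_Current_City_Years"] = df["Stay_In_Current_City_Years"].str.replace(
    "+", "", regex=False
)

remaining = sorted(df["Stay_In_Current_City_Years"].unique())
print(remaining)
```

The values are still strings at this point; only the trailing plus sign has been removed.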

Any more categories? Today we are just focusing on handling the categorical columns. Now let's do one thing. Even though we have been checking categories, we should check other things too. If I write df.info(), we see that Product_ID is an object, which is fine; Gender is an integer, which is fine; Age is an integer; Occupation is an integer. Stay_In_Current_City_Years is still an object, even though its values now look like 2, 2, 4, 4, 4. So we need to convert this object column into integers; that is a major step we have to do. You may well get this kind of task yourself: convert an object column into integers. Can anybody quickly tell me how to do it? It's very simple. I'm just going to write df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype(int).

If I do this, it's done. Now if I write df.head() or df.info(), you will be able to see it: Stay_In_Current_City_Years is now int32. You can also make it int64 by passing 'int64' directly to astype().
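A minimal sketch of the object-to-integer conversion, assuming the column already contains only digit strings. The int32 result seen in the session is plausibly the platform default integer (for example on Windows with older NumPy versions); that explanation is an assumption, and naming "int64" explicitly avoids depending on it.

```python
import pandas as pd

# Toy column after the "+" has been stripped; the column name is an assumption.
df = pd.DataFrame({"Stay_In_Current_City_Years": ["2", "4", "3", "1", "0"]})

# Digit-only strings convert cleanly; a leftover "4+" would raise a ValueError.
df["Stay_In_Current_City_Years"] = df["Stay_In_Current_City_Years"].astype("int64")

print(df.dtypes)
```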

There are two more columns, B and C, which are uint8. What is uint8? It is an 8-bit unsigned integer, ranging from 0 to 255. That's okay as it is, but you can also convert it to int. To do that I will just use these two lines of code for B and C: df['B'] = df['B'].astype(int), and the same thing, copied and pasted, for df['C']. Now if I look at df.info(), you will be able to see the change.
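The uint8 range and the conversion can be checked directly. In this sketch the columns B and C are built by hand as uint8 so that it runs the same way on any pandas version; how they were created in the notebook (presumably as dummy variables earlier in the session) is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flag columns stored as 8-bit unsigned integers.
df = pd.DataFrame({
    "B": np.array([0, 1, 0], dtype="uint8"),
    "C": np.array([1, 0, 0], dtype="uint8"),
})

# uint8 holds the whole numbers 0..255, which is plenty for 0/1 flags.
limits = np.iinfo(np.uint8)

# Convert both columns in one call instead of repeating the line per column.
df = df.astype({"B": "int64", "C": "int64"})

print(limits.min, limits.max, df.dtypes.to_dict())
```

Since 0/1 flags fit comfortably in uint8, the conversion is optional, as said above; it only makes the column types uniform.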

Now that this is done, the best visualization, I feel, is in seaborn: it is called sns.pairplot(). If I call pairplot and just pass df, you will see an amazing diagram, but it will take a lot of time because there are so many data points and so many columns.

Okay, it is giving me an error. Let's see what the error is: "cannot reindex from a duplicate axis". Why has this error come? It's okay. A column such as Product_ID gets left out of the pair plot anyway, so that may be the reason it gives an error. I'll have a look at this, don't worry. Till then, let's see some other visualizations.
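The session leaves this error unresolved. pandas raises "cannot reindex from a duplicate axis" when an operation has to align on an index that contains repeated labels. One situation that produces such an index is stacking a train frame and a test frame without resetting the index; whether that is what happened in this notebook is an assumption. A minimal sketch of how to check for it and fix it, on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical stand-ins for a train and a test file, each indexed from 0.
train = pd.DataFrame({"Age": [1, 2], "Purchase": [8370.0, 15200.0]})
test = pd.DataFrame({"Age": [3, 4], "Purchase": [float("nan"), float("nan")]})

# Stacking without ignore_index keeps both sets of labels: 0, 1, 0, 1.
df = pd.concat([train, test])
had_duplicates = not df.index.is_unique

# Rebuilding the index gives every row a unique label again.
df = df.reset_index(drop=True)

print(had_duplicates, df.index.is_unique)
```

If df.index.is_unique is already True on the real data, the cause lies elsewhere and this fix does not apply.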

I'll look into why the pair plot is not coming up, but I can definitely use another plot, like a bar plot. Let's say I use sns.barplot() and I want to compare Age with Purchase. This will help you find out who has purchased more and who has purchased less. There is also Gender here, so I'm going to use hue as Gender, and data is equal to df. So let's execute this.
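When reading this plot it helps to know what the bars represent: by default seaborn's barplot draws the mean of the y variable for each group, so a taller bar means a higher average purchase amount, not a larger number of purchases. The same numbers can be computed with a pandas groupby. This is a sketch on toy data; the column names and the integer codes for Age and Gender are assumptions.

```python
import pandas as pd

# Toy data; Age and Gender are assumed to be the integer codes used in the session.
df = pd.DataFrame({
    "Age": [1, 1, 1, 1, 2, 2, 2, 2],
    "Gender": [0, 0, 1, 1, 0, 0, 1, 1],
    "Purchase": [8000.0, 9000.0, 9500.0, 10500.0, 8200.0, 8800.0, 9800.0, 10200.0],
})

# One mean per (Age, Gender) pair: these are the bar heights that
# sns.barplot(x="Age", y="Purchase", hue="Gender", data=df) draws by default.
mean_purchase = df.groupby(["Age", "Gender"])["Purchase"].mean().unstack("Gender")

print(mean_purchase)
```

Replacing .mean() with .sum() or .count() answers the different questions of total spend and number of purchases per group.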

So this is the diagram you get. The ages are 1, 2, 3, 4, 5 and so on, which you can map back to the original age ranges, up to 55+, and each one is split by gender. Now, gender 0: did we replace that with male or with female? Where did we map Gender? I think we made female 0 and male 1, yes. So from this you can definitely come to some conclusion about whether females or males have bought more, and here you can see that males have the higher purchase amounts. We'll also try to look at this with respect to other columns.

This is nothing but a visualization of Age versus Purchase, so please write down your observations. What do you feel looking at this? The purchasing of goods is almost equal across the age ranges, but we can definitely conclude that the purchasing of men is higher than that of women.

Right? Is this possible? No! [Laughter] Purchasing of men is higher than women. So this is the observation I have made, which is not at all possible, right? But data does not lie. So purchases are roughly uniform across the ages, but purchasing of men is higher than that of women. Yeah, nice, I like it. So this is my first observation.

Next, let's visualize Purchase with respect to Occupation. So: visualization of Purchase with Occupation. I'm just going to copy the same code, paste it here, and change the column to Occupation.

Let's see the diagram. This one will be quite crowded, because there are many occupations. You can check which occupations are there; there are around 20 different occupations, and you can find them in the initial data set. You can make some observations from this as well.

Let's see what Occupation looks like. The occupations are mapped to numeric codes. Looking at this, I'd say it is also fairly uniform, so it won't affect the purchase a lot.

Let's now compare Product_Category_1 versus Purchase, to see how much people have bought in each category; you can see the categories if you go and look at the data set. I'm going to copy the bar plot code, paste it here and change the column to Product_Category_1. This shows the purchase amount for each Product_Category_1 value. So here you can see the graph for Product_Category_1.

Similarly, let's look at Product_Category_2. I don't know whether we can see both graphs in the same cell. I guess we will not be able to see two graphs.

No, only one is coming. Okay, I will remove this; I think it just gets replaced by the last plot. So I will execute Product_Category_1 in one cell, then Product_Category_2 in the next, and Product_Category_3 in the one after that. Observe these and come up with some conclusions, guys.

Here you can see values up to around 12,000; here, for Product_Category_2, they go up to around 14,000 to 16,000; whereas Product_Category_1 is bought the most, going up to about 20,000.

So you can definitely take that information out of these graphs. If there are any other graphs you want to propose, you can use them too. Tell me, guys: is this data set good enough for a model now? With the kind of data preprocessing we have done, I think we are good to go. We can also drop Product_ID.

Now let's do one last thing, which is feature scaling. First, the rows where Purchase is null will become my df_test, so I filter with df['Purchase'].isnull(). See, wherever the Purchase column is null, those rows all belong to the test data. Then, apart from isnull(), how do I find the rows where it is not null? If you negate the condition (for example with ~ in front of it), you get those rows, and you can assign them to df_train. So now you have your df_train and df_test.
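The split into df_train and df_test can be sketched like this. It is a toy illustration under the session's premise that rows with a missing Purchase came from the test file; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy combined frame: a missing Purchase marks a row that came from the test file.
df = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5],
    "Purchase": [8370.0, np.nan, 1422.0, np.nan, 7969.0],
})

# Rows without a target form the test set.
df_test = df[df["Purchase"].isnull()]

# "~" negates the boolean mask; df["Purchase"].notnull() is equivalent.
df_train = df[~df["Purchase"].isnull()]

print(len(df_train), len(df_test))
```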

So you have df_train and df_test. Now let's go to feature scaling. How do you do it? We apply StandardScaler. It's very simple: from sklearn.preprocessing I'm going to import StandardScaler, and then I'm going to write sc = StandardScaler(). And always remember, we fit it on the training data set, not on df_test.

Okay, before that, if you want to do a train test split, definitely go ahead and do it; I don't have any problem. You can split df_train into X_train, X_test, y_train and y_test; it's up to you.

So before scaling, let me write the code for the train test split on the training data. I'm going to search for "sklearn train test split". It is always fine to google it and copy and paste instead of typing it out, so I'm going to copy this and paste it here. You do the same thing too; don't just ask me, Krish, for the answers. I'm going to adapt it, but before that let me write from sklearn.model_selection import train_test_split. Now I need my X. My X will be df_train with a :-1 slice, that is, everything except the last column. I hope it works.

Let me check X.head(). We are just making our X and y so that we have our independent and dependent features. So this is my X. Similarly I will create my y. How do we get the last column? A plain :-1 slice gives everything except the last entry, and a ::-1 slice just reverses the order, so neither gives me the last column. Just a second. I can simply write df_train['Purchase'], so this is my y. Selecting with iloc[:, -1] will also work if Purchase is the last column. Just give me a second, guys. Yes, so this is my X and y.

so this is my x and y now what i'm going to do give it to my x

now what i'm going to do give it to my x and y here

and y here and here i will

and here i will get a error why

on input variable inconsistent number of samples comma 36 why

samples comma 36 why i made some mistake

x dot shape let's see

let's see this is basically having 12 rows that is

this is basically having 12 rows that is fine

fine what is this having

hey how come difference is there my mistake

my mistake it should be

it should be df underscore

df underscore train

train that was the mistake that i made

that was the mistake that i made fine now it will work

fine now it will work now i've got the same answer here i'll

now i've got the same answer here i'll basically go and execute it

I have written df_train, but I'm still getting this error. Why? The import of train_test_split from sklearn.model_selection is there, X and y are passed in, and y is also defined. Let me check y.shape too. The same... oh, there is one extra column. (Someone points out that Purchase is not the last column. Someone else says the screen is looking hazy; please reload the page.) Oh, Purchase is not the last column; that is the problem.

So I made a mistake here. What I can do instead is write df_train.drop('Purchase', axis=1). Now this will definitely work. Done: I have all my features here, and if I check X.shape I now have 11 columns. This is perfect. It happens. Now it is fixed, see.
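The pitfall above, slicing by position when the target is not the last column, can be reproduced on a toy frame. The column names are assumptions; the point is only that selecting by name does not depend on column order.

```python
import pandas as pd

# Toy training frame in which the target is deliberately not the last column.
df_train = pd.DataFrame({
    "Age": [1, 2, 3],
    "Purchase": [8370.0, 1422.0, 7969.0],
    "B": [0, 1, 0],
})

# Positional slicing silently keeps Purchase and drops B instead.
X_by_position = df_train.iloc[:, :-1]

# Selecting by name gives the intended features and target.
X = df_train.drop("Purchase", axis=1)
y = df_train["Purchase"]

print(list(X_by_position.columns), list(X.columns), y.shape)
```

drop() returns a new frame here, so df_train itself still contains Purchase afterwards.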

now it is fixed see now df underscore train instead of

now df underscore train instead of writing like this now i'm going to do

writing like this now i'm going to do fit

fit and

and fit transform on xtrain so finally i

fit transform on xtrain so finally i will write

will write sc.fit fit underscore transform

sc.fit fit underscore transform and here i'm basically going to write x

and here i'm basically going to write x underscore train

underscore train which will basically give me x

which will basically give me x underscore train is equal to this one

x underscore test is equal to sc dot

sc dot transform on

transform on x underscore y transform think over it

x underscore y transform think over it so let's execute this

so let's execute this and again it gives me an error why could

and again it gives me an error why could not okay let's drop one last thing

not okay let's drop one last thing from this

from this i think i could have dropped in df.train

i think i could have dropped in df.train only and df.test only so that drop it

only and df.test only so that drop it that will be an assignment to you all

that will be an assignment to you all i'm going to drop the

i'm going to drop the product id

product id in place is equal to true

in place is equal to true i don't want to get killed right now

product id this is this done
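The split and scaling steps can be put together in a small runnable sketch. The toy features, test_size and random_state are illustrative assumptions, not values from the session. It also suggests an answer to the "think it over" question: the scaler learns its mean and standard deviation from the training rows only and reuses them on the held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric features and target; the real frame has more columns.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(1, 8, size=100),
    "Occupation": rng.integers(0, 21, size=100),
})
y = pd.Series(rng.normal(9000.0, 5000.0, size=100), name="Purchase")

# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

sc = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
X_train_scaled = sc.fit_transform(X_train)
# transform reuses those training statistics on the held-out rows.
X_test_scaled = sc.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

All columns must be numeric before scaling, which is why a string column such as Product_ID has to be dropped or encoded first.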

Done, finished. Around 100 lines of code I've written in front of you, and we did the complete analysis. Now this is your data set: go and train your model. The next step is simply to train your model, that's it; you can also look at correlations and so on if you want.

I will name this file Black Friday, and I'll be giving you everything, including the feature engineering: I am uploading it to my GitHub so that you will be able to find it. Just a second, I'm uploading it. Okay guys, just reload the page and you will be able to see the link in the description.

Tomorrow we are going to take up a different data set and see how things go, so we'll continue the session tomorrow. Thank you everyone for joining; I hope you liked it. Have a great day. Bye bye, guys, keep on rocking. We'll see you tomorrow.

Hello guys, I hope everybody is able to hear me. Today we are going to solve flight price prediction, and we are going to do EDA plus feature engineering. I'm giving you the data set here. If you go and look at it, you will see Data_Train and Test_set; there will be two Excel files, and you have to download both. To download them, go to this link and download it as a zip file; inside the flight prediction folder you will find this data set. These are the two files we are going to take up, Data_Train and Test_set. This flight price prediction problem statement was given in a hackathon, and that is what we are going to solve here. Let's start.

Initially we'll start by importing some basic libraries. Quickly do it; we have already covered which libraries we require in the earlier session. I'll write import pandas as pd, import numpy as np, then import matplotlib.pyplot as plt, and then import seaborn as sns. We will also write %matplotlib inline.

Now guys, many people ask me what %matplotlib inline is used for. Suppose you want to show the diagram within the notebook without writing plt.show(). Then you can use %matplotlib inline: as soon as you plot anything, you don't have to write plt.show(), and the plot automatically gets shown right here.

Now, why have I specifically taken this data set? Because if we go and see this data set, there is something very interesting about it: it also has date-time information. With date-time information you have to be really careful whenever you are working with it. That is the reason I have specifically taken this one, because I wanted to show you data from different domain problem statements, so that you can see what challenges you may face. So as usual, first of all I'm going to import the training data set, for which I will write pd.read_csv, sorry, pd.read_excel. Let me just execute this one first. So read the data set like this, and here I give my Data_Train Excel file. If I go and see train_df.head(), you will be able to see this specific data set.

So here you have airline, date of journey, source, destination, route (given like this: Bangalore to Delhi), departure time, arrival time, duration, total stops, additional info and price. After this, I'll do the same thing for the test data set. So test_df, and here I will give the Test_set Excel file; this is the file name. If I display test_df.head(), here is my test data. Only one column will not be there, which is the last column that you can see, that is price. So both of these are done. I hope everybody is done.

Now, as usual after importing... "I did not try training the model." See, if you are getting a bad model score like 12 or 13 percent with the help of linear regression or other algorithms, try different algorithms. Other algorithms are also there, like decision tree regressor and random forest regressor, and you have XGBoost regressor. No one tried that? I don't know, you're just saying 12 and 13 for linear and lasso, and you're just keeping quiet. That is the problem with you all. I've taught all the machine learning algorithms previously; why don't you want to try other machine learning algorithms? Obviously linear regression fits a straight line, and there you have so many features, so your accuracy will be bad. If you don't get this much common sense, then trust me, cracking interviews will become difficult. How will you work in the real-world industry? So go and use different algorithms, and I always tell you to do hyperparameter tuning on top of it. "I just did linear regression, Krish sir, I got 12 percent, now tell me what to do, I don't want to do anything." Is that how you'll learn? Tomorrow you'll be given a problem statement; how will you do that? At that time Krish Naik will not come, right?

So let's do one thing. First of all, I'm going to combine train_df and test_df into another variable called final_df. In order to combine, I'll just write train_df.append(test_df). test_df is this data set and train_df is this data set. Once I do this, I can finally write final_df.head(). So what I'm doing is combining both the train and the test data. Remember, if I go and see the tail part, you will be able to see that you have some NaN values in the price. This is because of the test data set. This much I think you will be able to do.
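The combining step above can be sketched on two tiny stand-in frames. This is only an illustration under assumptions: the column names and values are made up, and it uses pd.concat rather than the train_df.append(test_df) call dictated in the session, because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Tiny stand-ins for the real train/test files (values are made up).
train_df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Date_of_Journey": ["24/03/2019", "1/05/2019"],
    "Price": [3897.0, 7662.0],
})
test_df = pd.DataFrame({
    "Airline": ["Jet Airways"],
    "Date_of_Journey": ["6/06/2019"],
})  # no Price column, as in the test set

# pd.concat stacks the rows; the test rows get NaN in the missing Price column.
# ignore_index=True renumbers the rows 0..n-1 (a choice made for this sketch).
final_df = pd.concat([train_df, test_df], ignore_index=True)

print(final_df.tail())
```

Because the test rows are at the bottom, the NaN prices show up in tail() rather than head().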

Appending the data sets gives us this combined one. Now see, the features look quite complex over here, because you have airline, date of journey, source, destination, route, then departure time, arrival time, duration, total stops and additional info. Many different types of columns are there, so a lot of feature engineering is required. Here I'm going to focus more on feature engineering, because we have already done extensive EDA. Now let's go ahead and try to do feature engineering on each and every field.

Now the first field that you see over here is date of journey. In this date of journey you obviously have a day, a month and a year. Let me just write final_df.info(). Here you can see that date of journey is an object, which basically means it is in string format, so we have to convert it into a date-time format. After converting it into a date-time format, I need to pick up this specific information: this will be my day, this will be my month and this will be my year. So from this particular field I have to create three more fields, which will specify my day, month and year. What we are doing here is creating derived features. Now tell me guys, from date of journey how do I create these three fields? Anyone? You can try it out and let me know; you can suggest some code for how we should go ahead with it. So here I'm starting my feature engineering process.
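One way to answer the question above is the date-time conversion just mentioned. The sketch below is an assumption-laden illustration, not the code from the session (which goes on to use a string split instead): the column name Date_of_Journey, the sample values and the day/month/year format string are all assumed.

```python
import pandas as pd

# Made-up sample values in day/month/year order.
final_df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "6/06/2019"]})

# Parse the strings into real datetimes; the format string is an assumption
# about how the dates are written in the data.
journey = pd.to_datetime(final_df["Date_of_Journey"], format="%d/%m/%Y")

# The .dt accessor exposes the day, month and year as integers.
final_df["Date"] = journey.dt.day
final_df["Month"] = journey.dt.month
final_df["Year"] = journey.dt.year

print(final_df)
```

With this route the derived columns are already integers, so no separate type conversion is needed afterwards.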

As I said, first I will try to derive some features; from this field I will definitely be able to take out day, month and year. How do we do it? I'm going to do it in a very simple way. I'm going to take final_df and create three features. I'll start with date, so one feature will be Date. The next feature that I'm going to create is Month, and the third feature is Year. These three features we need to derive and create. How do we do it? We already know that I have a feature called date of journey. From this date of journey I have to split. Split by using what character? By this specific character, the forward slash. If I split, I will get three important pieces of information, for example 6, 06 and 2019. For the date I need to focus on the first element, that is the zeroth index; for the month I need to focus on index 1; and for the year, 2019, I need to focus on index 2. So that is what I'm going to do over here. I'm going to write .str.split, because I have to treat the column as strings in order to do the split. After doing the split, if I run this code, let's see what happens. If I write [0] here, sorry, I'm still getting this whole information, so what I will do is also use .str[0]. Here you can see that I'm able to get all the dates. So in order to get the dates, I'm going to use this and at the end write .str[0]. This is the process that we can use to take out the date.

There is no need to convert it into a date-time type either, because once we get these values we'll convert them into integers. If I were doing a forecasting kind of task, at that point I might use it. Then for the month I just need to change the index to 1, and for the year I need to change the index to 2. So here I will be able to get date, month and year. Now if I execute final_df.head(2), I'll just see the top two records, and somewhere at the end you will be able to see that Date, Month and Year are created.
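The split described above can be sketched as follows. The sample values and the column name Date_of_Journey are assumptions for illustration; the new column names Date, Month and Year follow what is said in the session.

```python
import pandas as pd

# Made-up sample values in day/month/year order.
final_df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "6/06/2019"]})

# .str.split("/") turns every row into a list of three strings.
parts = final_df["Date_of_Journey"].str.split("/")

# .str[i] then picks element i out of each row's list.
final_df["Date"] = parts.str[0]
final_df["Month"] = parts.str[1]
final_df["Year"] = parts.str[2]

print(final_df.head(2))
```

Note that the three new columns still hold strings at this point; they only become numbers after an explicit conversion.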

This also works well: you can apply a lambda function, which is very good. I'm just going to copy and paste this code over here. This is also a very good technique, and you can definitely do it this way too. He has given this specific technique where he has used a lambda function, and this will also definitely work. I hope everybody is able to understand till here. You can use either of these pieces of code and go ahead with it, but applying a lambda function is a very good technique; very nice, efficient coding. It's all about googling and trying to find a better way that will definitely work.
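The viewer's lambda code is not shown in the transcript, so the following is only a guess at what such a version might look like. The column name and sample values are assumed, and the helper name pick is hypothetical.

```python
import pandas as pd

# Made-up sample values in day/month/year order.
final_df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "6/06/2019"]})

def pick(i):
    """Return a function that takes 'd/m/y' text and returns part i (hypothetical helper)."""
    return lambda text: text.split("/")[i]

# .apply calls the function once per row, on the plain Python string.
final_df["Date"] = final_df["Date_of_Journey"].apply(pick(0))
final_df["Month"] = final_df["Date_of_Journey"].apply(pick(1))
final_df["Year"] = final_df["Date_of_Journey"].apply(lambda text: text.split("/")[2])

# Same result as the .str.split(...).str[i] approach.
same_as_str_split = final_df["Date"].equals(final_df["Date_of_Journey"].str.split("/").str[0])
print(final_df.head(2), same_as_str_split)
```

Both approaches give the same strings; .apply with a lambda will fail on missing values, whereas the .str accessor passes them through as NaN.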

Now let's see what we have to do in the next step. It's simple: we also have to make sure that we convert these into integers, so Date, Month and Year need to be converted to integer. It's very simple. I will just write final_df['Date'] = final_df['Date'], and I'm going to convert this using astype(int). Then I'll copy this, paste it, and do it for Month and Year. But one mistake I'm making over here: I have to assign each result back to the same feature. So I'm just going to copy this; here I'll make it Month, and here I'll make it Year. Once we execute this, if I write final_df.info(), here you can see Date, Month and Year are now int32. Price is already float64; now we are starting to focus on the different features.
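A minimal sketch of the integer conversion, on made-up values. The exact integer width reported by info() (int32 or int64) depends on the platform and library versions, so the sketch does not assume one.

```python
import pandas as pd

# Made-up rows where the derived columns are still strings.
final_df = pd.DataFrame({
    "Date": ["24", "1"],
    "Month": ["03", "05"],
    "Year": ["2019", "2019"],
    "Price": [3897.0, None],
})

# Assign each converted column back to the same name,
# otherwise the original string column stays unchanged.
for col in ["Date", "Month", "Year"]:
    final_df[col] = final_df[col].astype(int)

final_df.info()
```

The loop is just a compact way of writing the three copy-pasted lines; leading zeros such as "03" convert cleanly to 3.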

So we have done this; let's go to the next feature now. Which one do you want to catch hold of as the next feature? Since we are done with this one, we'll do one more step: we will drop this particular feature, because I don't require date of journey now. I'll just write final_df.drop, and here I give my feature name, which is Date_of_Journey (I'll just copy it), with axis=1 and inplace=True. This we have already seen yesterday. Now if I go and see final_df.head(1), you can see Month and Year are there, Date is also there, but you don't have date of journey any more.
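The drop step can be sketched like this, again on made-up rows with an assumed column name.

```python
import pandas as pd

# Made-up rows: the original text column plus the three derived columns.
final_df = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "1/05/2019"],
    "Date": [24, 1],
    "Month": [3, 5],
    "Year": [2019, 2019],
})

# axis=1 means "drop a column" (axis=0 would look for a row label);
# inplace=True changes final_df itself instead of returning a copy.
final_df.drop("Date_of_Journey", axis=1, inplace=True)

print(final_df.head(1))
```

An equivalent spelling is final_df = final_df.drop(columns=["Date_of_Journey"]), which avoids inplace.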

Now let's go to the next feature. See, this is how we have to catch one feature at a time and make the necessary changes. For the next feature we could go with route; let's see what we can do for route, we'll also try to understand it. Okay, arrival time, route... let's wait for some time on route and focus on the arrival time or departure time.

So let's do one thing: let's focus on arrival or departure time first. We'll focus on one field and then on the similar types of fields. Always remember, when you are doing feature engineering, try to catch hold of similar types of fields together, where we have to do the same thing again and again. Let's go ahead and take up arrival time. From this arrival time, you obviously don't require information like "22 Mar". If I go and see around 10 records, you will be able to see that wherever there is this gap, this space, we can split on it. If we split using this space, I will be able to get the arrival time. My arrival time should be such that I only get this first part, the four digits of the time, which is the important information. Think it over: I don't require this "10 Jun" and so on, because there is already a date for that. I need to focus only on these first four values. So how do I do it? I will write final_df['Arrival_Time'].str.split(' '), splitting on a blank space. If I just execute this, you will see something like this. Out of all these things I just need to pick up the first value; in the first value I get all the important information, like 4:25 or 7:15. Only the first one I need to focus on, so to get the first one I will use indexing with .str[0]. If I execute this now, I will be able to get this particular value.

To do the same thing, there is also one nice piece of code using a lambda function. Here you can see .apply with a lambda. If I execute it (sorry, it should be final_df) you can see that I'm getting the same information. So I'm going to use this particular code and assign the result back to final_df['Arrival_Time']. You can use either piece of code. You will keep finding newer ways to do the same thing.
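Both variants of the arrival-time clean-up can be sketched as below. The column name Arrival_Time and the sample values are assumptions; the lambda is a guess at the viewer's version, which the transcript does not show.

```python
import pandas as pd

# Made-up values: some rows carry a trailing date, some do not.
final_df = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Variant 1: split on a space and keep the first piece (the clock time).
via_str = final_df["Arrival_Time"].str.split(" ").str[0]

# Variant 2: the same idea with apply and a lambda.
via_lambda = final_df["Arrival_Time"].apply(lambda text: text.split(" ")[0])

# Overwrite the column with the cleaned value.
final_df["Arrival_Time"] = via_str

print(final_df)
```

Rows without a trailing date are unaffected, because splitting "13:15" on a space still gives a one-element list whose first piece is the time.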

One thing that I forgot to check is whether the data has null values or not. Price has null values, which is okay; those come from the test data. Route has one null value and total stops has one null value. It may be the same row or it may be a different row, but total stops has one null value and route has one null value.
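A null check like the one described, plus a way to see whether the two missing values sit in the same row, might look like this. The data is made up so that they do share a row; whether that holds for the real data set is not stated in the session.

```python
import numpy as np
import pandas as pd

# Made-up rows: one row misses both Route and Total_Stops, one misses Price.
final_df = pd.DataFrame({
    "Route": ["BLR to DEL", None, "CCU to IXR to BBI to BLR"],
    "Total_Stops": ["non-stop", None, "2 stops"],
    "Price": [3897.0, 7662.0, np.nan],
})

# Number of missing values per column.
null_counts = final_df.isnull().sum()
print(null_counts)

# Rows where both Route and Total_Stops are missing at the same time.
both_missing = final_df[final_df["Route"].isnull() & final_df["Total_Stops"].isnull()]
print(both_missing)
```

If both_missing is empty on the real data, the two nulls are in different rows and have to be handled separately.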

Now from this arrival time we still have to take out... what do we need to take out from this arrival time, guys? Hour and minutes, right? That I will do in the next step. Here I'm going to write it on final_df with the same lambda function, or in an easier way you can just do the split. I will create two more features. Arrival_hour = final_df['Arrival_Time'], and then you can use .apply with a lambda or you can do .str.split. This split will now happen on the colon, because there is a colon between the hours and the minutes. When I split on the colon, it will be .str.split(':').str[0]; written like this it becomes my hour. Similarly, for the arrival minutes I can write the same thing with .str[1]. Done. If I now see final_df.head(1), you will see that you have Arrival_hour and Arrival_min. Remember, these are still of object type, so I also need to convert them into integer type. If I go up, I had written the code for that, so I'll just copy it here, write Arrival_hour and convert it with astype(int), and Arrival_min and convert it with astype(int). So both steps, including the conversion to int, are done here. If I execute it and write final_df.info(), you will now see that Arrival_hour and Arrival_min have been added as integer columns. So this is the code that I have written.
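The hour and minute extraction can be sketched in one go. Column names follow what is said in the session as far as the captions allow (Arrival_hour, Arrival_min); the sample values are made up.

```python
import pandas as pd

# Made-up clock times, already stripped of any trailing date.
final_df = pd.DataFrame({"Arrival_Time": ["01:10", "13:15", "04:25"]})

# Split "HH:MM" on the colon: element 0 is the hour, element 1 the minutes.
time_parts = final_df["Arrival_Time"].str.split(":")

# Convert straight to integers so no separate astype step is needed later.
final_df["Arrival_hour"] = time_parts.str[0].astype(int)
final_df["Arrival_min"] = time_parts.str[1].astype(int)

final_df.info()
```

Doing the split once and reusing time_parts avoids splitting the same column twice.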

After that you can drop the arrival time. Here I will write final_df.drop with Arrival_Time, axis=1 and inplace=True. Step by step we are doing it in a nice way, so I hope everybody is able to follow. Now if I see final_df.head(1), you will see that these new columns are also there.

things are also there okay uh what about

okay uh what about departure time i hope everybody will be

departure time i hope everybody will be able to do the same thing for the

able to do the same thing for the departure time just do it because

departure time just do it because departure is also having the same format

departure is also having the same format so i'm just going to copy all the code

so i'm just going to copy all the code paste it over here

paste this also over here now paste this to line also over here

and finally paste this also over here

paste this also over here and keep it with respect to

and keep it with respect to arrival time like that we had departure

arrival time like that we had departure time right

time right so i'm going to write departure

so i'm going to write departure time right depth time

time right depth time i'm going to copy this everywhere

paste paste

paste and here i'm going to basically write

and here i'm going to basically write the pt hour

dept hour

hour and this will be my dept minute

and this will be my dept minute so just by doing this i think everybody

so just by doing this i think everybody will be able to understand that we are

will be able to understand that we are going to change it now

done

oh an error is coming, let's see, it says something with base 10

oops this should be dept hour and dept minute

so i don't have to execute this again, i will just remove this and paste it

well done

so if i run final_df.info() now you will be able to see two more features getting added, dept hour and dept minute
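
The departure-time step above can be sketched as follows. This is a minimal sketch on a toy frame, not the notebook's exact code: the column name Dep_Time, the "HH:MM" format, and the new column names Dept_hour and Dept_min are assumptions.

```python
import pandas as pd

# Toy data; the "HH:MM" format and all column names are assumptions.
df = pd.DataFrame({"Dep_Time": ["22:20", "05:50", "09:25"]})

# Split "HH:MM" into two string columns, then cast each to int.
parts = df["Dep_Time"].str.split(":", expand=True)
df["Dept_hour"] = parts[0].astype(int)
df["Dept_min"] = parts[1].astype(int)

# The original text column is no longer needed once both parts exist.
df = df.drop("Dep_Time", axis=1)
print(df)
```

Casting with astype(int) fails with an "invalid literal for int() with base 10" error if the piece being cast still contains non-digit characters, which is probably the kind of error seen above.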

perfect we have done this, now we have to take care of all these other things, airline and all are still there

so departure is done, now let's pick up route

inside route you will be able to see information like bangalore to delhi

see anyhow, even if i find out what the route is, like route one two three four at the maximum

over here you can see that there are two places, bangalore is the origin and delhi is the destination

here you have four different places, that basically means first you are going from kolkata to IXR, then IXR to BBI, then BBI to bangalore, so the total number of stops you have is two

and in this particular case you just have one stop

so what we will do is try to capture route 1, route 2 and all the places along the way

in the source and destination you just have two values, and in the number of stops you have two, one, like that

so it is better that we get this specific information very clearly, so that we are able to see route 1, route 2, route 3, route 4 like that

one thing that you need to know over here is that you will definitely get null values, a lot of null values

but understand where the null values will be: if i want to capture route 4, definitely null values will be there
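
To see why the later route columns fill up with nulls, here is a small sketch. The " → " separator, the airport codes and the Route_1 to Route_4 names are assumptions used for illustration.

```python
import pandas as pd

# Two made-up routes: one direct, one with two intermediate airports.
routes = pd.Series(["BLR → DEL", "CCU → IXR → BBI → BLR"], name="Route")

# expand=True gives one column per leg; shorter routes are padded with nulls.
legs = routes.str.split(" → ", expand=True)
legs.columns = [f"Route_{i + 1}" for i in range(legs.shape[1])]

# Stops = airports listed minus the origin and the destination.
stops = legs.notna().sum(axis=1) - 2
print(legs)
print(stops.tolist())
```

The direct route has nulls in Route_3 and Route_4, which is the reason the stop count alone is often kept instead of the individual legs.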

okay instead of this, what we can also do is delete this and just focus on the total number of stops, that is Total_Stops

so what do you think, should we delete this specific feature directly, because we have the source and the destination and obviously we have the number of stops

but just think as a person, we really need to focus on two things

first of all, if i'm going from kolkata to bangalore through these two places, then the price might increase drastically, and not just on the basis of the number of stops

now in this particular case you can see from delhi to COK you have lucknow and bombay in between, so you may feel that probably a higher price will apply over there

so just see what we can do, we can definitely drop this route and just focus on total stops

but before focusing on total stops i'm going to write final_df['Total_Stops'].unique()

let's see how many total stops values are there

so here you have non-stop, and non-stop basically means a direct flight with no stop in between

so here you can replace this with 0, here you can replace with 2, here with 1, and then 3

and this nan value, i guess there is one null value, let me check with isnull().sum()

so here you can see one nan value, you can replace it with whichever value is required

so everybody focus, what we will try to do is map these values to 0 1 2 3 4 like that

someone tell me the code

amazing, so rishi has already written the code, using map

so here is my final_df['Total_Stops'] and on that i call map, where non-stop is 0, 1 stop is 1, 2 stops is 2, 3 stops is 3

for nan also you can put a value if you want, because there is only one nan value

wait, i can directly see which is that specific record for nan, just a second

sorry, what i can do is take final_df['Total_Stops'].isnull() and pass it inside final_df to take out those specific records

so here you can see route is nan and the total number of stops is also nan
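
The lookup just described is ordinary boolean-mask filtering. A minimal sketch on made-up rows (the values are illustrative, not taken from the real data set):

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has both Route and Total_Stops missing.
df = pd.DataFrame({
    "Source": ["Banglore", "Delhi", "Kolkata"],
    "Destination": ["New Delhi", "Cochin", "Banglore"],
    "Route": ["BLR → DEL", np.nan, "CCU → IXR → BBI → BLR"],
    "Total_Stops": ["non-stop", np.nan, "2 stops"],
})

# How many values are missing in the column.
missing_count = df["Total_Stops"].isnull().sum()

# The boolean mask selects exactly the rows where the value is missing.
missing_rows = df[df["Total_Stops"].isnull()]
print(missing_count)
print(missing_rows)
```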

route is also nan so here you can see from delhi to cochin

so here you can see from delhi to cochin okay delhi to coaching i don't think so

okay delhi to coaching i don't think so there will be a direct flight

there will be a direct flight but which value do you want to replace

but which value do you want to replace with since it is just a single record

with since it is just a single record i think it won't matter that much so let

i think it won't matter that much so let me do one thing let me just replace it

me do one thing let me just replace it with one stop

with one stop or

or just common sense i think for coaching

just common sense i think for coaching bangalore coaching at least one stop is

bangalore coaching at least one stop is required

required so like this i will just try to change

so like this i will just try to change it

it delete the coaching sorry

delete the coaching sorry so i have got executed now okay and now

so i have got executed now okay and now if i go and probably see my final

if i go and probably see my final underscore df

underscore df dot head you will be able to see the

dot head you will be able to see the specific values

specific values and

and here you can see total stops has been

here you can see total stops has been converted into integer floating value

now we can drop this route column, so final_df.drop('Route', axis=1, inplace=True), because i definitely don't require the same information twice

so finally you can see final_df.head()

here you have all the values
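
Putting the last few steps together, a minimal sketch could look like this. The label strings in the mapping are assumptions about how the data set spells them, and filling the missing value with fillna is a swapped-in alternative to handling it inside the mapping itself.

```python
import numpy as np
import pandas as pd

# Toy frame; label spellings such as "1 stop" are assumptions.
df = pd.DataFrame({
    "Route": ["a", "b", "c", "d", np.nan, "e"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop", "3 stops", np.nan, "4 stops"],
})

stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

# Labels not found in the mapping (including NaN) come back as NaN,
# which is why the mapped column shows up as float.
df["Total_Stops"] = df["Total_Stops"].map(stop_map)

# Single missing record: treat it as one stop, then cast back to int.
df["Total_Stops"] = df["Total_Stops"].fillna(1).astype(int)

# Route now carries no extra information we plan to use.
df = df.drop("Route", axis=1)
print(df)
```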

amazing, now what is the next thing that you want to do guys

we have handled everything so far, the departure hour is done, the original columns we have dropped, and total stops is also there

let's pick up any other one you want to do

additional info, that will be our normal feature engineering like transformation and encoding, and if any special character is there somewhere we have to catch hold of that

so if i write final_df['Additional_Info'].unique() we can see how many unique values are there

so here you can see this many unique values are there, and this can be converted into one hot encoded format because there are only a few unique values

let me just check if there is anything more that we can do with this data set, anyone who wants to do some more things, who wants to play with this data set and tear it apart

let me just see final_df.info()

now here you will be able to see all these are there, additional info is object, that is fine, and duration is still there

okay can we convert this duration into something else, say duration into minutes

i basically need to convert duration into minutes, so on all of these values i can apply a mathematical formula

i will just take this, let's say

come on try it out guys

so here i'm going to write duration

2 hours 50 minutes can be written as 2.50, this will also be a good way

but what if i convert the duration into minutes, that would actually be amazing

so here i'm going to take duration, and if i do split and take index zero, str of zero, split, no

if i split on this blank space i'll be getting 2h, and then i probably have to split it down further

okay h is there, and this is becoming a series right now

okay, a series does not have a split

someone is asking, sir can you split it with h, or just use replace, replace will work over here

see this becomes a series right now, and if i execute this i'm getting something like this

then if i write str of zero, no, this will also not work

come on anybody

this is a series guys, understand, we cannot call a string method on it directly

see if i go and check the type of this, it is a series

i can search in google

pandas series split

pandas provides a method to split a series, Series.str.split

so again i have to do .str.split

so here i'm going to write .str.split and here i'm going to use h

see i'm getting it right, and then i can again write .str[0]

so here i'm actually getting all the values

this should be converted into an integer

so here i'm able to get all this information
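
A small sketch of the point above: a Series has no split method of its own, so string operations go through the .str accessor. The duration strings are made up and assume an "Nh Nm" format.

```python
import pandas as pd

# Made-up durations in an assumed "Nh Nm" format.
duration = pd.Series(["2h 50m", "7h 25m", "19h"], name="Duration")

# A Series is not a string: duration.split("h") raises AttributeError.
series_has_split = hasattr(duration, "split")

# .str.split works element by element; .str[0] picks the first piece.
hours = duration.str.split("h").str[0].astype(int)
print(series_has_split)
print(hours.tolist())
```

Note that astype(int) only succeeds here because every value contains an hour part before the "h".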

okay this will basically give me the hours

if i want to convert this into minutes what do i have to do

now this is entirely a series, so to convert the type, astype can work

.astype, and no, an error is coming, it will not work, but .str[0] will work

so let's say i am assigning this to final_df['duration_hour']

if i execute this and check final_df, the duration hour i have actually got

so with the help of duration hour we will be able to do it

but you also have to get the minutes, because minutes are also very important

but before that i'm going to write final_df.info(), because i want to check whether it is still an object

so what i'm going to do is convert this with astype

hey guys, for me also it is the same thing, i am also facing the same difficulty that you face, but we need to think of an approach, and if you are able to think of an approach it will get solved

what is the error

for int there is 5m somewhere, definitely a 5m value is there somewhere

final_df[final_df['Duration'] == '5m']

okay five minutes, the duration is given as five minutes, this is the problem

but how come mumbai to hyderabad will take only five minutes

it is better we drop these records, it is not possible right

so tell me, if you want to remove this what do you have to do

this is the total duration right, so yeah we have to drop these records

okay tell me how to drop these records now

drop the row with axis zero, okay perfect

so if i write final_df.drop and here i give my index number

should i use iloc to drop it, no, because here it will ask for labels

so suppose if i give 6474 with axis=0, you'll be able to see that it gets executed

then let's say inplace is equal to 1, and the same thing i will do for 2660

oops, it received type int as input for the argument inplace, expected type boolean, so it has to be inplace=True

so executed, this is working fine, and now if i go and run the earlier check i'm getting an empty result
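
The clean-up above can be sketched like this. The index labels and durations are made up; the point is that drop works on index labels, not positions, and that inplace must be a real boolean.

```python
import pandas as pd

# Made-up rows with non-sequential index labels.
df = pd.DataFrame(
    {"Duration": ["2h 50m", "5m", "19h", "5m"]},
    index=[10, 20, 30, 40],
)

# Rows whose duration has no hour part at all.
bad = df[df["Duration"] == "5m"]

# inplace=1 is rejected because pandas expects a bool (tried on a copy).
inplace_error = None
try:
    df.copy().drop(bad.index, axis=0, inplace=1)
except ValueError as err:
    inplace_error = str(err)

# drop() takes index labels; axis=0 means rows.
cleaned = df.drop(bad.index, axis=0)

# With the odd rows gone, the hour part casts cleanly to int.
hours = cleaned["Duration"].str.split("h").str[0].astype(int)
print(cleaned.index.tolist(), hours.tolist())
```

Dropping is one option; keeping such rows and treating the missing hour part as zero is another, depending on whether a five-minute flight is considered a data error.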

so i have fixed this, now i will convert this with astype to int, done

and then i will multiply all of this by 60

so here you can see i'm able to get this in the form of minutes

or let it be in hours only, no problem, if you don't want to do it that is also fine, at least the hours are there

but if you are considering the minute part also, try to use that and convert it, that is given to you as an assignment

please try to do it for the minutes also, the same way i have done it for the hours

everybody you have to do it, okay, don't say that krish you did not do it in the class so we are not going to do it
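
For the minutes assignment, one possible approach (a sketch, not the method used in the session) is to pull the hour and minute parts out with regular expressions and treat a missing part as zero. The "Nh Nm" format is an assumption.

```python
import pandas as pd

# Made-up durations; some lack the minute part, one lacks the hour part.
duration = pd.Series(["2h 50m", "19h", "5m", "1h 5m"])

# str.extract returns NaN where the pattern does not match,
# so a missing part is filled with "0" before casting to int.
hours = duration.str.extract(r"(\d+)h", expand=False).fillna("0").astype(int)
minutes = duration.str.extract(r"(\d+)m", expand=False).fillna("0").astype(int)

total_minutes = hours * 60 + minutes
print(total_minutes.tolist())
```

Total minutes also avoids the ambiguity of writing 2 hours 50 minutes as 2.50, since as a decimal number of hours that duration is about 2.83.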

so here you have integer, integer, integer, price is float, additional info is object, and then you have duration

now we can drop the duration, final_df.drop('Duration', axis=1, inplace=True)

why the error, capital d, it should be Duration with a capital d, okay done

and then finally we have final_df.head(1)

head of one so here you can see i have all these

one so here you can see i have all these things remaining all have been converted

things remaining all have been converted remaining all are category features so

remaining all are category features so in order to do for the category features

in order to do for the category features one we need to do simple we will try to

one we need to do simple we will try to first of all see with respect to

first of all see with respect to airlines

airlines so

so uh

uh airline

airline dot

dot unique if i try to see this

unique if i try to see this how many are this specific airline

how many are this specific airline final underscore df

final underscore df so here you can see only this many airlines are there so we

only this many airlines are there so we will try to do label encoding for all of

will try to do label encoding for all of them now in order to do the label

them now in order to do the label encoding

encoding i will write from sk learn

i will write from sk learn dot pre-processing

dot pre-processing import label encoder

many people are asking, krish why are you not using get_dummies

get_dummies can also be done, but since we work with train and test data it is better to use the transform techniques

so here i'm going to write labelencoder = LabelEncoder()

and then you do it for every feature that you want, like airline, source, destination and additional info, so these four features

so here you write final_df['Airline'] = labelencoder.fit_transform(final_df['Airline'])

so like this i have written it for airline, now you do it for the other features also in the same way, then you have source, then destination, and finally additional info

once you do this, done, and if i check final_df.shape there are 14 columns, which is good enough

and if i want to see final_df.head(2) then you can see all these things, perfect
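
A minimal sketch of the label-encoding step on made-up values. It keeps one fitted encoder per column (an assumption, the session reuses a single encoder object) so that each one can later transform test data on its own.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Delhi"],
})

# One encoder per column, kept in a dict for later reuse.
encoders = {}
for col in ["Airline", "Source"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # classes are sorted, then numbered
    encoders[col] = enc

print(df)
print(list(encoders["Airline"].classes_))
```

The scikit-learn documentation describes LabelEncoder as meant for target labels; for input features OrdinalEncoder or OneHotEncoder is the usual choice, since integer codes imply an order that airlines or cities do not have.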

okay i have just done label encoding, you can also do the other type of encoding, that is one hot encoding

so one more step you can do is from sklearn.preprocessing import OneHotEncoder

no kevin, don't do it with get_dummies, because whenever we have test data we need to transform that test data, and we can save this encoder in the form of a pickle file

so ohe = OneHotEncoder()

and then you can do the same thing, ohe.fit_transform on airline, and then on all the other features

okay i'm getting some error, what is the error

reshape your data, okay i understood what the problem is

wait, i will execute it in front of you, till then just see what error we are getting

in fit_transform, if i execute this i will be getting expected 2d array

someone is suggesting np.ravel

the error is expected a 2d array, got a 1d array instead

i can understand this, i should not give this in the form of a series, that is the problem

so if i write final_df['Airline'] you can see that i'm getting a series, and this should not be a series

use two brackets like this

using the double brackets we are getting the output in compressed sparse row format

another way i can do it over here is np.array(final_df['Airline']).reshape(-1, 1)

so here will be airline, then source, then destination, and then additional info

but i hope you are able to understand

now the error says the length is ambiguous, use shape[0]

ah, this is one hot encoding we are doing, and label encoding is already done on these columns

wait wait, let's see final_df.head()

so this is one hot encoding, let me search for one hot encoding in sklearn and see the documentation

someone is saying you are encoding many times, no i did not encode many times, i encoded just one time

so after label encoding the values have got converted to this, and if you go and see final_df.info() you will be able to see that these are all converted into integer types

i know i should not have done the label encoding separately with fit_transform like this, instead i could have focused on the one hot encoder and it would have done it completely

but it's okay, let's do one simple thing then, if this is not working

i'm going to do final_df['Airline'].get_dummies, no, get_dummies is not there on the column

okay it is pd.get_dummies, sometimes it's very difficult to remember all the syntax

so pd.get_dummies(final_df['Airline']), let's go ahead and do this and you will be able to get it

try to create a different data frame, let's say this is df1, then i will create another data frame which is df2, where i will say pd.get_dummies on the next column of final_df that you want

which are the columns that you are working on, source, destination and additional info

will it work like this, someone has written it in one single line, this is also a very good way

you pass final_df, and the columns are airline, source, destination and additional info

and probably this will definitely work
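
The single-line approach can be sketched on a toy frame as follows. The values and the Price numbers are made up for illustration.

```python
import pandas as pd

# Made-up rows; Price values are illustrative only.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Delhi", "Kolkata", "Delhi"],
    "Price": [3897, 7662, 4107],
})

# columns= picks which columns to expand; drop_first=True drops the first
# category of each so the dummy columns are not perfectly redundant.
encoded = pd.get_dummies(df, columns=["Airline", "Source"], drop_first=True)
print(encoded)
```

Unlike a fitted encoder, get_dummies remembers nothing between calls, so train and test frames can end up with different columns unless they are encoded together or reindexed afterwards.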

so what he has done is written pd.get_dummies with final_df, columns with all these names, and drop_first=True

if i execute it, here are all the values that you will get

thank you all, have a great day ahead
