How to Do Data Exploration (step-by-step tutorial on real-life dataset) | Mısra Turp | YouTubeToText
Video Transcript
hey welcome in this video let's learn
how to do data exploration
so this will be the first video in a
series of videos where i will teach you
things about data so how to deal with data
and the next one will be about data
cleaning and this series is kind of
going to be like a small version of my
course hands-on data science
so if you're curious about that i will
leave a link in the description for you
to go check that out
i always get so excited when i am
starting a new project because i feel
like data exploration and
cleaning and just generally
understanding the data set is one of the
most fun things that you can do
it also gives me a really good idea to
like while i'm exploring the data set it
gives me a really good idea to how to
clean it later
and also what kind of features that i
can come up with when it is time for
feature engineering and
if i you know need to create some new
features so
just a small piece of
advice from me
keep these two questions in your mind
like how am i gonna clean it and how am
i going to create new features
in the back of your mind while you're
doing the exploration um
don't have to be very formal just some
things that you need to be
thinking about in the background so but
let's start with the technical part
so the first and foremost again i feel
like the second first and foremost thing
that i said but
in the technical side of things the
first thing that
you need to understand is when you find
a data set in the wild
out in the open on the
internet you
really can't understand a data set by
itself of course it could be a very
simple like i said a toy data set like a
small one
very easy to understand but most of the
time you're going to be seeing data
that has some column names that you
don't understand that is kind of
or it has values that you don't
understand it could have a
different unit the column column values
might have a different unit
you know the the way that things are
represented could be different so
likely there will be things that you
will not fully understand especially if it's
in a new domain that you've never worked
in before so
how we solve this is there is always
going to be some explanation files
for the data set that you're finding
just want to show you some examples like
you know it might look like a pdf
like this one and it basically tells me
for this specific data set that i found online
each field or column name and what they mean
what do the different values for
this column mean for example it says
it has a rate code id i mean this is a
data set i used for my course hands on
data science
it's a new york city taxi data set
so it's basically like all the trips
that happened in the
in new york in like a whole month or
something like that
uh so apparently this rate code you know
if it's one it means the
standard rate if it's two i guess it
means that
they're going to jfk or newark or you
know similar things or the payment type
you know if you just are looking at the
data you will not be able to understand
this so it's important to
see these explanation files and you're
always going to find them somewhere
so as i said it could be in a pdf format uh
it could be in an
html format so it will just generate
a website for you or a webpage
where you can see you know what does
this mean what does that mean
etc get some indication of like what
things mean
or it could also be just like random text
files so we can see this one for example
so this is just an explanation titled
the database sources
how can you use these etc etc and then here
uh the attributes so as i'm saying you know
the columns can be named a lot of things
so it could be columns fields
attributes features um yeah we just need
to kind of find this like table that explains
uh what each column is sometimes they
give you the data type
sometimes they give you the um the range
of the
values that's also possible uh
yeah this one is like that or you know
this one
is a different one that also has
information about the columns
so this is the first thing that you need to
find if you are working in a company
or if you're getting the data set from a
different team so you didn't find it
online let's say
in that case you first want to
understand the data set like kind of
open it on a jupyter notebook like this
read it
and try to understand as much as you can
and write down questions
about what you don't understand and then
go and find the people who are
responsible for this data set and ask
your questions
like this happened to me before and what
i did is basically like just reach out
to the person who sent me the data file
and then she was like yeah sure but
someone else sends me these files i
don't really know much about them so i
was like okay
can you give me the name of the person
who sends you these files and i talked
to that person and they were like
yeah but i just downloaded from a system
so i don't really know how it's
collected so then i'm like okay i'll
just go to the
people who created the system that
collects the data
let's say it's the engineering team so
then i go and talk to the engineering
team's head
and he points me to the person who set
up the system or who's responsible for
the system
that collects the data and then you know
i ask him my question so sometimes it
takes a little bit of
time to get the data to understand the
data etc and kind of feel comfortable
with it
but this is very important because
otherwise you know you don't know what
you're working with
the data set that we're going to work
with today is called the street tree
census data set so i don't know if
you're familiar with what a census is
it's basically when they go well what
they used to do is go from door to door
ask people how many people live in your
household how many of them are adults
how many of them are children
how many of them are working how much
money are you making so basically kind
of like
getting a lay of the land of like who
lives in a certain city or a country
but they did that for trees in new york
so um i might have mentioned this before
but there is a new york city open data platform
where you can download a lot of
different types of data sets uh
and this this is also where i found this
i think it's really amazing it's a
really cute data set
so let me first start with reading it on
my notebook and then we start
all right so this is a data set uh it
has a lot of columns as you can see
there are the three dots it means that
it hasn't even been able to
uh show me all the columns but there is
actually a solution to that
if you change the settings to
you know how many columns that you want
to show i mean this basically says for
pandas set the option
to maximum columns that will be shown to none
so then it will display all the columns
i don't want to set the
option to show all the rows because that
might be a little bit too much so i'll
change the setting and then
have it show me all the columns and yeah
i can see all the columns now
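A minimal sketch of that display setting, assuming pandas is imported as `pd`. The `wide` frame is a made-up stand-in for the real data set, which would normally be read from a CSV file.

```python
import pandas as pd

# None means "no limit": every column is shown instead of being cut off with "...".
pd.set_option("display.max_columns", None)

# Hypothetical stand-in for a data set with many columns.
wide = pd.DataFrame([list(range(40))], columns=[f"col_{i}" for i in range(40)])
print(wide)

# display.max_rows is left at its default so long tables stay truncated.
```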
so as you can see there are a lot of
columns and some of them
are not really obvious what they are so
for example this tree id okay pretty
easy block id
yeah you know i i kind of understand
what it means
created_at tree_dbh for example i don't
know what that is
uh so to understand what these things
are as i said we have the file
with us here very tiny font
size but at least we can understand what
things are so for example 3dbh
apparently is a diameter at
breast of the breast height of
tree so i guess the breast height of a
person uh
diameter of the tree measured
approximately 137 centimeters above the ground
ground
okay that's cool uh diameter of the stump
stump
i don't really know what a stump is so
when stuff like that happens
what you need to do is google tree stump
to see what it is tree stump surgery has
okay okay that's good to know
um uh
yeah yeah whether
tree is along or offset from the curb
tree status indicates whether the tree is
alive standing
dead or a stump tree health oh okay so
the tree might be a stump or not a stump
so if a tree is a stump i'm guessing
then we only get the
uh diameter of the stump otherwise we do
not have
that so this is an assumption that we're
making but we'll see if that is the case
or not
tree health indicates the user's
perception of tree health
scientific name common name of tree species
number of signs of stewardship observed
this indicates the number of
unique signs of stewardship observed for
this tree not recorded for stumps or dead trees
um
so i guess stewards are
people who are taking care of the tree
uh that's what i understand good
presence and type of tree guard sidewalk
damage immediately adjacent to the tree
category of users who collected this
tree point
all right so um this is good information
now we know more or less what these
things are
so let's look into our data set a little
bit in more detail
so at first it might seem a little bit overwhelming
to know where to start because you know
we have a lot of column names
and you're like wow this is going to
take a while to explore this data set to
understand what's going on
but i think one important thing is to
understand what your priorities are
so let's say generally i just want to understand
uh the situation in new york about trees
how much of them are healthy how much of
them are not
i just want to get a general feel of the
trees in new york state or new york
city sorry
um so then i wouldn't really be caring
about their location
so when that happens you know then you
can basically just get rid of
all the columns that are giving you some
information of like which state they're
in which borough they're in
uh which city which uh street they're in
et cetera
longitude and latitude of the tree etc
etc so
for now i can actually get rid of these
things and
i will make a list of all the columns
and uh to make it easy what i use
is the columns attribute it will give me
all the columns so it's easier for me to
delete the ones that i'm not interested in
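A rough sketch of listing the columns and dropping the location ones. The frame and the column names (`address`, `borough`, `latitude`, `longitude`) are illustrative assumptions, not necessarily the names in the real file.

```python
import pandas as pd

# Hypothetical miniature of the tree data; column names are assumed.
trees = pd.DataFrame({
    "tree_id": [1, 2],
    "tree_dbh": [3, 21],
    "health": ["Good", "Fair"],
    "address": ["a", "b"],
    "borough": ["Queens", "Bronx"],
    "latitude": [40.7, 40.8],
    "longitude": [-73.8, -73.9],
})

# .columns lists every column name, which makes it easy to copy and prune.
print(list(trees.columns))

# Drop the location columns we decided not to explore for now.
location_cols = ["address", "borough", "latitude", "longitude"]
trees_small = trees.drop(columns=location_cols)
print(list(trees_small.columns))
```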
all right so this is this is more
manageable um these are all basically
if there is a problem caused by a stone
on the root if there's a problem on the
root caused by a grate or other
same with the trunk and same with branches
that's easy to understand the latin name
of the tree
if the person who was collecting the
data thought it was healthy or not
the status of the tree if there's damage
on the sidewalk right next to the tree
if there are any other problems
um okay this is good i can work with this
so one of the first things that i want
to look at is the numerical
values so the numerical values that we
only have are the
diameter of the tree and diameter of the stump
but as i figured out if this tree is a
stump so if it's cut already
then we do not have the diameter for it
which makes sense
but actually even before that i want to
see if there are any null values
this is how i can see if there are any
missing values or not so it looks like
with health
we have a bunch of missing values so how
many rows do we have 683 thousand
rows and out of the 683 thousand
31 thousand are missing
for these columns so i want to see what
it looks like when these values are missing
so i'm going to say show me all the ones where
all right so it looks like they're kind
of like none
for the same things i can decide to
remove these values later or not that's
kind of up to me at this point
but you know i'm just exploring i'm not
doing i'm not taking any action so this
is something that you can take note of
you can say hey there are a lot of
missing values
i just want to get rid of these ones
okay this is good
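The missing-value check described here could look roughly like this. The toy frame and its values are invented for illustration; only the `isnull().sum()` pattern and the boolean filter are the point.

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing value.
trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive"],
    "health": ["Good", None, None, "Fair"],
    "sidewalk": ["NoDamage", None, None, "Damage"],
})

# Number of missing values in each column.
missing_per_column = trees.isnull().sum()
print(missing_per_column)

# Look at the rows themselves where health is missing.
missing_health = trees[trees["health"].isnull()]
print(missing_health)
```

This only inspects the rows; nothing is removed at the exploration stage.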
um so let's do describe to get a general feel
okay so as i said we only have two
numerical values well the first one tree
id is
actually a categorical value because
it's the id is not a continuous value
but it currently sees it as numerical
value and we can see that if we do dtypes
yeah most of them are strings but the
first one is an integer
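A small sketch of `describe` and `dtypes` on a made-up frame. Converting `tree_id` to a string is one possible way (an assumption, not shown in the video) to stop it from being summarized as a number.

```python
import pandas as pd

# Hypothetical rows; the id is stored as an integer even though it is just a label.
trees = pd.DataFrame({
    "tree_id": [180683, 200540, 204026],
    "tree_dbh": [3, 21, 3],
    "stump_diam": [0, 0, 0],
    "health": ["Fair", "Fair", "Good"],
})

print(trees.dtypes)      # tree_id shows up as an integer column
print(trees.describe())  # so describe() reports a mean, std, etc. for it

# Treat the id as a label so it drops out of the numeric summary.
trees["tree_id"] = trees["tree_id"].astype(str)
summary = trees.describe()
print(summary)
```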
it's an integer because it's the id of the
tree but it actually
uh should be a categorical value so what
i'm looking at is here the tree
diameter and stump diameter so
yeah we have these many values the mean
is 11
centimeters standard deviation is 8 centimeters
um okay
these are some tiny trees and maximum is 450
that's uh really
let me read the description of the tree diameter again
data was collected for both living and dead trees um
let me
make it closer because measuring tapes
are more accessible than forestry
specific measuring tapes designed to
measure diameter users originally
measured tree circumference in the field
to better match other forestry data sets
the circumference value was subsequently
divided by pi
to transform into diameter both the
field measurement and processed value were rounded to
the nearest whole inch so okay
it's not a centimeter it's an inch so
this is important knowledge for us
i don't know what an inch is so one inch
in centimeters is 2.54
so that actually is like even more in
centimeters so if it's like
450 inches that's like
a lot of centimeters
um okay more than 1000 centimeters
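The unit arithmetic can be written out as a quick check. The helper names below are hypothetical; the only facts used are 1 inch = 2.54 cm and diameter = circumference / pi.

```python
import math

CM_PER_INCH = 2.54

def inches_to_cm(inches):
    """Convert a length in inches to centimeters."""
    return inches * CM_PER_INCH

def circumference_to_diameter(circumference):
    """Diameter of a circle with the given circumference (same unit)."""
    return circumference / math.pi

# The maximum recorded diameter of 450 inches, expressed in centimeters.
max_dbh_cm = inches_to_cm(450)
print(round(max_dbh_cm))  # roughly 1143 cm, i.e. over 11 meters

# If 450 were really a circumference that was never converted,
# the diameter would be about 143 inches.
print(round(circumference_to_diameter(450)))
```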
it doesn't sound likely but let's
take a closer look at this actually so
what i'm going to do is i'm going to create
some histograms because i want to see
the distribution of these values so
let's say hist
and i already want to make the bins a
little bit bigger
yeah it's bins and also want to make the
figure size
all right it's interesting so again i'm not
looking at the tree id as i said it's
not really something that's uh
important stump diameter
uh it's close to zero most of the time
but then
uh i don't know if you watch some other
videos of mine but i think i mentioned
this in the hands on data science course
if you can see this value here that
actually means that there needs to be a
value here because that's how the
histograms are being made so
if the maximum value was 60 the histogram
range the x-axis range would be from 0
to 60 but
if it goes all the way to 140 it means
there is a value somewhere here even if
it's just one
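The point about the x-axis range can be checked numerically. This sketch uses NumPy's `histogram` on invented stump diameters instead of drawing a plot: the bin edges always stretch from the smallest to the largest value, so a single extreme value sets the range.

```python
import numpy as np

# Made-up data: 95 zeros, a few small stumps, and one extreme value of 140.
stump_diam = np.array([0] * 95 + [10, 12, 18, 25, 140])

# Seven equal-width bins between the minimum and the maximum of the data.
counts, edges = np.histogram(stump_diam, bins=7)
print(edges)   # the edges run from 0 up to 140
print(counts)  # almost everything is in the first bin, one value in the last
```

A plotted histogram behaves the same way, which is why an almost empty right-hand side still signals at least one value out there.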
uh same thing with the tree diameter
so i think something went wrong there
probably when someone was trying to
you know put the value in they
actually wanted to write like 45 but
accidentally wrote like 450
so we don't know that so i also want to
see like how many values
are here or how many values are here so
it looks like a logical way to cut this
off is maybe like 40 here because as you
can see there is still some values here
and also maybe like 100 inches here
even that sounds like unreasonable to me
to be honest but
um so let's see let me let me visualize
that a little bit so
give me all the trees where the tree
diameter is
bigger than we're going to say like 50
to be honest
so we have 300 values
that is where the diameter is calculated
or measured to be
bigger than 50 inches uh
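Filtering for the suspiciously large trees is a one-line boolean mask. The values below are invented; the threshold of 50 inches is the one mentioned in the video.

```python
import pandas as pd

# Hypothetical diameters in inches, including a few extreme ones.
trees = pd.DataFrame({"tree_dbh": [3, 11, 21, 45, 52, 120, 450]})

# Keep only the rows where the diameter is above the chosen cut-off.
big = trees[trees["tree_dbh"] > 50]
print(len(big))
print(big)
```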
interesting so what if i visualize this
okay this is more or less what i
expected but let me make this figure a
yeah there are some here which i guess is
expected but like
especially after here it's kind of like
you know there are one or two trees
this uh seems very silly to me that
there will be a tree
whose diameter is 450. i mean
yeah even a circumference
of 250 inches sounds like
ridiculous uh but yeah this is just some
information that we have right now
good to know good to know um we can do
the same thing with the stump diameter
all right similar similar thing i guess
maybe until up until this point it's acceptable
maybe they forgot to change it to
diameter they put in the
circumference and yeah and these values
are kind of like
not really correct could be
uh yeah all right i mean i guess 140 inches
is not unreasonable if it's uh
the circumference and they forgot to put
it into diameter otherwise it should be
like a very big tree right if the
diameter is
140 inches and that makes like a lot of
centimeters that makes like three meters
or something
a three meter diameter is kind of like
yeah that would be a bit too big i guess
i'm just trying to like work it
in my head you know i'm trying to
understand if that's like actually
unreasonable or not
but yeah when we move to the cleaning
bit i guess one thing that we would need
to ask ourselves is
is this really unreasonable is this not
really unreasonable and what can we do
with this data point
or data points um
what else do we want to look at i want
to see what are the possible options for
some of these
other columns so latin name probably
there are a lot of different latin names
but you know it could be interesting to see
the distribution of different names so
okay i guess this tree is very common
and then we have less common trees here
and one thing you can do to visualize this
is to turn these counts into a data frame
and then plot it or maybe
not histogram i guess plot
let me try a bar chart that should work
yeah okay cool yeah i mean not the most readable
chart in the world but at least it gives
you an understanding of uh you know
how many trees there are and which types
there are and this is kind of expected
right you would expect some trees to be
like very common and then as you go it's
like less and less common
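Counting the species and turning the counts into a data frame might look like this. The column name `spc_latin` and the rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical species column with one very common tree.
trees = pd.DataFrame({
    "spc_latin": ["Platanus x acerifolia"] * 3
    + ["Gleditsia triacanthos"] * 2
    + ["Quercus palustris"]
})

# value_counts sorts from most to least common.
counts = trees["spc_latin"].value_counts()

# As a data frame the counts are easy to plot or export.
counts_df = counts.to_frame(name="n_trees")
print(counts_df)

# With matplotlib installed, counts_df.plot(kind="bar") would draw the bar chart.
```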
all right what else can we look at so um
i saw that yeah there are stewards right
some of the values for the stewards are missing
uh sidewalk is missing a problem is missing
but i want to see like what are the
options for
stewards so we can see here one or two
or none
but what are what are some options
okay so it's either none one or two
three or four four or more
this is good it looks very um
standardized so you know if everyone had
to write it down by themselves you can
see someone writing one or two
someone else writing like this someone
else writing like this so
it's it's possible it's good to see that
it's clean
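One way to list the possible options for several categorical columns at once. The column names and category labels below are assumed for the sketch and may not match the real file exactly.

```python
import pandas as pd

# Hypothetical categorical columns with a missing value in two of them.
trees = pd.DataFrame({
    "steward": ["None", "1or2", "3or4", "4orMore", None],
    "sidewalk": ["NoDamage", "Damage", "NoDamage", "NoDamage", None],
    "curb_loc": ["OnCurb", "OffsetFromCurb", "OnCurb", "OnCurb", "OnCurb"],
})

# Distinct non-missing values per column; inconsistent spellings would show up here.
options = {col: sorted(trees[col].dropna().unique()) for col in trees.columns}
for col, values in options.items():
    print(col, values)
```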
uh i want to see the possibilities for sidewalk
and no damage or damage okay good
um i guess for these ones it's either
yes or no
i'm guessing this could be like a you
know website where they fill in a form
so this looks pretty standard
um i want to see the status and curb
location also
let's do it quickly uh-huh
curb loc was it
no what was it curb_loc with the underscore
on curb or offset from the curb okay cool
doesn't seem to be a need for uh extra
correction there um
all right one last thing that i want to
look into is
if there is um some mismatch between you
know this tree being a stump
and uh the health of the tree for example
if there is any point where it says it's
a stump but it says like health is
good or something like that so
where uh what is the name of that column
uh okay then you know then i can say stumps
so this is like a new subset and
it's all the values none for the stumps
no oh so maybe these like stewards
sidewalk problems
health stuff are none for all the stumps
and all the dead trees so let me
so you know there are 17,654
stumps it says and
13,961 dead trees that amounts to
31,615 uh
total dead or stump trees
so basically how many values did we miss
yeah it's more or less the same so i guess
what happened is if the tree is not
alive they either
didn't bother or they didn't think it
was relevant to fill in the information
for the rest for the health latin name steward
sidewalk problems etc and probably for
these ones to just like put zero or
something like that
okay so that's that's good to know
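The assumption that health is missing exactly for stumps and dead trees can be tested directly. The rows are invented; the two counts in the last lines are the ones quoted in the video.

```python
import pandas as pd

# Hypothetical rows: health is only filled in for living trees.
trees = pd.DataFrame({
    "status": ["Alive", "Alive", "Stump", "Dead", "Stump"],
    "health": ["Good", "Poor", None, None, None],
})

not_alive = trees["status"].isin(["Stump", "Dead"])
health_missing = trees["health"].isnull()

# Cross-tabulate status against "is health missing" to see the overlap.
print(pd.crosstab(trees["status"], health_missing))

# True only if the two masks agree on every row.
same_rows = bool((not_alive == health_missing).all())
print(same_rows)

# The counts from the video add up to the number of missing values.
total_not_alive = 17654 + 13961
print(total_not_alive)
```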
um and also one other thing that i want
to look at is
okay so what i wanted to see is
actually how many yeses and how many
nos there are for each of these columns
uh of course it's going to seem really
instant to you that i
achieved this but i actually took like
20 minutes or something looking online
to see as you can see from my
searches here um how i can see this
information so basically
uh i found that out so what i need to do
is just
assign this to a data set what should we
call it um
so this is a data set and apparently
you just apply the value counts function
to each of the series
and then you're able to see the values
for all of them so
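A minimal sketch of applying `value_counts` to every column at once. The problem columns and their values are made up; filling the gaps with 0 is an extra step added here for readability.

```python
import pandas as pd

# Hypothetical yes/no problem columns.
problems = pd.DataFrame({
    "root_stone": ["Yes", "No", "Yes", "No"],
    "root_grate": ["No", "No", "No", "Yes"],
    "trunk_wire": ["No", "No", "No", "No"],
})

# apply runs value_counts on each column (each column is a Series);
# a value that never occurs in a column comes back as NaN, so fill it with 0.
yes_no_counts = problems.apply(pd.Series.value_counts).fillna(0).astype(int)
print(yes_no_counts)
```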
okay so let's see it looks like root stone
problems seem to exist a lot of the time
uh yeah no other problem exists that much
so yeah so problems caused
on the root by stones is a big problem
uh yeah this is just good information to
have you know these are also things like
when you're starting a project these are
some information some statistics that
you can give to whoever is responsible
for the project or if you just want to
like kind of
show that you are progressing with the
problem or with the project
these are some good information uh to
show to people you can also turn this
into like a visualization and show it
that way
if these are relevant things to you okay
so we what we did is
we first looked into what are the
possible features
columns that we can use we decided that
we don't really want to get into the
details of where the tree is located
which borough is responsible from the
tree which street it's on
uh so we
didn't look into that because we decided
that's not important for
our purposes um we looked into the
columns that this data set has
we chose the relevant columns only on
these relevant columns we first looked into
the missing values uh again we haven't
done anything with these things yet
that will be in the next video and we
looked into
um you know if these missing values
happen all at the same time or not
then we looked into the uh numeric
values and how they're distributed in general
we looked into the distribution the
histogram and saw that some of the values
are looking a little bit suspicious so
we went deeper in those values
and looked at how many of them are this
outrageously high so
if you saw a really um
even distribution here that there are a
lot of dots here and there
then you might think that okay maybe
this is normal to have but when you see
there are only a bunch here the kind of
like outliers
then you can decide okay maybe this was
a mistaken
uh input by whoever was collecting data
same with stumps that we saw this so
uh and then we looked into some names
how the distribution of different types
of trees
is in new york city and
some just to make sure that all the
categorical values are standard and
there is no
different terminology you know as i said
from one to two there's not like one
dash two or anything like that so
just to make sure that the values are
standardized here we looked at
some of the or most of the columns
and we actually figured out that when it's a stump
or when it's a dead tree the information
on health
latin names whether it has a steward
or not
uh whether it's on the sidewalk or not etc
if there are any problems with the tree
it has not been
recorded this is a good thing to know
about the
data set and we saw that
the basically the distribution or how
many problems what kind of problems
there are
on trees so this is a good place to stop
and from now on what we're going to do