so at first it might seem a little bit overwhelming
to know where to start because you know
we have a lot of column names
and you're like wow this is going to
take a while to explore this data set to
understand what's going on
but i think one important thing is to
understand what your priorities are
so let's say generally i just want to understand
uh the situation in new york about trees
how many of them are healthy how many of
them are not
i just want to get a general feel of the
trees in new york state or, sorry, new york
city
um so then i wouldn't really be caring
about their location
so when that happens you know then you
can basically just get rid of
all the columns that are giving you some
information of like which state they're
in which borough they're in
uh which city which uh street they're in
et cetera
longitude and latitude of the tree etc
etc so
for now i can actually get rid of these
things and
i will make a list of all the columns
and uh to make it easy what i use
is the columns attribute, it will give me
all the columns so it's easier for me to
delete the ones that i'm not interested in
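A minimal sketch of that step, assuming the data is loaded in a pandas DataFrame. The tiny table and the column names below are made up for illustration; the real file has many more columns.

```python
import pandas as pd

# Tiny stand-in for the tree table; names and values are assumptions.
trees = pd.DataFrame({
    "tree_id": [1, 2, 3],
    "tree_dbh": [3, 21, 11],
    "health": ["Good", "Fair", None],
    "borough": ["Queens", "Bronx", "Queens"],
    "state": ["New York", "New York", "New York"],
    "latitude": [40.72, 40.84, 40.70],
    "longitude": [-73.84, -73.88, -73.80],
})

# Printing the column list makes it easy to copy the names to delete.
print(list(trees.columns))

# Hypothetical list of location columns we decided not to explore for now.
location_cols = ["borough", "state", "latitude", "longitude"]
trees_small = trees.drop(columns=location_cols)
print(list(trees_small.columns))
```

Dropping into a new variable keeps the original table around in case the location columns become relevant later.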
all right so this is this is more
manageable um these are all basically
whether there is a problem caused by a stone
on the root, whether there's a problem on the
root caused by a grate or something else
same with the trunk and same with the branches
that's easy to understand, the latin name
of the tree
whether the person who was collecting the
data thought it was healthy or not
the status of the tree, whether there's damage
on the sidewalk right next to the tree
whether there are any other problems
um okay this is good i can work with this
so one of the first things that i want
to look at is the numerical
values so the only numerical values that we
have are the
diameter of the tree and diameter of the stump
but as i figured out if this tree is a
stump so if it's cut already
then we do not have the diameter for it
which makes sense
but actually even before that i want to
see if there are any null values
this is how i can see if there are any
missing values or not so it looks like
with health
we have a bunch of missing values so how
many values do we have, about 683,000
rows and out of the 683,000
about 31,000 are missing
for these columns so i want to see what
it looks like when these values are missing
so i'm going to say show me all the ones where
all right so it looks like they're kind
of like none
in the same rows, i can decide to
remove these values later or not that's
kind of up to me at this point
but you know i'm just exploring
i'm not taking any action so this
is something that you can take note of
you can say hey there are a lot of
missing values
i just want to get rid of these ones
okay this is good
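Here is a rough sketch of the missing-value check, assuming pandas. The small table and its column names are invented stand-ins, so the counts are not the real ones.

```python
import pandas as pd

# Made-up sample; "None" (a string) is a category, None is a missing value.
trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive"],
    "health": ["Good", None, None, "Fair"],
    "steward": ["None", None, None, "1or2"],
    "tree_dbh": [3, 0, 12, 21],
})

# Number of missing values in each column.
missing_per_column = trees.isnull().sum()
print(missing_per_column)

# Show only the rows where health is missing, to see what else is empty there.
missing_health = trees[trees["health"].isnull()]
print(missing_health)
```

At this stage nothing is removed; the output is just something to take note of for the cleaning step.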
um so let's do describe to get a general feel
okay so as i said we only have two
numerical values well the first one tree
id is
actually a categorical value because
the id is not a continuous value
but it's currently seen as a numerical
value and we can see that if we do dtypes
yeah most of them are strings but the
first one is an integer
so it's stored as an integer because it's the id of the
tree but it actually
uh should be a categorical value so what
i'm looking at here is the tree
diameter and stump diameter so
yeah we have these many values the mean
is 11
centimeters standard deviation is 8 centimeters
um okay
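A small sketch of the describe and dtypes step, assuming pandas. The column names `tree_id`, `tree_dbh` and `stump_diam` and all values are assumptions for illustration. Casting the id to a string is one possible way to stop it from being summarized as a number.

```python
import pandas as pd

# Made-up rows; names and values are illustrative only.
trees = pd.DataFrame({
    "tree_id": [180683, 200540, 204026, 204337],
    "tree_dbh": [3, 21, 3, 10],
    "stump_diam": [0, 0, 0, 0],
    "health": ["Fair", "Fair", "Good", "Good"],
})

print(trees.dtypes)      # tree_id shows up as an integer column
print(trees.describe())  # so describe() summarizes it like a measurement

# Treat the id as a label instead of a quantity.
trees["tree_id"] = trees["tree_id"].astype(str)
summary = trees.describe()  # numeric columns only by default
print(summary.columns.tolist())
```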
these are some tiny trees and maximum is 450
that's uh really
the user collected another diameter of
the height of the tree
integer diameters we measured and
close to both living trees um
types are more accessible than forestry
specific let me
make it close because measuring tapes
are more accessible than forestry
specific measuring tapes designed to
measure diameter users originally
measured tree circumference in the field
to better match other forestry data sets
the circumference value was subsequently
divided by pi
to transform it into diameter, both the
field measurement
and processed value were rounded to
the nearest whole inch so okay
it's not a centimeter it's an inch so
this is important knowledge for us
i don't know what an inch is so one inch
in centimeters is 2.54
so that actually is like even more in
centimeters so if it's like
450 inches that's like
a lot of centimeters
um okay more than 1000 centimeters
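The conversion can be worked out in a couple of lines. This is plain arithmetic; the only inputs are the 2.54 factor and the maximum of 450 mentioned above, and the function name is just a hypothetical helper.

```python
CM_PER_INCH = 2.54  # one inch is defined as exactly 2.54 cm


def inches_to_cm(inches):
    """Convert a length in inches to centimeters."""
    return inches * CM_PER_INCH


max_dbh_in = 450
max_dbh_cm = inches_to_cm(max_dbh_in)
print(round(max_dbh_cm), "cm")         # roughly 1143 cm
print(round(max_dbh_cm / 100, 1), "m")  # roughly 11.4 m
```

A trunk more than eleven meters across is why this maximum looks like a data entry problem and not a real tree.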
it doesn't sound likely but let's
take a closer look at this actually so
what i'm going to do is i'm going to create
some histograms because i want to see
the distribution of these values so
let's say hist
and i already want to make the bins a
little bit bigger
yeah bins, and i also want to set the
figure size
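A sketch of the histogram call, assuming pandas with matplotlib installed. The bin count and figure size are guesses, since the exact values typed on screen are not in the transcript, and the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs without a display
import pandas as pd

# Made-up diameters in inches, with one extreme value in each column.
trees = pd.DataFrame({
    "tree_dbh": [3, 21, 3, 10, 11, 8, 450, 5, 14, 30],
    "stump_diam": [0, 0, 0, 0, 0, 0, 0, 0, 140, 0],
})

# One histogram per numeric column; bins and figsize are assumed values.
axes = trees.hist(bins=60, figsize=(20, 10))
print(axes.shape)
```

In a notebook the plots appear directly under the cell, so the `Agg` line is only needed when running this as a script.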
all right it's interesting so again i'm not
looking at the tree id as i said it's
not really something that's uh
important stump diameter
uh it's close to zero most of the time
but then
uh i don't know if you watch some other
videos of mine but i think i mentioned
this in the hands on data science course
if you can see this value here on the x-axis that
actually means that there needs to be a
value here because that's how
histograms are being made so
if the maximum value was 60 the histogram
range, the x-axis range, would be from 0
to 60 but
if it goes all the way to 140 that means
there is a value somewhere here even if
it's just one
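To make the point about the axis range concrete, here is a small sketch using `numpy.histogram`, which by default spans the bins from the smallest to the largest value in the data. The numbers are made up.

```python
import numpy as np

# Made-up stump diameters in inches with a single extreme value.
stump_diam = np.array([0, 0, 0, 4, 12, 25, 60, 140])

counts, edges = np.histogram(stump_diam, bins=7)

# The outer bin edges are the minimum and maximum of the data.
print(edges[0], edges[-1])
# The last bin is not empty, even though it may hold only one value.
print(counts[-1])
```

So an x-axis that stretches far past the visible bars is a hint to go looking for a handful of extreme rows.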
uh same thing with the tree diameter
so i think something went wrong there
probably when someone was trying to
you know put the value in they
actually wanted to write like 45 but
accidentally wrote like 450
so we don't know that so i also want to
see like how many values
are here or how many values are here so
it looks like a logical way to cut this
off is maybe like 40 here because as you
can see there are still some values here
and also maybe like 100 inches here
even that sounds like unreasonable to me
to be honest but
um so let's see let me let me visualize
that a little bit so
give me all the trees where the tree
diameter is
bigger than we're going to say like 50
to be honest
so we have 300 values
where the diameter is calculated
or measured to be
bigger than 50 inches uh
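The filter described here could look like the following sketch, assuming pandas and a diameter column called `tree_dbh` (an assumed name). The rows are invented, so the count is not the 300 from the real data.

```python
import pandas as pd

# Made-up rows; diameters are in inches.
trees = pd.DataFrame({
    "tree_id": [1, 2, 3, 4, 5],
    "tree_dbh": [3, 51, 450, 12, 50],
})

threshold = 50  # the cut-off chosen while exploring
big_trees = trees[trees["tree_dbh"] > threshold]

print(len(big_trees))  # how many trees are above the threshold
print(big_trees.sort_values("tree_dbh", ascending=False))
```

Note that `>` excludes trees that are exactly 50 inches.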
interesting so what if i visualize this
okay this is more or less what i
expected but let me make this figure a
yeah there are some here which i guess
expected but like
especially after here it's kind of like
you know there are one or two trees
it uh seems very silly to me that
there would be a tree
whose diameter is 450, i mean
yeah even a circumference
of 250 inches sounds
ridiculous uh but yeah this is just some
information that we have right now
good to know good to know um we can do
the same thing with
all right similar similar thing i guess
maybe until up until this point it's acceptable
maybe they forgot to change it to
diameter they put in the
circumference and yeah and these values
are kind of like
not really correct could be
uh yeah all right i mean i guess 140 inches
is not unreasonable if it's uh
the circumference and they forgot to put
it into diameter otherwise it should be
like a very big tree right if the
diameter is
140 inches and that makes like a lot of
centimeters, that makes like three and a half meters
or something
a three and a half meter diameter is kind of like
yeah that would be a bit too big i guess
i'm just trying to like work it
in my head you know i'm trying to
understand if that's like actually
unreasonable or not
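The head math can be checked quickly. This sketch compares the two readings of a recorded 140: as a diameter in inches, and as a circumference that was never divided by pi. The function name is a hypothetical helper.

```python
import math

CM_PER_INCH = 2.54


def circumference_to_diameter(circumference):
    """Diameter of a circle with the given circumference (same unit)."""
    return circumference / math.pi


recorded = 140  # inches, the suspicious value

# Reading 1: it really is a diameter.
as_diameter_m = recorded * CM_PER_INCH / 100
print(round(as_diameter_m, 2), "m across")  # about 3.56 m

# Reading 2: it is a circumference that was not converted.
implied_diameter_in = circumference_to_diameter(recorded)
print(round(implied_diameter_in, 1), "inch diameter")  # about 44.6 in
```

The second reading gives a large but more believable tree, which is why the forgotten-conversion explanation is worth keeping in mind for the cleaning step.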
but yeah when we move to the cleaning
bit i guess one thing that we would need
to ask ourselves is
is this really unreasonable is this not
really unreasonable and what can we do
with this data point
or data points um
what else do we want to look at i want
to see what are the possible options for
some of these
other columns so latin name probably
there are a lot of different latin names
but you know it could be interesting to see
the distribution of different names so
okay i guess this tree is very common
and then we have less common trees here
and one thing you can do to visualize this
is to turn these counts into a data frame
and then plot it, or maybe
not a histogram, i guess a plot
let me try a bar chart that should work
yeah okay cool yeah i mean not the most readable
chart in the world but at least it gives
you an understanding of uh you know
how many trees there are and which types
there are and this is kind of expected
right you would expect some trees to be
like very common and then as you go it's
like less and less common
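A sketch of counting the latin names and turning the counts into a data frame, assuming pandas. The column name `spc_latin` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up sample of latin names.
trees = pd.DataFrame({
    "spc_latin": [
        "Platanus x acerifolia",
        "Platanus x acerifolia",
        "Gleditsia triacanthos var. inermis",
        "Platanus x acerifolia",
        "Pyrus calleryana",
        "Gleditsia triacanthos var. inermis",
    ]
})

# value_counts sorts from most common to least common.
species_counts = trees["spc_latin"].value_counts()

# Turn the counts into a two-column data frame.
counts_df = species_counts.rename_axis("spc_latin").reset_index(name="n_trees")
print(counts_df)
```

With matplotlib installed, `species_counts.plot.bar()` draws the bar chart; with many species the labels get crowded, which matches the not very readable chart described here.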
all right what else can we look at so um
i saw that yeah there are stewards right
some of the values for the stewards are missing
uh sidewalk is missing, problems is missing
but i want to see like what are the
options for
stewards so we can see here one or two
or none
but what are what are some options
okay so it's either none, one or two,
three or four, or four or more
this is good it looks very um
standardized so you know if everyone had
to write it down by themselves you could
see someone writing one or two
someone else writing it like this someone
else writing it like this so
that's possible, it's good to see that
it's clean
uh i want to see the possibilities for sidewalk
and no damage or damage okay good
um i guess for these ones it's either
yes or no
i'm guessing this could be like a you
know website where they fill in a form
so this looks pretty standard
um i want to see the status and curb
location also
let's do it quickly uh-huh
curb loc was it
no what was it, curb_loc with the underscore
on curb or offset from the curb okay cool
doesn't seem to be a need for uh extra
correction there um
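Checking the possible options for several categorical columns can be done in one loop. This is a sketch assuming pandas; the exact spellings of the categories below are assumptions, not copied from the real data.

```python
import pandas as pd

# Made-up rows with assumed category spellings.
trees = pd.DataFrame({
    "steward": ["None", "1or2", "3or4", "4orMore", None],
    "sidewalk": ["NoDamage", "Damage", "NoDamage", "NoDamage", None],
    "status": ["Alive", "Alive", "Alive", "Alive", "Stump"],
    "curb_loc": ["OnCurb", "OffsetFromCurb", "OnCurb", "OnCurb", "OnCurb"],
})

# Collect the distinct non-missing values of each column.
options = {}
for col in ["steward", "sidewalk", "status", "curb_loc"]:
    options[col] = sorted(trees[col].dropna().unique().tolist())
    print(col, options[col])
```

A short, consistent list per column suggests the values came from a form with fixed choices, so no spelling clean-up is needed.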
all right one last thing that i want to
look into is
if there is um some mismatch between you
know this tree being a stump
and uh the health of the tree for example
if there is any point where it says it's
a stump but it says like health is
good or something like that so
where uh what is the name of that column
uh okay then you know then i can say stumps
so this is like a new subset and
are all the values none for the stumps
no, oh so maybe these, like steward
sidewalk, problems
health and so on, are none for all the stumps
and all the dead trees so let me
see, 13,961, so you know there are
17,654 stumps it says and
13,961 dead trees, that amounts to
31,615 uh
total dead or stump trees
so basically how many values were missing
yeah it's more or less the same so i guess
what happened is if the tree is not
alive they either didn't
bother or they didn't think it
was relevant to fill in the information
for the rest, for the health, latin name, steward
sidewalk, problems etc and probably for
these ones they just put zero or
something like that
okay so that's that's good to know
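A sketch of that cross-check, assuming pandas. The status labels and column names are assumptions, and the tiny table only mimics the pattern; the two totals at the end are the figures quoted above.

```python
import pandas as pd

# Made-up rows that follow the pattern seen in the data.
trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive", "Stump"],
    "health": ["Good", None, None, "Fair", None],
    "steward": ["None", None, None, "1or2", None],
})

# Subset of trees that are not alive.
not_alive = trees[trees["status"].isin(["Stump", "Dead"])]
n_not_alive = len(not_alive)

# Compare with the number of rows where health is missing.
n_missing_health = int(trees["health"].isnull().sum())
all_missing = bool(not_alive["health"].isnull().all())
print(n_not_alive, n_missing_health, all_missing)

# The same comparison with the counts quoted in the video.
stumps, dead = 17_654, 13_961
print(stumps + dead)  # 31615
```

If the two numbers line up, the missing values are explained by the tree status and are not random gaps.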
um and also one other thing that i want
to look at is
okay so what i wanted to see is
actually how many yeses and how many
nos there are for each of these columns
uh of course it's going to seem really
instant to you that i
achieved this but i actually took like
20 minutes or something looking online
to see as you can see from my
searches here um how i can see this
information so basically
uh i found that out so what i need to do
is just
assign this to a data set what should we
call it um
so this is a data set and apparently
you just apply the value_counts function
to each of the series
and then you're able to see the counts
for all of them so
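One way to write this, as a sketch assuming pandas: apply `pd.Series.value_counts` to every problem column at once. The column names and the Yes/No rows are assumptions for illustration.

```python
import pandas as pd

# Made-up Yes/No problem flags.
trees = pd.DataFrame({
    "root_stone": ["Yes", "No", "Yes", "No"],
    "root_grate": ["No", "No", "No", "No"],
    "trunk_wire": ["No", "Yes", "No", "No"],
})

problem_cols = ["root_stone", "root_grate", "trunk_wire"]

# Each column becomes one column of counts, indexed by Yes / No.
yes_no_counts = trees[problem_cols].apply(pd.Series.value_counts)

# A value that never occurs in a column shows up as NaN, so fill with 0.
yes_no_counts = yes_no_counts.fillna(0).astype(int)
print(yes_no_counts)
```

The result is a small table with one row per answer and one column per problem, which is easy to scan or to plot.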
okay so let's see it looks like root stone
problems seem to exist a lot of the time
uh yeah no other problem exists that much
so yeah so problems caused
on the root by stones are a big problem
uh yeah this is just good information to
have you know these are also things like
when you're starting a project this is
some information, some statistics that
you can give to whoever is responsible
for the project or if you just want to
kind of
show that you are progressing with the
problem or with the project
this is some good information uh to
show to people you can also turn this
into like a visualization and show it
that way
if these are relevant things to you okay
so we what we did is
we first looked into what are the
possible features
columns that we can use we decided that
we don't really want to get into the
details of where the tree is located
which borough is responsible for the
tree, which street it's on
uh so we
didn't look into that because we decided
that's not important for
our purposes um we looked into the
columns that this data set has
we chose the relevant columns only, on
these relevant columns we first looked into
the missing values uh again we haven't
done anything with these things yet
that will be in the next video and we
looked into
um you know if these missing values
happen all at the same time or not
then we looked into the uh numeric
values and how they're distributed in general
we looked into the distribution, the
histogram, and saw that some of the values
are looking a little bit suspicious so
we went deeper into those values
and looked at how many of them are this
outrageously high so
if you saw a really um
even distribution here that there are a
lot of dots here and there
then you might think that okay maybe
this is normal to have but when you see
there are only a few here, the kind of
points that look like outliers
then you can decide okay maybe this was
a mistaken
uh input by whoever was collecting the data
same with stumps, we saw this there too so
uh and then we looked into the latin names
what the distribution of different types
of trees
is in new york city and
some other columns just to make sure that all the
categorical values are standard and
there is no
different terminology you know as i said
for one or two there's not like one
dash two or anything like that so
just to make sure that the values are
standardized here we looked at
some of the or most of the columns
and we actually figured out that when it's a stump
or when it's a dead tree the information
on health
latin name, whether it has a steward
or not
uh whether the sidewalk is damaged or not etc
whether there are any problems with the tree
has not been
recorded, this is a good thing to know
about the
data set and we saw
basically the distribution of how
many problems and what kind of problems
there are
on trees so this is a good place to stop
and from now on what we're going to do