YouTube Transcript:
Measure 19 Histogram
Video Transcript
it's just that if you can see here this
is your box plot and this is your
histogram see so histogram is just
another view of what we can see in a box
plot but the reason why I
prefer the box plot is that I
can see a lot of insights but there
are also
strong points that the histogram
does have and the box plot doesn't have so
say for example if you want to let's say
get a view of how this data performed
against certain requirements so
let's say for example if you are
familiar with this view let's say for
example if you have your lower spec
limit and your upper spec limit here
okay and then what the histogram does is
you know chart the data using bars
and on these bars we put what
we call the frequency distribution or
the curve line here so this one
represents again the same things as the box plot
uh the central tendency
and the dispersion
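A minimal Python sketch of the view described here, assuming made-up measurements and made-up spec limits. The names `LSL`, `USL`, `BIN_WIDTH`, and `bin_index` are hypothetical and the numbers are not from the session. It groups the data into bars and counts how many points fall inside the limits.

```python
import math
from collections import Counter

# Hypothetical measurements and spec limits, for illustration only.
data = [5.0, 6.1, 6.3, 6.4, 6.4, 6.5, 6.5, 6.5, 6.6, 6.7, 6.9, 7.1]
LSL, USL = 6.0, 7.0   # lower and upper spec limits (assumed)
BIN_WIDTH = 0.2       # width of each histogram bar (assumed)

def bin_index(x):
    """Index of the bar that x falls into, counted from LSL."""
    # round() guards against floating-point noise before taking the floor.
    return math.floor(round((x - LSL) / BIN_WIDTH, 9))

# Frequency of each bar, and the number of points inside the spec limits.
counts = Counter(bin_index(x) for x in data)
in_spec = sum(1 for x in data if LSL <= x <= USL)

# Text histogram: one row per bar, labeled with the bar's left edge.
for idx in sorted(counts):
    left_edge = LSL + idx * BIN_WIDTH
    print(f"{left_edge:4.1f} | {'#' * counts[idx]}")
print(f"{in_spec} of {len(data)} points are within the spec limits")
```

The tallest row hints at the central tendency and the spread of the rows hints at the dispersion, which is the same reading the curve gives.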
so remember that when you're using just
to refresh
um when you're using a histogram the
width of the distribution or the width
of the
let's say the curve let's just focus on
the curve for simplicity
uh it represents what you call the
dispersion so the wider
the base of your curve of course the
wider the distribution is
so that
particular
width of your histogram the base
width it pretty much corresponds
to if you can see in this
view the entire span of your box plot
but the concentration of course would be
here right in the area of the box or the
interquartile range
so that's going to be the width of the
box the wider the box the wider
the base of the histogram it means
that the wider the dispersion
the data has
now
um the central tendency in most
cases that's going to be
where
the highest bar I mean the
bar with the highest frequency is
located so let's say for example for
this one it's pretty much
probably here
this one it's pretty much probably
here okay so this one it's pretty much
probably here so something like that so
that's how it works
um
can a histogram still detect an
outlier
what do you think
can a histogram also detect an
outlier
with this View
yes it could still detect outliers for
an extremely low value for example that's going
to be an outlier here so it's really
pretty much about how you would want
your data to be presented
some would say I would prefer
using the histogram I'm more comfortable
with it some would say I would rather
use the box plot but if we talk
about process capability measures
um the histogram is used not the box plot so we
will talk more about that if not today
maybe tomorrow so that's uh the
histogram okay it's basically the same
function of uh you know checking on
the distribution but it has
two different forms the
histogram and the box plot okay so this
is basically an example so I'm
gonna jump again to Minitab
okay so let's say let's go back to the
basic example of this one okay
um and let's try to create uh a histogram
so let's sorry that's Graph okay Graph
and
um go to
so we can do a histogram here
okay
and then you just have to click With Fit here
and then you just have to
um so Graph you can do a histogram here
and then With Fit
and then you just have to click this one
and you'll have this view that's one
basic
um flow that we can take
uh I have here maybe about a 6.5
average if I'm not looking at this one
so pretty much here or here
okay or it's basically where the middle
of the
curve lies okay so maybe here so 6.45 if
you look at the average here it's 6.45
okay
so this one
so it's basically if you draw a line in
the middle
of this curve and then check where
it lies on the x-axis then it will
give you the average plus or minus some
difference of course a small difference
okay if we're using the
eyeball method
the dispersion we cannot see the
actual value
but we can see the distribution it's
you know a little wide compared to what
we expect let's say
so you can see the standard deviation as
a measure here so we're using standard
deviation rather than IQR for dispersion
uh we're comparing basically the histogram
to the box plot uh in terms of central
tendency we're using the average rather than
the
um
median in the box plot okay so that's
the
histogram okay using that path so
that's how it looks so you can
still you know
um capture that outlier but there's a
pretty much more convenient
um path so that's basically not under
Graph but rather uh from Stat
uh you go to Basic Statistics and then you
look for this one it's called Graphical
Summary so this graphical summary will
pretty much give you every uh basic
statistic that is available for your consumption
so because we put that five of course
um
uh you're getting a different view
from your PDF okay
so this particular view gives
you the
um gives you the essential statistics
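As a rough stand-in for that kind of summary, here is a short Python sketch using only the standard library. The data values are hypothetical and the exact set of statistics Minitab reports is not reproduced here. It collects the measures discussed in this session side by side: average and standard deviation on one hand, median and IQR on the other.

```python
import statistics

# Hypothetical data set, for illustration only (not the session's data).
data = [5.0, 6.2, 6.3, 6.4, 6.4, 6.5, 6.5, 6.6, 6.7]

# quantiles(..., n=4) returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(data, n=4)

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "stdev": statistics.stdev(data),   # sample standard deviation
    "min": min(data),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(data),
    "iqr": q3 - q1,                    # interquartile range
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Note that quartile conventions differ between tools, so the `q1` and `q3` values from `statistics.quantiles` may not match another package exactly.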
so with this illustration that Vlad
requested how do you think
um outliers affect uh the central
tendency measures
say for example
um
the mean now it becomes
6.42 okay
so because we have a 5
what happens this extremely low value
tends to pull the value of
your average to the left correct because
it gives you know some weight on the
left side of the distribution and it
somehow attracts it like hey
um I'm an outlier I'm actually
inviting the average to go near me
something like that and that creates
some sort of noise and bias
now that is where the value of the
median will be more useful if you have
outliers
okay if you have outliers to avoid uh
the effect to avoid bias
because the average is susceptible
to bias and errors in data and to outliers
so what you want to check is the median
you might want to consider using the
median
as your measure of central tendency this
is not actually
a common practice
especially for organizations that have
been uh using the average as their basic
measure of central tendency but from
time to time uh there are cases where we
really need to resort to using the
median even in our OEE
project from a previous organization what
we used as a metric was not the
average because there are you know lines
there are tools that are performing way
below
the common performance that is being
exhibited by this group of tools or
machines okay so by using the average
the central tendency is somehow
polluted so what we decided was to use
uh the median rather than the mean for that
project
okay just to give you an idea of how
you can play of course with the
statistics and when is the best time to
use those statistics
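The pull of a single low value on the average can be sketched in a few lines of Python. The numbers below are made up for illustration and are not the session's data, so the shift will not match the 6.45 to 6.42 change seen in Minitab.

```python
import statistics

# Hypothetical values, for illustration only (not the session's data).
base = [6.2, 6.3, 6.4, 6.4, 6.5, 6.5, 6.6, 6.7]
with_outlier = base + [5.0]   # add one extremely low value

# How far each measure moves once the outlier is included.
mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)

print(f"mean:   {statistics.mean(base):.3f} -> {statistics.mean(with_outlier):.3f}")
print(f"median: {statistics.median(base):.3f} -> {statistics.median(with_outlier):.3f}")
print(f"mean moved by {mean_shift:.3f}, median moved by {median_shift:.3f}")
```

In this sketch the average moves to the left by roughly three times as much as the median does, which is the reason for preferring the median when outliers are present.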
so we can also check the normality
um a normality test is basically
a requirement before we
do any in-depth data analysis so
in statistics we call it assumptions so
there are assumptions when you do
certain tests that uh the data set
should be following a normal
distribution if not then you'll
deploy
statistical tests that are
intended for non-normally distributed
data so
um statistical tests
used for normally distributed data are
called parametric tests and statistical
tests used for non-normally distributed
data are called non-parametric tests so
normality is a concept that is very
well known in statistics I guess
you've all heard of it from the
previous uh discussion
normality is something like this something like
that but in practical context this is
what will happen so when you're doing a
project or when you're crunching data
first you have to understand how the
data is distributed right so you'll be
doing something like this
okay now
and then if you see that hey
the p-value is not uh 0.05 and
above so which means that
the distribution of the data is not the same
as a normal distribution so basically
it's uh non-normally distributed or we
could say that it's skewed right
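The p-value check can be sketched in plain Python. This is an illustration under stated assumptions: it swaps in the Jarque-Bera test because it needs only the standard library, which is not necessarily the normality test shown in the Minitab output, and its p-value is an approximation that is rough for small samples. The helper `suggest_test_family` and the simulated data are hypothetical.

```python
import math
import random

def jarque_bera(values):
    """Return (JB statistic, approximate p-value) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    return jb, math.exp(-jb / 2.0)

def suggest_test_family(p_value, alpha=0.05):
    """Hypothetical helper: p >= alpha means normality is not rejected."""
    return "parametric" if p_value >= alpha else "non-parametric"

random.seed(0)
# Simulated skewed data, standing in for a real metric.
skewed = [random.expovariate(1.0) for _ in range(200)]
jb, p = jarque_bera(skewed)
print(f"JB = {jb:.1f}, p = {p:.4f}, suggestion: {suggest_test_family(p)}")
```

A p-value below 0.05 only says the data does not look normal. It does not say why, which is the question the discussion turns to next.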
so what will we do next
um
excuse me what will we do next
shall we collect
another set of data so that's the
question right so come to think of this
say for example this data that
you're seeing right now is the data
from the 12 weeks performance of
the primary metric for your project
okay because this is how it would go
this is supposedly how it will go from
Define you have summarized the
data but you haven't created any
distribution like this then you might
want to check the individual
distributions coming from the 12 week
performance because of course
one data point in that 12
week range contains daily data and
that daily data contains another set of
data right because this is just an
aggregation of the whole week so
let's say if we
talk about yield or if we talk about
um
output production there's daily data
within that I mean uh daily data and
there's uh per machine data let's say if
we talk about output if we talk about
let's say
um
yield from
that week there's going to be
um
what they call daily data and the
daily data has maybe uh either per
machine or per lot data of yield
right so and if you talk about for
example Michelle's case uh if we
talk about let's say customer
satisfaction index or the inventory
levels the inventory level uh average for
the week is further divided into
the uh inventory level per day and
that would be um on a per let's say per
SKU or per line item so that's how the
data is
um constructed right or uh the
architecture of the data
so
um imagine that if
you're seeing that data metric in this
particular view
and if you see that hey the p-value
is not saying that you know it's
normally distributed what will you do
next will you collect a new data set
or would you rather understand why do I
have that outlier there
basically we're not targeting
that the data should be
normally distributed okay it's not
always the case because there are data
sets that are intended to be uh
non-normally distributed
all right and that's the common
misconception of course as mentioned
earlier there are tests that are
intended
for normally distributed data and there
are tests that are intended for
not normally distributed data and in
statistics there's a gray
area it's called the central
limit theorem
the central limit theorem states that
um with a certain number of data points if
you talk about I think 41 data points
the distribution of the data might be
non-normally distributed but
if we you know extend the data collection
and increase the number of data points
at some point
uh the averages of the samples will approximately follow a normal
distribution
so with that principle it is being
used abusively to
uh apply a parametric test to
non-normally distributed data
okay
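A small simulation can make the central limit theorem more concrete. The theorem is usually stated for averages of samples rather than for the raw data itself, and that is what this sketch checks. Everything here is an assumption for illustration: the skewed source distribution, `SAMPLE_SIZE`, `NUM_SAMPLES`, and the use of skewness as a rough indicator of how far a distribution is from a symmetric bell shape.

```python
import random

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

random.seed(1)
SAMPLE_SIZE = 40      # assumed number of points averaged per sample
NUM_SAMPLES = 2000    # assumed number of repeated samples

# Raw data drawn from a strongly skewed (exponential) distribution.
raw = [random.expovariate(1.0) for _ in range(NUM_SAMPLES)]

# Averages of repeated samples drawn from the same skewed distribution.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(NUM_SAMPLES)
]

raw_skew = skewness(raw)
means_skew = skewness(sample_means)
print(f"skewness of raw data:     {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The raw data stays skewed no matter how many points are collected. Only the averages move toward a symmetric shape, which is why leaning on this theorem to justify a parametric test on skewed raw data is debated.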
so that's one gray area that I'm
seeing in the field of statistics and I
think if you have some
spare time if you read
articles there's a great debate about it
um
but uh to Vlad's point yeah that's
correct you want to understand what
causes that outlier and uh eventually
remove that outlier so that you can have
a better view not to make the data
normally distributed but to have a
better view of the data without that
outlier
okay having that outlier there is an
insight or would trigger you to
investigate but if you would want to
see the data properly it's either you would
remove that after you understand what
caused that or you would want to use
a central tendency measure that is not
affected by that outlier which is the
median rather than using the average
so we don't have to be concerned that uh
it's not normally distributed so I have
to collect another set of data that
could be the case if you want to prove
that supposedly this data should
be following
a normal distribution but in some cases
there are really factors
that cause it to not be
um normally distributed and that's one
story another story is that there are
data sets for example uh mean time
between failures or maybe uh customer
satisfaction index you would want your
customer satisfaction index the higher
the better so the chart will be
something like
um probably something like this right so
it's skewed because you would want a five
rather than a one right if
you're talking about customer
satisfaction index
okay so for example output is
supposedly higher the better so
let's say for example
you don't want 1000 output
but you would want 5000 output so the
higher the better
okay but if we're talking about
uh data that has plus minus say for
example
um the resistance of a certain PCB
or a certain electronic
component so that's going to be plus
minus let's say two ohms
okay so
that should follow a normal distribution
since you have plus and
minus sides right so that's gonna
be uh how we treat this
um
normal distribution thing okay
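The contrast between the two kinds of data can be sketched with a simple skewness calculation in Python. Both data sets below are invented for illustration: the satisfaction scores assume a 1 to 5 scale where most customers answer 5, and the resistance values are deviations from a nominal value in ohms.

```python
def skewness(values):
    """Moment-based skewness: negative means a longer tail on the left."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical "higher is better" scores piled up near the maximum of 5.
satisfaction = [5] * 12 + [4] * 5 + [3] * 2 + [2] * 1 + [1] * 1

# Hypothetical deviations from nominal resistance, within plus or minus 2 ohms.
resistance_deviation = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]

satisfaction_skew = skewness(satisfaction)
resistance_skew = skewness(resistance_deviation)
print(f"satisfaction skewness: {satisfaction_skew:.2f}")
print(f"resistance skewness:   {resistance_skew:.2f}")
```

The scores come out clearly skewed because they are bounded on the good side, while the two-sided tolerance data is symmetric, so only the second kind would be expected to look normal.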