0:08 If you can see here, this is your box plot and this is your histogram. The histogram is just another view of what we can see in a box plot. The reason I prefer the box plot is that I can see a lot of insights in it, but there are also strong points that the histogram has and the box plot doesn't.
0:43 Say, for example, you want to see how this data performed against certain requirements. If you're familiar with this view, you have your lower spec limits and your upper spec limits here. What the histogram does is chart the data using bar charts, and over those bar charts we put what we call the frequency distribution, or the curve line, here. So this represents the same things as the box plot: the central tendency and the dispersion.
1:29 Just to refresh: when you're using a histogram, the width of the distribution, or the width of the curve (let's just focus on the curve for simplicity), represents what you call the dispersion. The wider the base of your curve, of course, the wider the distribution is.
1:59 The width of your histogram, the base of the curve, pretty much corresponds, if you can see in this view, to the entire span of your box plot, but the concentration of course would be here, in the area of the box, the interquartile range. So that maps to the width of the box: the wider the box, and the wider the base of the histogram, the wider the dispersion the data has.
2:39 The central tendency, in most cases, is going to be where the highest bar is located. So for this one it's pretty much here, for this one probably here, and for this one probably here. Something like that; that's how it works.
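As a rough sketch of the point above, that the histogram's tallest bar sits near the central tendency while its base covers roughly the span of the box plot, here is a minimal Python illustration; the sample data is hypothetical and NumPy is assumed to be available:

```python
import numpy as np

rng = np.random.default_rng(0)                    # hypothetical sample data
data = rng.normal(loc=6.45, scale=0.5, size=500)

# Box-plot view: median and interquartile range (IQR)
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Histogram view: the tallest bar marks the central tendency
counts, edges = np.histogram(data, bins=20)
peak = counts.argmax()
peak_center = (edges[peak] + edges[peak + 1]) / 2

print(f"median={median:.2f}  IQR={iqr:.2f}  tallest bar near {peak_center:.2f}")
```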
3:14 Now, in a histogram, can we still detect an outlier? What do you think: with this view, can a histogram also detect an outlier?
3:26 Yes, it could still detect outliers. An extremely low value, for example, is going to show up as an outlier here. So it's really pretty much about how you want your data to be presented. Some would prefer the histogram ("I'm more comfortable with it"), some would say "I would rather use the box plot." But if we talk about process capability measures, the histogram is used, not the box plot; we'll talk more about that, if not today, then maybe tomorrow. So that's the histogram. It basically has the same function of checking on the distribution, just in a different form, so we have two different forms: the histogram and the box plot. This is basically an example, so I'm going to jump again to Minitab.
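Before the Minitab demo, here is the standard box-plot whisker rule for flagging outliers (points beyond 1.5 times the IQR from the quartiles) sketched in Python; the data values, including the extreme low 5.0, are made up for illustration:

```python
import numpy as np

data = np.array([6.1, 6.3, 6.4, 6.5, 6.5, 6.6, 6.7, 6.8, 5.0])  # 5.0 is an extreme low value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # box-plot whisker rule
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # only 5.0 falls outside the fences
```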
4:19 Okay, let's go back to the basic example, and let's try to create a histogram. Go to Graph, then Histogram, choose "With Fit," click OK, and you'll have this view. That's one basic flow that we can take.
5:12 I have here maybe about a 6.5 average, if I'm not looking at the exact number; it's pretty much here, or basically where the middle of the curve lies. So maybe here: 6.45. If you look at the average here, it's 6.45. Basically, if you draw a line in the middle of this curve and check where it lies on the x-axis, it will give you the average, plus or minus some small difference, of course, if we're using the eyeball method.
6:03 For the dispersion, we cannot see the actual value, but we can see that the distribution is a little wide compared to what we expect, let's say. You can see the standard deviation as a measure here, so we're using standard deviation rather than the IQR for dispersion. We're basically comparing the histogram to the box plot: in terms of central tendency, we're using the average rather than the median in the box plot. So that's the histogram using that path; that's how it looks.
6:56 You can still capture that outlier this way, but there's a more convenient path. It's not under Graph but rather under Stat: go to Basic Statistics and look for Graphical Summary. This graphical summary will pretty much give you every basic statistic that is available for your consumption.
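As a rough Python equivalent of what such a summary computes numerically (the data values here are hypothetical, with one low outlier at 5.0):

```python
import numpy as np

data = np.array([6.4, 6.5, 6.3, 6.6, 6.4, 6.5, 6.5, 6.4, 6.6, 6.5, 6.4, 5.0])

summary = {
    "mean":    data.mean(),
    "std_dev": data.std(ddof=1),     # sample standard deviation
    "min":     data.min(),
    "q1":      np.percentile(data, 25),
    "median":  np.percentile(data, 50),
    "q3":      np.percentile(data, 75),
    "max":     data.max(),
}
for name, value in summary.items():
    print(f"{name:8s} {value:.3f}")
```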
7:27 So, because we put that 5 in, of course, you're getting a different view from your PDF. This particular view gives you the essential statistics. Now, with this illustration that Vlad requested: how do you think outliers affect the central tendency measures? Take the mean, for example; it now becomes 6.42.
8:11 Because we have a 5, what happens? This extremely low value tends to pull the value of your average to the left, correct? It gives some weight to the left side of the distribution and somehow attracts the average, as if the outlier is saying, "Hey, I'm inviting the average to come near me." Something like that, and that creates some sort of noise and bias. Now, that is where the median will be more useful: when you have outliers.
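A quick sketch of that pull in Python, with made-up weekly values sitting around 6.45 plus a single extremely low week at 5.0:

```python
import numpy as np

weekly = np.array([6.4, 6.5, 6.3, 6.6, 6.4, 6.5, 6.5, 6.4, 6.6, 6.5, 6.4])

with_outlier = np.append(weekly, 5.0)   # one extremely low value

print(np.mean(weekly), np.median(weekly))              # both near 6.45
print(np.mean(with_outlier), np.median(with_outlier))  # mean drops, median holds
```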
8:52 If you have outliers, to avoid their effect, to avoid bias, because the average is susceptible to outliers, you might want to consider using the median as your measure of central tendency. This is not actually a common practice, especially for organizations that have been using the average as their basic measure of central tendency, but from time to time there are cases where we really need to resort to the median. 9:36 Even in our OEE project at a previous organization, what we used as a metric was not the average, because there are lines, there are tools, that perform way below the common performance exhibited by that group of tools or machines. By using the average, the central tendency is somehow polluted, so what we decided was to use the median rather than the mean for that project. That's just to give you an idea of how you can play with the statistics, of course, and when is the best time to use each one.
10:28 We can also check the normality. A normality test is basically a requirement before we do any in-depth data analysis. In statistics we call these assumptions: when you do certain tests, there is an assumption that the data set should follow a normal distribution; if not, then you deploy statistical tests that are intended for non-normally distributed data. Statistical tests used for normally distributed data are called parametric tests, and statistical tests used for non-normally distributed data are called non-parametric tests.
11:15 Normality is a concept that is very well known in statistics; I guess you've all heard of it from the previous discussion. The normal distribution looks something like this. But in a practical context, this is what will happen: when you're doing a project, or when you're crunching data, first you have to understand how the data is distributed, right? So you'll be doing something like this. Now, if you see that the p-value is not 0.05 or above, that means the distribution of the data is not the same as a normal distribution, so it's basically non-normally distributed, or we could say it's skewed, right?
12:06 So what will we do next? Excuse me, what will we do next: shall we collect another set of data? That's the question, right? Come to think of it, say for example this data that you're seeing right now is from the 12-week performance of your primary metric for your project. 12:32 Because this is supposedly how it would go from Define: you have summarized the data, but you haven't created any distribution like this, so you might want to check the individual distributions coming from that 12-week performance. One data point in that 12-week range contains daily data, and that daily data contains another set of data, right? The weekly figure is just an aggregation of the whole week. 13:02 If we talk about yield or output production, there's daily data within that mean, and the daily data has maybe either per-machine or per-lot yield data, right? And for Michelle's case, if we talk about, let's say, the customer satisfaction index or the inventory levels, the average inventory level for the week is further divided into the inventory level per day, and that would be on a per-SKU or per-line-item basis. So that's how the data is constructed, the architecture of the data.
14:06 Imagine you're seeing that metric in this particular view, and you see that, hey, the p-value is not saying it's normally distributed. What will you do next: will you collect a new data set, or would you rather understand why you have that outlier there? We're not basically targeting that the data should be normally distributed. That's not always the case, because there are data sets that are intended to be non-normally distributed.
14:48 And that's a common misconception. Of course, as mentioned earlier, there are tests intended for normally distributed data and tests intended for non-normally distributed data, and in statistics there's a gray area called the central limit theorem. 15:12 The central limit theorem states that with a certain number of data points, I think 41 data points, the distribution of the data might be non-normally distributed, but if we extend the data collection and increase the number of data points, at some point the distribution of the sample averages will follow a normal distribution. With that principle, it is abusively used to justify applying a parametric test to non-normally distributed data.
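What the central limit theorem actually guarantees is that averages of samples become approximately normal even when the raw data stays skewed; here is a sketch with random, illustrative data (NumPy and SciPy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
population = rng.exponential(scale=1.0, size=100_000)   # heavily right-skewed raw data

# Average many samples of size 50: the distribution of those means is far more symmetric
sample_means = rng.choice(population, size=(1000, 50)).mean(axis=1)

print(stats.skew(population))     # around 2: strongly skewed
print(stats.skew(sample_means))   # much closer to 0
```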
15:49 That's one gray area that I'm seeing in the field of statistics, and I think if you have some spare time and read through the articles, there's a great debate about it. But to Vlad's point, yes, that's correct: you want to understand what causes that outlier and eventually remove it, so that you can have a better view. Not to make the data normally distributed, but to have a better view of the data without that outlier.
16:23 Having that outlier there is an insight, or would trigger you to investigate. But if you want to treat it properly, either you remove it after you understand what caused it, or you use a central tendency measure that is not affected by that outlier, which is the median rather than the average.
16:48 So we don't have to think, "it's not normally distributed, so I have to collect another set of data." That could be the case if you want to prove that this data should supposedly follow a normal distribution, but in some cases there are really factors that cause it to not be normally distributed, and that's one story. 17:08 Another story is that there are data sets, for example mean time between failures or maybe the customer satisfaction index, where the higher the value the better, so the chart will look something like this. It's skewed, because you would want a five rather than a one, right, if you're talking about a customer satisfaction index.
17:34 For example, output is supposedly the higher the better: you don't want 1,000 output, you would want 5,000 output. 17:47 But if we're talking about data that has a plus-minus tolerance, say for example the resistance of a certain PCB or a certain electronic component, that's going to be plus or minus, let's say, two ohms. That should follow a normal distribution, since you have both plus and minus sides, right? So that's how we treat this normal distribution thing, okay?