0:03 hi we are now on the second graphical
0:04 analysis tool
0:07 and this time we will look at the box plot
0:10 the box plot is also known as the box and whisker plot
0:14 box plot is defined as a graphical
0:17 method of displaying variation
0:20 in a set of data in most cases
0:22 a histogram analysis provides a
0:24 sufficient display
0:26 but a box and whisker plot can provide
0:27 additional detail
0:30 while allowing multiple sets of data to
0:32 be displayed in the same graph
0:35 like histogram this falls under the data
0:37 distribution tools
0:40 but unlike histogram box plot can give
0:42 us an idea of central tendency
0:45 and dispersion better if you want to compare
0:48 data sets and their measures of central tendency
0:51 and dispersion at the same time you
0:53 can actually use
0:56 box plot this is my personal favorite
0:58 because i have done so many things using
1:01 box plot and i have established a system
1:03 using a box plot that i can share with
1:06 you in the next section
1:08 now let's study the anatomy of a box
1:10 plot a box plot
1:12 technically divides the data set into four
1:16 partitions each partition is called a quartile
1:20 just like how a year is divided
1:21 into four
1:24 parts which we call quarters now for every quartile
1:28 there is 25 percent of the data set within it
1:32 now if we divide the data into four
1:33 we will have four
1:36 partitions each holding 25 percent of
1:39 the data the upper boundary of the first partition is
1:43 the first quartile at or below which 25 percent
1:46 of the data is located the second is the
1:47 second quartile
1:51 for the second quartile it would be 25 percent
1:54 and another 25 percent which is 50 percent of the data
1:58 at or below it it is also known as the median
2:01 one of the central tendency measures
2:04 next is the third
2:06 quartile so we have to add another 25
2:08 percent to our second quartile so
2:13 75 percent of the data is at or below it
2:15 and the last will be the fourth quartile
2:17 which is the 100 percent
2:20 mark which means 100 percent of the data is
2:22 at or below this number
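As a minimal sketch of the quartile idea described above, the Python snippet below computes the three cut points that split a data set into four parts. The sample values are hypothetical (not the course data), and `method="exclusive"` is an assumption: it uses the (n + 1) position rule, which to my understanding is close to what Minitab reports, but other tools may interpolate differently.

```python
import statistics

# Hypothetical sample for illustration only (not the energy cost data).
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]

# quantiles(n=4) returns the three cut points between the four partitions.
# Assumption: the "exclusive" method, which uses the (n + 1) position rule.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

print("first quartile (25 percent at or below):", q1)
print("second quartile / median (50 percent):", q2)
print("third quartile (75 percent):", q3)
print("fourth quartile (100 percent, the maximum):", max(data))
```

The second quartile printed here is the same number that `statistics.median(data)` returns, which is why the box plot can use it as its measure of central tendency.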
2:25 now we can actually detect outliers
2:28 denoted by an asterisk symbol whenever
2:29 you are using
2:32 the box plot if the tails which we
2:33 call the whiskers
2:37 this one and this one cannot reach
2:39 a very high or very low
2:40 data value
2:44 then the box plot will tag it as an outlier
2:47 or an unusual observation if you can
2:49 still remember our study about central
2:51 tendency measures
2:53 the mean is susceptible to outliers that is why
2:56 we are using the median when we are using
2:59 box plot as our graphical analysis tool
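The outlier tagging and the mean versus median point above can be sketched in Python as follows. The lesson does not state how far the whiskers are allowed to reach; the 1.5 times IQR limit used here is the common box plot convention and is an assumption on my part, as are the sample values and the helper name `find_outliers`.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values beyond the whisker limits.

    Assumption: whiskers may extend at most k * IQR past the box edges
    (k = 1.5 is the usual convention, not something stated in the lesson).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    return [v for v in values if v < lower_limit or v > upper_limit]

# Hypothetical samples: identical except for the last value.
clean = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]
with_outlier = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 40]

print("outliers:", find_outliers(clean), find_outliers(with_outlier))
print("means:", statistics.mean(clean), statistics.mean(with_outlier))
print("medians:", statistics.median(clean), statistics.median(with_outlier))
```

In this sketch the single extreme value pulls the mean upward while the median does not move, which is the reason the box plot is built around the median.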
3:03 in box plot central tendency is based on
3:04 the median
3:07 and dispersion is based on what we call
3:09 the interquartile range or
3:12 iqr the iqr is the difference between the third quartile
3:17 and the first quartile practically speaking
3:21 if we have a larger iqr
3:25 we have a taller box
3:28 and if we have a taller
3:30 box or a greater span of
3:30 the box
3:32 there is a higher amount of variation present
3:36 now if we want to check for a smaller
3:37 amount of variation
3:40 we're looking for smaller boxes
3:43 the smaller the size of the box the smaller
3:44 the amount of variation
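To illustrate the link between box size and variation, here is a small sketch that computes the IQR (third quartile minus first quartile) for two hypothetical data sets with the same median. The data and the helper name `iqr` are assumptions for illustration, and the quartile method is the (n + 1) rule assumed to be close to Minitab's.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

# Hypothetical data sets with the same median but different spread.
narrow = [18, 19, 20, 20, 21, 22, 23]
wide = [5, 10, 15, 20, 25, 30, 35]

print("medians:", statistics.median(narrow), statistics.median(wide))
print("iqr narrow:", iqr(narrow))  # shorter box, less variation
print("iqr wide:", iqr(wide))      # taller box, more variation
```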
3:47 so let's put it into practice we have to
3:49 go to minitab again
3:52 but before that let's go to our
3:53 worksheet and
3:56 copy energy okay so from the worksheet we
3:57 have to copy here
3:59 the same data set that we used in our
4:01 histogram case study
4:04 so this is again energy cost now we want
4:06 to check the distribution of the energy cost
4:11 using a box plot
4:15 we have to click graph and then
4:19 find box plot here it is click box plot
4:22 now we have again simple y here one
4:24 column of data so we will be using
4:25 this one
4:29 on the upper left we have to double click it
4:32 and what is the variable
4:35 energy cost double left click it after that
4:37 we have to click ok
4:39 drag your worksheet down a little and
4:41 then adjust so you can see
4:44 now we have a box plot of distribution
4:46 of the energy cost
4:48 you can see here on the left side or the y-axis
4:52 the data values of your
4:53 energy cost
4:56 you don't have anything on your
4:58 x-axis because energy cost is not yet plotted
5:01 as a function of a particular x variable
5:02 so for now
5:05 it's blank now how to read this you can
5:09 actually get the values of the quartiles
5:11 and everything that you want to know
5:13 about the box plot by putting your
5:17 mouse cursor on top of your box plot
5:20 so it reveals that the first quartile is
5:22 197.5
5:25 the second quartile or median is 320 the third
5:28 quartile is 447.5
5:32 the iqr is 250 the whiskers
5:36 span from 7 to 676
5:39 and the number of data values that we have is
5:42 25 data points now how to
5:47 interpret it it simply says that 25 percent
5:50 of the data is equal to or less than 197.5
5:56 and the remaining 75 percent
6:00 is more than that value for the median
6:04 it simply says that there is 50 percent
6:08 of the data above
6:10 and below that number because it's the midpoint
6:16 now talking about the third quartile 447.5
6:21 75 percent of the data is below that number
6:25 and the remaining 25 percent is above that number
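The numbers read from the Minitab tooltip can be cross-checked with a few lines of Python. The quartile values and the count of 25 come from the lesson; the (n + 1) position rule used to locate each quartile in the sorted data is an assumption about how the software computes them.

```python
# Values read from the box plot tooltip in the lesson.
q1, median, q3 = 197.5, 320.0, 447.5
n = 25  # number of data points

# The IQR is the third quartile minus the first quartile.
iqr = q3 - q1
print("iqr:", iqr)

# Assumption: each quartile sits at position (n + 1) * p in the sorted data.
positions = {
    "Q1": (n + 1) * 0.25,
    "median": (n + 1) * 0.50,
    "Q3": (n + 1) * 0.75,
}
print(positions)
```

Under that assumption the first quartile falls between the 6th and 7th sorted values, so about 6 of the 25 points (roughly 25 percent) lie at or below 197.5.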
6:29 that is how we use the data and the
6:31 quartile values
6:34 as we use the box plot in order for us to analyze
6:38 our data set and our data set has no
6:42 missing values and no outliers
6:44 because we don't see any asterisk on the
6:45 chart that has been
6:48 created here say we have
6:51 a target value let's say our target value
6:55 is 300 now
6:58 let's put a reference line so
7:02 we have to right click here and then
7:06 click edit graph okay so this
7:08 graph will appear then you have to
7:10 right click and then add
7:14 now we will be adding a reference line
7:17 why because we want to check
7:19 where we are against our target so a
7:21 reference line could help us
7:24 visualize that so we want to
7:27 put a reference line at a y
7:28 value so
7:31 let's type 300 as mentioned
7:35 and then click ok and there will
7:38 appear a 300 here now click ok
7:41 for you to apply that now if we have a
7:45 target of 300 how can we answer the
7:47 question of how many of the data points
7:49 are already meeting
7:52 the 300 target because this is cost
7:52 we want
7:56 lower the better so again we have to
7:58 put our cursor on the box plot so we can
8:00 see the value of the median
8:03 so the closest value is the median okay
8:05 so you will be using the quartile
8:08 value closest to the target when interpreting
8:12 in this case the value closest to 300 is
8:15 the median which is 320 using an
8:17 estimation based on the visual
8:19 output or the graphical output of our
8:20 box plot
8:23 we can say that a little under 50 percent of the data
8:27 is actually meeting the target of 300
8:28 based on the median
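The visual estimate above can be checked exactly when the raw values are available. The sketch below counts the share of points at or below a target for a lower-the-better measure. The ten cost values are hypothetical (chosen only so the median is 320, as in the lesson), and the function name `share_meeting_target` is an illustrative assumption.

```python
import statistics

def share_meeting_target(values, target):
    """Fraction of values at or below the target (lower is better)."""
    meeting = sum(1 for v in values if v <= target)
    return meeting / len(values)

# Hypothetical costs, not the worksheet data from the lesson.
costs = [120, 180, 250, 290, 310, 330, 400, 450, 520, 600]
target = 300

print("median:", statistics.median(costs))
print("share meeting target:", share_meeting_target(costs, target))
```

Because the median sits slightly above the target in this example, the share meeting the target comes out a little under one half, in line with the reading of the chart.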
8:31 so that is how we can use box plot to
8:33 interpret the results
8:36 of our data as we visualize them moving
8:38 forward you can use box plot
8:40 to check the distribution of your data
8:42 set against your target
8:45 to check your process capability that is how capable
8:47 you are of meeting the target so you
8:49 will have an idea or understanding
8:53 of how big the problem that you are facing is
8:57 based on historical data now let's take
8:59 another example
9:02 this time we will use the same data set
9:03 from the histogram lesson
9:07 the fertilizer problem but now using
9:10 box plot so now let's go to our
9:21 i have to create a new worksheet i have
9:23 to close this one
9:26 and then paste it
9:28 now we will have to create a box plot
9:29 for this case study
9:33 we have to go to graph and then
9:36 box plot earlier we used
9:39 simple y because we had one column of
9:40 data but
9:43 we now have three columns of data so we
9:45 have to choose
9:47 the lower left which corresponds to
9:49 multiple y
9:52 then i have to click okay now i have to
9:54 repeat the same process
9:58 highlight c1 drag down and then select
10:00 make sure that all of the data variables
10:01 are here
10:04 if it's already there you have to click ok
10:06 for you to generate the chart and then
10:09 you have your chart already
10:13 now let's move this and interpret
10:15 so as you can see this is now a
10:16 representation of
10:18 the previous data set that we have
10:20 regarding fertilizer
10:22 you can see here on the y-axis we have
10:23 the data
10:26 which is the plant height in centimeters
10:28 for the three conditions that we have
10:32 we have none grow fast and super plant
10:36 now as mentioned when using a box plot
10:40 central tendency is given by the
10:43 median remember the median is the line
10:48 inside the box so these are the medians
10:52 okay for dispersion it is
10:55 the height of the box so this
10:58 height and this height of the boxes
11:01 now if you are asked to check which has the
11:06 highest central tendency
11:09 and to put it in context this is
11:11 plant height so we want higher the
11:13 better from zero
11:17 to higher values so we're looking for
11:20 a central tendency measure or median
11:21 which is closer to
11:24 the upper part of this chart which is 40
11:26 on this particular axis
11:30 now which has the highest
11:33 median among the three
11:36 okay so we have grow fast so we can check
11:41 for none the median is 18 for grow fast the
11:43 median is 25.5
11:46 and for super plant the median is 21
11:48 therefore the highest median
11:51 is yes grow fast
11:53 okay so that's for the measure of central tendency
11:58 how about for measures of dispersion
12:00 we have to check which has the smallest
12:04 box so which has the smallest box
12:09 using visual judgment we have
12:13 grow fast again we can check using what
12:16 yes the iqr so the iqr
12:19 for none is 8 for grow fast
12:22 it's 6.25 and for
12:25 super plant it's 8 therefore the smallest
12:30 amount of variation can be found in
12:32 grow fast because it has the smallest iqr
12:35 and visually speaking it has the smallest
12:40 size of the box okay so therefore
12:43 you can use the box plot if you want to
12:45 compare categories of data
12:48 the y should be continuous data here your y is
12:50 plant height and these conditions are your x's
12:54 plant height in terms of the
12:57 condition of whether there is fertilizer
12:59 or no fertilizer so again y
13:02 is continuous and x is categorical
13:05 so if you have that kind of data set
13:07 you can use this
13:09 for your data analysis or your root
13:11 cause analysis
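The fertilizer comparison above (continuous y, categorical x) can be summarized in a short sketch. The medians and IQR values are the ones read from the box plot in the lesson; the dictionary layout and variable names are assumptions made for illustration.

```python
# Median and IQR per fertilizer condition, as read from the box plot
# in the lesson (plant height in centimeters).
summary = {
    "None": {"median": 18.0, "iqr": 8.0},
    "Grow Fast": {"median": 25.5, "iqr": 6.25},
    "Super Plant": {"median": 21.0, "iqr": 8.0},
}

# Higher is better for plant height, so look for the largest median.
highest_median = max(summary, key=lambda group: summary[group]["median"])

# Less variation means a smaller box, so look for the smallest IQR.
smallest_iqr = min(summary, key=lambda group: summary[group]["iqr"])

print("highest median:", highest_median)
print("smallest iqr:", smallest_iqr)
```

Both checks point to the same condition here, which matches the visual reading of the chart.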
13:13 moving forward you will be using box
13:14 plot as you prove
13:17 your root cause analysis in your case study