YouTube Transcript: Statistics - A Full Lecture to learn Data Science (2025 Version)
Video Transcript
Hi everyone, and welcome to our full and free tutorial on statistics. Before we dive in, we want to say a huge thank you: last year we launched our first ever full course on statistics, and the response was absolutely amazing. We are on track to reach 1 million views within just one year. This inspired us to create a revised version for 2025. It was a big project, but there it is. Thank you for being such an amazing community.

So let's start. This video is designed to guide you through the fundamental concepts and the most powerful statistical tests used in research today. From the basics of descriptive statistics to the complexities of regression and beyond, we'll explore how each method fits into the bigger picture of data analysis. And don't worry if you have no clue about statistics — we will go through everything step by step. If you like, you can find all topics in our book as well; the link is in the video description.

So what is the outline of the video? Our video has three major parts. In the first part we discuss what statistics is and what the differences between descriptive and inferential statistics are. In the second part we go through the most common hypothesis tests, like the t-test and the ANOVA, and discuss the difference between parametric and non-parametric tests. In the third part we take a look at correlation analysis and regression analysis, and finally we talk about cluster analysis. We have prepared detailed videos for each section, so let's start with the video that explains what statistics is. After this video you will know what statistics is, what descriptive statistics is, and what inferential statistics is.

So let's start with the first question: what is statistics? Statistics deals with the collection, analysis, and presentation of data. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Gender and newspaper are then our so-called variables, the things we want to analyze. In order to analyze whether gender has an influence on the preferred newspaper, we first need to collect data. To do this, we create a questionnaire that asks about gender and preferred newspaper. We then send out the survey and wait two weeks. Afterwards we can display the received answers in a table. In this table we have one column for each variable — one for gender and one for newspaper — and each row is the response of one surveyed person: the first respondent is male and stated New York Post, the second is female and stated USA Today, and so on. Of course, the data does not have to come from a survey; it can also come from an experiment in which you, for example, want to study the effect of two drugs on blood pressure.

Now the first step is done: we have collected data and we can start analyzing it. But what do we actually want to analyze? We did not survey the entire population — we took a sample. Now the big question is: do we just want to describe the sample data, or do we want to make a statement about the whole population? If our aim is limited to the sample itself, i.e., we only want to describe the collected data, we use descriptive statistics. Descriptive statistics provides a detailed summary of the sample. However, if we want to draw conclusions about the population as a whole, inferential statistics is used. This approach allows us to make educated guesses about the population based on the sample data. Let us take a closer look at both methods, starting with descriptive statistics.

Why is descriptive statistics so important? Let's say a company wants to know how its employees travel to work, so it creates a survey to answer this question. Once enough data has been collected, this data can be analyzed using descriptive statistics. But what is descriptive statistics? Descriptive statistics aims to describe and summarize a data set in a meaningful way. It is important to note that descriptive statistics only describes the collected data, without drawing conclusions about a larger population. Put simply: just because we know how the surveyed people from one company get to work, we cannot say how all employees of the company get to work — that is the task of inferential statistics, which we will discuss later.

To describe data descriptively, we now look at four key components: measures of central tendency, measures of dispersion, frequency tables, and charts. Let's start with the first one, measures of central tendency. Measures of central tendency are, for example, the mean, the median, and the mode.

Let's first have a look at the mean. The arithmetic mean is the sum of all observations divided by the number of observations. An example: imagine we have the test scores of five students. To find the mean score, we sum up all the scores and divide by the number of scores; the mean test score of these five students is therefore 86.6.

What about the median? When the values in a data set are arranged in ascending order, the median is the middle value. If there is an odd number of data points, the median is simply the middle value; if there is an even number of data points, the median is the average of the two middle values. It is important to note that the median is resistant to extreme values, or outliers. Look at this example: no matter how tall the last person is, the person in the middle remains the person in the middle, so the median does not change. The mean, however, is affected by how tall the last person is — the mean is therefore not robust to outliers.

Let's continue with the mode. The mode refers to the value or values that appear most frequently in a set of data. For example, if 14 people travel to work by car, six by bike, five walk, and five take public transport, then "car" occurs most often and is therefore the mode.
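As a quick illustration, here is a minimal Python sketch of the three measures. The numbers are made up: the five scores are hypothetical values chosen so that their mean comes out to 86.6 as in the video, and the commute counts follow the example above.

```python
from statistics import mean, median, mode

# Five hypothetical test scores (chosen so the mean matches the 86.6 from the video)
scores = [90, 85, 80, 88, 90]

print(mean(scores))    # 86.6 -> sum of all observations divided by their number
print(median(scores))  # 88   -> middle value of the sorted scores

# Mode of the commute example: 14 car, 6 bike, 5 walk, 5 public transport
commute = ["car"] * 14 + ["bike"] * 6 + ["walk"] * 5 + ["public transport"] * 5
print(mode(commute))   # 'car' -> the most frequent value
```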
Great, let's continue with the measures of dispersion. Measures of dispersion describe how spread out the values in a data set are. Measures of dispersion are, for example, the variance and standard deviation, the range, and the interquartile range.

Let's start with the standard deviation. The standard deviation indicates the average distance between each data point and the mean. What does that mean? Each person has some deviation from the mean, and we want to know how much the persons deviate from the mean value on average. In this example, the average deviation from the mean value is 11.5 cm. To calculate the standard deviation we can use the equation σ = √((1/n) · Σ(xᵢ − x̄)²), where σ is the standard deviation, n is the number of persons, xᵢ is the height of each person, and x̄ is the mean value of all persons. But attention: there are two slightly different equations for the standard deviation. The difference is that we have once 1/n and once 1/(n − 1). To keep it simple: if our survey doesn't cover the whole population, we always use the equation with 1/(n − 1) to estimate the standard deviation. Likewise, if we have conducted a clinical study, we also use this equation to estimate the standard deviation.

But what is the difference between the standard deviation and the variance? As we now know, the standard deviation is the quadratic mean of the distances from the mean; the variance is simply the squared standard deviation. If you want to know more details about the standard deviation and the variance, please watch our video.
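Here is a small sketch of the two versions of the formula in Python, using hypothetical heights rather than the data from the video; NumPy's `ddof` argument switches between the 1/n and the 1/(n − 1) equation.

```python
import numpy as np

# Hypothetical heights in cm (illustrative values, not the data from the video)
heights = np.array([160, 172, 168, 181, 175])
x_bar = heights.mean()

# Population formula: divide by n       -> use when the data cover the whole population
sigma = np.sqrt(((heights - x_bar) ** 2).sum() / len(heights))

# Sample formula: divide by n - 1       -> use when the data are only a sample (e.g. a survey)
s = np.sqrt(((heights - x_bar) ** 2).sum() / (len(heights) - 1))

print(sigma, heights.std(ddof=0))   # same value: NumPy's default is the 1/n version
print(s, heights.std(ddof=1))       # same value: ddof=1 gives the 1/(n-1) version
print(heights.var(ddof=1), s ** 2)  # the variance is just the squared standard deviation
```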
Let's move on to the range and the interquartile range — they are easy to understand. The range is simply the difference between the maximum and the minimum value. The interquartile range represents the middle 50% of the data: it is the difference between the first quartile Q1 and the third quartile Q3. 25% of the values are smaller than Q1 and 25% of the values are larger than Q3, so the interquartile range contains exactly the middle 50% of the values.
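A minimal Python sketch of both measures, with made-up values:

```python
import numpy as np

# Hypothetical blood-pressure readings (illustrative values only)
values = np.array([118, 121, 125, 129, 133, 138, 142, 150])

value_range = values.max() - values.min()   # range = maximum - minimum
q1, q3 = np.percentile(values, [25, 75])    # first and third quartile
iqr = q3 - q1                               # interquartile range = middle 50% of the data

print(value_range, q1, q3, iqr)
```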
Before we get to the last two points, let's briefly compare measures of central tendency and measures of dispersion. Let's say we measured the blood pressure of patients. Measures of central tendency provide a single value that represents the entire data set, helping to identify a central value around which the data points tend to cluster. Measures of dispersion — like the standard deviation, the range, and the interquartile range — indicate how spread out the data points are, whether they are closely packed around the center or spread far from it. In summary: while measures of central tendency provide a central point of the data set, measures of dispersion describe how the data is spread around that center.

Let's move on to tables. Here we will have a look at the most important ones: frequency tables and contingency tables. A frequency table displays how often each distinct value appears in a data set. Let's have a closer look at the example from the beginning: a company surveyed its employees to find out how they get to work, and the options given were car, bicycle, walk, and public transport. Here are the results from 30 employees — the first answer is car, the next walk, and so on. Now we can create a frequency table to summarize this data. To do this, we simply enter the four possible options — car, bicycle, walk, and public transport — in the first column and then count how often they occurred. From the table it is evident that the most common mode of transport among the employees is by car, with 14 employees preferring it. The frequency table thus provides a clear and concise summary of the data.

But what if we have not only one but two categorical variables? This is where the contingency table, also called a cross table, comes in. Imagine the company doesn't have one factory but two — one in Detroit and one in Cleveland — so we also ask the employees at which location they work. If we want to display both variables, we can use a contingency table. A contingency table provides a way to analyze and compare the relationship between two categorical variables: the rows of a contingency table represent the categories of one variable, while the columns represent the categories of the other variable. Each cell in the table shows the number of observations that fall into the corresponding category combination. For example, the first cell shows that "car" and "Detroit" were answered six times.
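The same two kinds of tables can be built with pandas. The sketch below uses a tiny made-up version of the commute survey, not the actual 30 responses from the video:

```python
import pandas as pd

# A small, made-up version of the commute survey (illustrative only)
df = pd.DataFrame({
    "transport": ["car", "walk", "car", "bicycle", "public transport", "car", "walk", "car"],
    "site":      ["Detroit", "Detroit", "Cleveland", "Detroit", "Cleveland", "Detroit", "Cleveland", "Detroit"],
})

# Frequency table: how often each distinct value appears
print(df["transport"].value_counts())

# Contingency table (cross table): counts for every combination of the two categorical variables
print(pd.crosstab(df["transport"], df["site"]))
```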
And what about the charts? Let's take a look at the most important ones. To do this, let's simply use DATAtab. If you like, you can load this sample data set with the link in the video description, or you just copy your own data into this table. Below you can see the variables "distance to work", "mode of transport", and "site". DATAtab gives you a hint about the level of measurement, but you can also change it here. Now, if we only click on "mode of transport", we get a frequency table, and we can also display the percentage values. If we scroll down, we get a bar chart and a pie chart. Here on the left we can adjust further settings — for example, we can specify whether we want to display the frequencies or the percentage values, or whether the bars should be vertical or horizontal. If you also select "site", we get a cross table and a grouped bar chart; for the diagrams we can specify whether we want the chart to be grouped or stacked. If we click on "distance to work" and "mode of transport", we get a bar chart where the height of the bar shows the mean value of the individual groups, and here we can also display the dispersion. We also get a histogram, a box plot, a violin plot, and a rainbow plot. If you would like to know more about what a box plot, a violin plot, and a rainbow plot are, take a look at my videos.
Let's continue with inferential statistics. At the beginning we briefly go through what inferential statistics is, and then I'll explain the six key components to you.

So what is inferential statistics? Inferential statistics allows us to make a conclusion, or inference, about a population based on data from a sample. What is the population and what is the sample? The population is the whole group we're interested in: if you want to study the average height of all adults in the United States, then the population would be all adults in the United States. The sample is a smaller group we actually study, chosen from the population — for example, 150 adults selected from the United States. Now we want to use the sample to make a statement about the population, and here are the six steps for doing that.

Number one: the hypothesis. First we need a statement, a hypothesis, that we want to test. For example, you want to know whether a drug has a positive effect on blood pressure in people with high blood pressure. But what's next? In our hypothesis we stated that we would like to study people with high blood pressure, so our population is all people with high blood pressure in, for example, the US. Obviously we cannot collect data from the whole population, so we take a sample from the population. Now we use this sample to make a statement about the population. But how do we do that? For this we need a hypothesis test. Hypothesis testing is a method for testing a claim about a parameter in a population using data measured in a sample — exactly what we need. There are many different hypothesis tests; at the end of this video I will give you a guide on how to find the right test, and of course you can find videos about many more hypothesis tests on our channel.

But how does a hypothesis test work? When we conduct a hypothesis test, we start with a research hypothesis, also called the alternative hypothesis. This is the hypothesis we are trying to find evidence for; in our case, the research hypothesis is "the drug has an effect on blood pressure". But we cannot test this hypothesis directly with a classical hypothesis test, so we test the opposite hypothesis: that the drug has no effect on blood pressure. What does that mean? First we assume that the drug has no effect in the population — we therefore assume that, in general, people who take the drug and people who don't take the drug have the same blood pressure on average. If we now take a random sample and it turns out that the drug has a large effect in the sample, we can ask how likely it is to draw such a sample, or one that deviates even more, if the drug actually has no effect — if, in reality, there is on average no difference in the population. If this probability is very low, we can ask ourselves: maybe the drug does have an effect in the population, and we may have enough evidence to reject the null hypothesis that the drug has no effect. And it is this probability that is called the p-value.

Let's summarize this in three simple steps. Number one: the null hypothesis states that there is no difference in the population. Number two: the hypothesis test calculates how much the sample deviates from the null hypothesis. Number three: the p-value indicates the probability of getting a sample that deviates as much as our sample, or one that deviates even more, assuming the null hypothesis is true.

But at what point is the p-value small enough for us to reject the null hypothesis? This brings us to the next point: statistical significance. If the p-value is less than a predetermined threshold, the result is considered statistically significant. This means that the result is unlikely to have occurred by chance alone and that we have enough evidence to reject the null hypothesis. This threshold is often 0.05. Therefore, a small p-value suggests that the observed data, or sample, is inconsistent with the null hypothesis; this leads us to reject the null hypothesis in favor of the alternative hypothesis. A large p-value suggests that the observed data is consistent with the null hypothesis, and we will not reject it.

But note: there is always a risk of making an error. A small p-value does not prove that the alternative hypothesis is true; it only says that it is unlikely to get such a result, or a more extreme one, when the null hypothesis is true — and again, if the null hypothesis is true, there is no difference in the population. The other way around, a large p-value does not prove that the null hypothesis is true; it only says that it is likely to get such a result, or a more extreme one, when the null hypothesis is true.
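To make the decision rule concrete, here is a minimal sketch in Python using entirely made-up blood-pressure changes; the groups, values, and choice of an independent samples t-test are illustrative assumptions, not the study from the video.

```python
from scipy import stats

# Hypothetical change in blood pressure (mmHg) for a drug group and a control group
drug    = [-9, -12, -7, -10, -8, -11, -6, -9]
control = [-2,  -4,  0,  -3, -1,  -2, -5, -1]

# Tests H0: both groups have the same mean change
t_stat, p_value = stats.ttest_ind(drug, control)

alpha = 0.05  # the predetermined significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: do not reject the null hypothesis")
```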
So there are two types of errors, which are called type I and type II errors. Let's start with the type I error. In hypothesis testing, a type I error occurs when a true null hypothesis is rejected: in reality the null hypothesis is true, but we make the decision to reject it. In our example it means that the drug actually had no effect — in reality there is no difference in blood pressure whether the drug is taken or not — but our sample happened to be so far off the true value that we mistakenly thought the drug was working. A type II error occurs when a false null hypothesis is not rejected: in reality the null hypothesis is false, but we make the decision not to reject it. In our example this means the drug actually did work — there is a difference between those who have taken the drug and those who have not — but it was just a coincidence that the sample we took did not show much of a difference, and we mistakenly thought the drug was not working.

And now I'll show you how DATAtab helps you to find a suitable hypothesis test and, of course, calculates it and interprets the results for you. Let's go to datatab.net and copy your own data in here; we will just use this example data set. After copying your data into the table, the variables appear down here. DATAtab automatically tries to determine the correct level of measurement, but you can also change it up here. Now we just click on "hypothesis testing" and select the variables we want to use for the calculation of a hypothesis test. DATAtab will then suggest a suitable test — for example, in this case a chi-square test, or in that case an analysis of variance. Then you will see the hypotheses and the results. If you're not sure how to interpret the results, click on "summary in words". Further, you can check the assumptions and decide whether you want to calculate a parametric or a non-parametric test.

Now we know the differences between descriptive and inferential statistics. Our next step is to take a closer look at inferential statistics and at choosing the appropriate hypothesis test. Which hypothesis test you can use depends on the level of measurement of your data. There are four levels of measurement — nominal, ordinal, interval, and ratio — and here is an easy explanation for you. In this video we are going to explore the four levels of measurement: nominal, ordinal, interval, and ratio. Each level gives us important information about the variable and supports different types of statistical analysis. By the end of this video you will know what the levels of measurement are and, especially, you will understand why you need these levels. So whether you are analyzing survey data, optimizing business operations, or studying for a statistics exam — stay tuned.
What are levels of measurement? Levels of measurement refer to the different ways that variables can be quantified or categorized. If you have a data set, then every variable in the data set corresponds to one of the four primary levels of measurement: nominal, ordinal, interval, and ratio. In practice, interval and ratio data are often used to perform the same analyses; therefore the term "metric level" is used to combine these two levels.

Why do you need levels of measurement? The level of measurement is crucial in statistics for several key reasons: it tells us how our data can be collected, analyzed, and interpreted. Here's why understanding these levels is so important. Different levels of measurement support different statistical analyses. For instance, the mean and the standard deviation are suitable for metric data; in some cases they may be suitable for ordinal data, but only if you know how to interpret the results correctly, and it definitely makes no sense to calculate them for nominal data. The level of measurement also tells us which hypothesis tests are possible and determines the most effective type of data visualization — for example, bar charts are great for nominal data, while histograms are better suited for metric data. So each level provides different information and supports different types of statistical analysis. But attention: the level of measurement is mainly relevant at the end of the research process; however, the type of data to be collected, and in what form, is determined at the beginning. Therefore it is crucial to consider the level of measurement of the data from the start, to ensure that the desired tests can be conducted at the end.

So let's take a closer look at each level of measurement. What characterizes nominal variables? This is the most basic level of measurement: nominal data can be categorized, but it is not possible to rank the categories in a meaningful way. Examples of nominal variables are gender with the categories male and female, type of animal with, for example, the categories dog, cat, and bird, or the preferred newspaper. In all these cases you can tell whether one value corresponds to another — you can distinguish the values — but it is not possible to put the categories in a meaningful order. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Both variables are nominal, so when we create a questionnaire we simply list the possible answers for both variables; since there is no meaningful order for nominal variables, it usually does not matter in which order the categories are listed in the questionnaire. Then we can display the collected data in a table where each row is a person with the respective answers, and we can use our data to create frequency tables or bar charts.

But what about the ordinal level of measurement? Ordinal data can be categorized and, in comparison with nominal data, it is possible to rank the categories meaningfully — but the differences between ranks do not have a mathematical meaning. This means the intervals between the data points are not necessarily equal. Examples of ordinal variables are all kinds of rankings (first, second, third), satisfaction ratings (very unsatisfied, unsatisfied, neutral, satisfied, very satisfied), and levels of education (high school, bachelor's, master's). In a questionnaire you could ask "How satisfied are you with your current job?" with these five possible options: the answers can be categorized and there is a logical order, which is why the variable "satisfaction with the job" is an ordinal variable.

What about metric variables? Metric variables are the highest level of measurement. Metric data is like ordinal data, but the intervals between the values are equally spaced, which means that differences and sums can be formed meaningfully. Examples of metric variables are income, weight, age, and electricity consumption. If you ask for a metric variable in a questionnaire, there is usually just an input field in which the person directly enters the value — for example, age or body weight.

Let's look at what we've learned so far using an example. Imagine you're conducting a survey in a school to understand how pupils get to school. Here are questions you might ask, each corresponding to a different level of measurement. The first question could be: "What mode of transportation do you use to get to school — bus, car, bicycle, walk?" This is of course a nominal variable: the answers can be categorized, but there is no meaningful order; bus is not higher than bicycle, walk is not higher than car, and so on. If you want to analyze the results of this question, you can count how many students use each mode of transportation and present it in a bar chart. Further, you can ask: "How satisfied are you with your current mode of transportation?" Choices might include very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. This is of course an ordinal variable: you can rank the responses to see which mode of transportation ranks higher in satisfaction, but the exact difference between, for example, satisfied and very satisfied isn't quantifiable. And the last question: "How many minutes does it take you to get to school?" Minutes to get to school is a metric variable: here you can calculate the average time to get to school and use all standard statistical measures. We can visualize this data with a histogram showing the distribution of travel times and compare the different transportation modes. So, using nominal data we can categorize and count responses but cannot infer any order; ordinal data allows us to rank responses but not to measure precise differences between ranks; metric data enables us to measure exact differences between data points.
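As a small illustration of how the level of measurement steers the analysis, here is a pandas sketch with a made-up version of the school survey; the column names, codes, and values are assumptions for illustration only.

```python
import pandas as pd

# A tiny, made-up school survey (illustrative only)
survey = pd.DataFrame({
    "transport":    ["bus", "car", "bus", "bicycle", "walk", "bus"],   # nominal
    "satisfaction": [4, 2, 5, 3, 4, 1],    # ordinal: 1 = very unsatisfied ... 5 = very satisfied
    "minutes":      [25, 10, 30, 15, 20, 35],                          # metric
})

print(survey["transport"].value_counts())  # nominal: only counting the categories makes sense
print(survey["satisfaction"].median())     # ordinal: ranking/median is meaningful, a mean is debatable
print(survey["minutes"].mean())            # metric: mean, standard deviation etc. are all meaningful
print(survey["minutes"].std())
```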
As already mentioned, the metric level of measurement can be further subdivided into the interval scale and the ratio scale. But what is the difference between interval and ratio level? Let's look at an example. In a marathon, of course, the time of the marathon runners is measured. Let's say the first one took 2 hours and the last one finished the marathon in 6 hours. Here we can say that the fastest runner was three times as fast as the slowest — or, to put it the other way around, the slowest one took three times as long as the fastest one. This is possible because there is a true zero point at the beginning of the marathon, where all runners start from zero. In this case we have a ratio level of measurement.

If, however, someone forgets to start the stopwatch at the beginning of the race and only the differences are measured, starting from the fastest runner, we don't have this true zero, and the runners cannot be put in proportion. In this case we can say how big the interval between the runners is — for example, the fastest runner was 4 hours faster than the slowest runner — but we cannot say that the fastest runner was three times as fast as the slowest, because we don't know the absolute values for both runners. We still have equal intervals: we can say things like "runner B finished one hour after the fastest runner" and "runner C finished 1 hour and 45 minutes after the fastest runner". The time differences are measurable and meaningful, but since there is no true zero point, we cannot say that the fastest runner was x times as fast as the slowest runner; we only know how much later the other runners finished relative to the fastest runner, not their total running times. In this case we have an interval level of measurement. In summary: while both interval and ratio scales have equal intervals and support operations like addition and subtraction, ratio scales have a true zero point — zero represents the absence of the quantity being measured — and this allows meaningful multiplication and division.

And now a little exercise to check whether everything is clear to you. First we have the US state, which is a nominal level of measurement: the data is used for labeling or naming categories without any quantitative value; the states are names with no inherent order or ranking. Next we have product ratings on a scale from 1 to 5; this is an example of ordinal data: the numbers do have an order or rank (five is better than one), but the intervals between the ratings are not necessarily equal. Moving on to religious confession: like the states, this is also nominal; the categories, the different religions, are for categorization and do not imply any order. Next we have CO2 emissions per year, which is measured on a metric ratio scale: this level allows the full range of mathematical operations, including meaningful ratios — zero emissions means no emissions at all. Then we have telephone numbers: although telephone numbers are numeric, they are categorized as nominal; they are just identifiers with no numerical value for analysis. The care level of patients is another ordinal example: this might include levels such as low, medium, and high care, which indicate an order but not the exact difference between these levels. Living space in square meters is measured on a ratio scale: like CO2 emissions, zero square meters means there is no living space, and comparisons like double or half are meaningful. Lastly, we have job satisfaction on a scale from 1 to 4; this is ordinal data: it ranks satisfaction levels, but the difference between each level isn't quantified.
Now we know what the level of measurement is, and we can go through the hypothesis tests that are most popular and discuss when to use which test. Let's start with the video on the most common hypothesis test: the t-test. This video is about everything you need to know about the t-test. After this video you will know what a t-test is and when you use it, what types of t-tests there are, what the hypotheses and the assumptions are, how a t-test is calculated, and how you interpret the results.

Let's start with the first question: what is a t-test? The t-test is a statistical test procedure. Hm, and what does the t-test do? The t-test analyzes whether there is a significant difference between the means of two groups. For example, the two groups may be patients who received either drug A or drug B, and we would like to know whether there is a difference in blood pressure between these two groups. Now, there are three different types of t-tests: the one-sample t-test, the independent samples t-test, and the paired samples t-test.

When do we use a one-sample t-test? We use the one-sample t-test when we want to compare the mean of a sample with a known reference mean. Example: a chocolate bar manufacturer claims that its chocolate bars weigh an average of 50 g. To check this, we take a sample of 30 bars and weigh them; the mean value of this sample is 48 g. Now we can use a one-sample t-test to check whether the mean of 48 g is significantly different from the claimed 50 g.

When do we use the independent samples t-test? We use the t-test for independent samples when we want to compare the means of two independent groups or samples; we want to know whether there is a significant difference between these means. Example: we would like to compare the effectiveness of two painkillers. We randomly divide 60 people into two groups; the first group receives drug A and the second group receives drug B. Using an independent t-test, we can now test whether there is a significant difference in pain relief between the two drugs.

And when do we use the paired samples t-test? We use the paired samples t-test to compare the means of two dependent groups. Example: we want to know how effective a diet is. To do this, we weigh 30 people before the diet and then weigh exactly the same people after the diet. Now we can look at the difference in weight between before and after for each subject, and we can use a paired samples t-test to test whether there is a significant difference. In a paired sample, the measurements are available in pairs; the pairs result, for example, from repeated measurements with the same people. Independent samples are made up of people and measurements that are independent of each other.

Here's an interesting note: the paired samples t-test is very similar to the one-sample t-test. We can also think of the paired samples t-test as having one sample that was measured at two different times. We then calculate the difference between the paired values, giving us values for one single sample — the difference is once −5, once +2, once −1, and so on. Now we want to test whether the mean value of the differences just calculated deviates from a reference value, in this case zero, and this is exactly what the one-sample t-test does.

What are the assumptions for a t-test? Of course, we first need a suitable sample: in the one-sample t-test we need a sample and a reference value, in the independent t-test we need two independent samples, and in the case of a paired t-test a paired sample. The variable for which we want to test whether there is a difference between the means must be metric; examples of metric variables are age, body weight, and income, while for example a person's level of education is not a metric variable. In addition, the metric variable must be normally distributed in all three test variants; to learn how to test whether your data is normally distributed, watch my video on testing for normal distribution. In the case of an independent t-test, the variances in the two groups must also be approximately equal; you can check whether the variances are equal using Levene's test — for more information, watch my video on Levene's test.

So what are the hypotheses of the t-test? Let's start with the one-sample t-test. In the one-sample t-test, the null hypothesis is that the sample mean is equal to the given reference value (so there is no difference), and the alternative hypothesis is that the sample mean is not equal to the given reference value. What about the independent samples t-test? In the independent t-test, the null hypothesis is that the mean values in both groups are the same (so there is no difference between the two groups), and the alternative hypothesis is that the mean values in both groups are not equal (so there is a difference between the two groups). And finally, the paired samples t-test: in a paired t-test, the null hypothesis is that the mean of the differences between the pairs is zero, and the alternative hypothesis is that the mean of the differences between the pairs is not zero. So now we know what the hypotheses are.
Before we look at how the t-test is calculated, let us look at an example of why we actually need a t-test. Let's say we want to know whether there is a difference in the length of study for a bachelor's degree between men and women in Germany. Our population is therefore made up of all graduates of a bachelor's degree who have studied in Germany. However, as we cannot survey all bachelor graduates, we draw a sample that is as representative as possible. We now use the t-test to test the null hypothesis that there is no difference in the population. Even if there is no difference in the population, we will certainly still see a difference in study duration in the sample — it would be very unlikely that we drew a sample where the difference is exactly zero. In simple terms, we now want to know at what difference, measured in a sample, we can say that the duration of study of men and women is significantly different, and this is exactly what the t-test answers.

But how do we calculate a t-test? To do this, we first calculate the t value. To calculate the t value we need two things: the difference between the means, and the standard deviation of the mean, which is also known as the standard error. In the one-sample t-test, we calculate the difference between the sample mean and the known reference mean; s is the standard deviation of the collected data and n is the number of cases, and s divided by the square root of n is the standard deviation of the mean, i.e., the standard error. The t value is therefore t = (x̄ − μ₀) / (s / √n). In the independent samples t-test, we simply calculate the difference between the two sample means; to calculate the standard error we need the standard deviation and the number of cases from the first and the second sample. Depending on whether we can assume equal or unequal variances for our data, there are different formulas for the standard error — you can read more about this in our tutorial on datatab.net. In a paired samples t-test, we only need to calculate the differences between the paired values and calculate the mean from them; the standard error is then the same as for a one-sample t-test.

So what have we learned so far about the t value? No matter which t-test we calculate, the t value will be greater if we have a greater difference between the means, and it will be smaller if the difference between the means is smaller. Further, the t value becomes smaller when we have a larger dispersion of the mean — so the more scattered the data, the less meaningful a given mean difference is.
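Here is a minimal one-sample sketch in Python that mirrors this logic, using hypothetical chocolate-bar weights (the eight values are made up; the video uses a sample of 30 bars). It computes t by hand as the mean difference divided by the standard error and checks the result against SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical weights in grams (illustrative values only)
weights = np.array([47.9, 48.3, 47.5, 48.8, 47.2, 48.1, 47.6, 48.6])
mu_0 = 50.0                                    # reference mean claimed by the manufacturer

x_bar = weights.mean()
s = weights.std(ddof=1)                        # sample standard deviation (1/(n-1) version)
se = s / np.sqrt(len(weights))                 # standard error = standard deviation of the mean
t_by_hand = (x_bar - mu_0) / se                # t = (mean difference) / (standard error)

t_scipy, p = stats.ttest_1samp(weights, mu_0)  # SciPy gives the same t plus the two-tailed p-value
print(t_by_hand, t_scipy, p)
```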
Now we want to use the t-test to see whether we can reject the null hypothesis or not. To do this, we can now use the t value in two ways: either we read the critical t value from a table, or we simply calculate the p-value from the t value. We'll go through both in a moment, but what is the p-value? A t-test always tests the null hypothesis that there is no difference. So first we assume that there is no difference in the population. When we draw a sample, this sample deviates from the null hypothesis by a certain amount. The p-value tells us how likely it is to draw a sample that deviates from the null hypothesis by the same amount as, or more than, the sample we drew. Thus, the more the sample deviates from the null hypothesis, the smaller the p-value becomes. If this probability is very small, we can of course ask whether the null hypothesis holds for the population — perhaps there is a difference. But at what point can we reject the null hypothesis? This border is called the significance level, which is usually set at 5%. So if there is only a 5% chance that we draw such a sample, or one that differs even more, then we have enough evidence to reject the null hypothesis — put simply, we then assume that there is a difference, i.e., that the alternative hypothesis is true.

Now that we know what the p-value is, we can finally look at how the t value is used to determine whether or not the null hypothesis is rejected. Let's start with the path through the critical t value, which you can read from a table; we can find a table of critical t values on datatab.net under "Tutorials" and "t-distribution". Let's start with the two-tailed case — we'll briefly look at the one-tailed case at the end of this video. Here below we see the table. First we need to decide what level of significance we want to use; let's choose a significance level of 0.05, or 5%. Then we look in the column for 1 − 0.05, which is 0.95. Now we need the degrees of freedom. In the one-sample t-test and the paired samples t-test, the degrees of freedom are simply the number of cases minus one — so if we have a sample of 10 people, there are 9 degrees of freedom. In the independent samples t-test, we add the number of people from both samples and subtract two, because we have two samples. Note that the degrees of freedom can be determined in a different way depending on whether we assume equal or unequal variances. So if we have a 5% significance level and 9 degrees of freedom, we get a critical t value of 2.262.

Now, on the one hand we've calculated a t value with the t-test, and on the other hand we have the critical t value. If our calculated t value is greater than the critical t value, we reject the null hypothesis. For example, suppose we calculate a t value of 2.5: this value is greater than 2.262, and therefore the two means are so different that we can reject the null hypothesis. On the other hand, we can also calculate the p-value for the t value we've calculated: if we enter 2.5 for the t value and 9 for the degrees of freedom, we get a p-value of 0.034. The p-value is less than 0.05, and we therefore reject the null hypothesis. As a check, if we enter the t value of 2.262, we get exactly a p-value of 0.05 — which is exactly the limit.
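Both paths — the critical value and the p-value — can be reproduced with SciPy's t-distribution. This short sketch assumes the same two-tailed setting, 9 degrees of freedom, and calculated t value of 2.5 used above.

```python
from scipy import stats

df = 9        # degrees of freedom (e.g. n = 10 in a one-sample or paired t-test)
alpha = 0.05

# Critical t value for a two-tailed test: the 97.5% quantile of the t-distribution
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))                       # 2.262, the value read from the table

# Two-tailed p-value for a calculated t value of 2.5
p = 2 * stats.t.sf(2.5, df)
print(round(p, 3))                            # about 0.034 < 0.05 -> reject the null hypothesis

print(round(2 * stats.t.sf(t_crit, df), 3))   # the critical value itself gives exactly p = 0.05
```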
If you want to calculate a t-test with DATAtab, you just need to copy your own data into this table, click on "hypothesis testing", and then select the variables of interest. For example, if you want to test whether gender has an effect on income, you simply click on the two variables and automatically get a t-test for independent samples. Here below you can read the p-value, and if you're still unsure about the interpretation of the results, you can simply click on "interpretation in words": a two-tailed t-test for independent samples (equal variances assumed) showed that the difference between female and male with respect to the dependent variable salary was not statistically significant; thus the null hypothesis is retained.

The final question now is: what is the difference between a directed and an undirected hypothesis? In the undirected case, the alternative hypothesis is simply that there is a difference — for example, there is a difference between the salaries of men and women in Germany; we don't care who earns more, we just want to know whether there is a difference or not. In a directed hypothesis, we are also interested in the direction of the difference: for example, the alternative hypothesis might be that men earn more than women, or that women earn more than men. If we look at the t-distribution graphically, we can see that in the two-sided case we have a rejection range on the left and a rejection range on the right; we reject the null hypothesis if we land in either of them. With a 5% significance level, both ranges together have a probability of 5%, i.e., 2.5% each. If we do a one-tailed t-test, the null hypothesis is rejected only if we are in one of these ranges — which one depends on the direction we want to test — and with a 5% significance level, all 5% fall within this one range.

We've seen how the t-test is a powerful tool for comparing the means of two groups to determine whether they differ significantly. But what if we want to extend our analysis to more than two groups? This is where analysis of variance, or ANOVA, comes into play.
So let's get started with ANOVA. This video is about analysis of variance, or ANOVA for short. We discuss what an ANOVA is and why you need it, we look at the hypotheses and the assumptions for an ANOVA, I will show you how to calculate an ANOVA and what the equations behind it are, and finally we will discuss how to interpret the results and take a look at the post hoc test.

Let's get started. First of all, there are different types of analysis of variance; the simplest and most common one is the one-way ANOVA, and that's exactly what this video is about. What is a one-way ANOVA? The one-way ANOVA is a hypothesis test. Okay, and what does the one-way ANOVA test? An ANOVA tests whether there are statistically significant differences between three or more groups — more precisely, it tests whether there is a significant difference between the mean values of the groups. An example: you want to investigate whether there is a difference between fertilizers A, B, and C in terms of plant growth. The analysis of variance now helps you to find out whether there is a significant difference in plant growth between the three groups. But doesn't the t-test do something similar? That's true: the t-test tests whether there is a difference between two groups. The analysis of variance is an extension of the t-test, and it tests whether there is a difference between more than two groups. So if you only had two fertilizers, A and B, that you wanted to compare, you would use a t-test; if you have three or more different fertilizers, you would use an ANOVA.

Okay, that makes sense. What are the hypotheses tested in an ANOVA? In an ANOVA, the null hypothesis is that the mean values of the groups are equal; mathematically this can be expressed as μ₁ = μ₂ = … = μₖ, where μ₁ to μₖ represent the mean values of the different groups. The alternative hypothesis is that the mean values of the groups are not all equal; mathematically: at least one μᵢ is different, where i stands for one of the groups under consideration.

What are the assumptions of an ANOVA, and how is it calculated? Let's look at this with the help of an example. Let's say you are a researcher and you want to find out whether three different drugs have a different effect on blood pressure. So your research question is: do the three drugs have a different effect on blood pressure? The null hypothesis is that there is no difference between the three drugs in terms of blood pressure, and the alternative hypothesis is that there is a difference between the three drugs in terms of blood pressure. In order to calculate an ANOVA, we of course need data. How do we get the data? We obtain data by taking a sample from the population we need to analyze — for example, all people suffering from high blood pressure. Let's say we took a sample of 24 test subjects; we can then divide these 24 test subjects into three groups, with each group receiving a different drug. Okay, but now we still don't have any data — we only have groups for now. That's right: in order to be able to calculate an ANOVA, we have to measure something for each subject, in our case the blood pressure. We measure the blood pressure for each test person once at the beginning and once at the end of the treatment; the difference between the two values tells us how the blood pressure has changed while taking the medication.

We have now collected data, but what about the assumptions? We still have to check them, don't we? Exactly. To calculate an ANOVA, these assumptions must be met. Number one: level of measurement. The levels of measurement of your variables must be appropriate for an ANOVA: the independent variable should be nominal, such as the type of medication or the type of fertilizer applied, and the dependent variable must be metric, like blood pressure or plant growth. To summarize, we need a variable that defines the different groups, such as the different medications, and a variable that reflects what is measured in the different groups, for example blood pressure. Now to the second assumption: independence. The measurements should be independent, i.e., the measured value of one group should not be influenced by the measured value of another group. The third assumption is that of normal distribution: the data within the groups should be normally distributed. This assumption becomes less important as the sample size increases, but it should still be evaluated, especially for small samples. If this assumption is violated, the Kruskal-Wallis test can be used as an alternative; if you would like to know how to test data for normal distribution, please watch my video on testing for normal distribution. And thus to the fourth assumption: the variances in each group should be roughly the same — in other words, it should not be the case that we have a very large variance in one group compared to a very small variance in another group. This assumption can be tested with Levene's test; you can find more information in my video on Levene's test. If this assumption is not met, the Welch ANOVA can be used as an alternative.
Okay, but how does an ANOVA work? We want to know whether there is a difference between the groups. To test whether there is a difference, the ANOVA uses the ratio of the variance between the groups to the variance within the groups. What is the between-group variance and what is the within-group variance? The variance between the groups measures how much the mean values of the groups differ from each other — in our case, the deviation of the mean values of groups A, B, and C; in one case we would have a big variance, in another case a small one. What is the variance within the groups? The within-group variance measures how much the individual data points in each group fluctuate — in one case we would have a large variance within the groups, in another case a small variance. The ratio of the variance between the groups to the variance within the groups is now the so-called F value, or F statistic. If the variance between the groups is small and the variance within the groups is large, we get a small F value; in this case it is likely that these deviations, or even larger ones, occur purely by chance, even if in reality there is no difference between the groups. However, if we have a small spread within the groups and a large spread between the groups, then the F value is large; it is then very unlikely that such a difference, or an even more extreme difference, would arise purely by chance — or, to be more precise, it is very unlikely that we would obtain such an F value or an even larger one. Therefore, to assess whether the differences between the groups are statistically significant, we use the F value. However, this F value alone does not give us a definite answer; to assess the significance of the results more precisely, we need the p-value.
So let's calculate the p-value. Let's say this is our data: we have the values from the first, second, and third group. Now we start and calculate the variance between the groups and the variance within the groups. First of all, we need the total mean and the mean values of the individual groups. We obtain the total mean by adding up all the values and dividing by the number of values; we have a total of 24 values, and this gives us a total mean of 5.3. We get the mean values of the groups by simply adding up all the values of the respective group and dividing by the number of values per group.

Now we've calculated all the required mean values, and we can calculate the so-called sum of squares between the groups. To do this, we calculate the difference between each group mean and the total mean, square it, weight it with nᵢ — the number of values in the i-th group — and add up the results. With group means of 2.5, 8, and 5.5 and eight values per group, we get 8 · (2.5 − 5.3)² + 8 · (8 − 5.3)² + 8 · (5.5 − 5.3)²; the sum of squares between the groups is therefore about 121.3.

Next we calculate the sum of squares within the groups. To calculate this, we subtract the mean of each group from each individual value within that group — so we subtract the mean of the first group from each of its values, the mean of the second group from each of its values, and the mean of the third group from each of its values — then we square the differences obtained and add everything up. For example, in group one we subtract the group average of 2.5 from the first value, from the second value, and so on. In total we get a sum of squares within the groups of 36.

So now we've almost made it: we've calculated the total mean and the group means, and we have calculated the sum of squares between the groups and within the groups. Now we can calculate the variances within and between the groups. In most cases, however, the term "variance" is not used here but "mean squares", which we will now also adopt. We obtain the mean square between the groups by dividing the sum of squares between the groups by the number of degrees of freedom between the groups. We've just calculated the sum of squares between the groups, and the degrees of freedom result from the number of groups minus one — in our case 3 − 1 = 2 — so we get a mean square between the groups of 60.67. Now we need the mean square within the groups: this is obtained by dividing the calculated sum of squares within the groups by the degrees of freedom within the groups. These degrees of freedom result from the total number of values, in our case 24, minus the number of groups, in our case three, so we get 21; the mean square within the groups is therefore 36 / 21 ≈ 1.71.

Now we can calculate the F value. To do this, we divide the mean square between the groups by the mean square within the groups — or, if we take the expression from the beginning, the variance between the groups by the variance within the groups — and we get an F value of 35.39. But what about the p-value? We can calculate the p-value with an F distribution. To do this, we simply go to datatab.net: we need the degrees of freedom between the groups, in our case two, the degrees of freedom within the groups, in our case 21, and the F value we calculated, i.e., 35.39. We get a p-value that is smaller than 0.01. So if we use the usual significance level of 5%, we get a significant difference, and we reject the null hypothesis that there is no difference between the three groups.
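The last step — turning the F value into a p-value — can be checked with SciPy's F-distribution. This sketch simply reuses the mean squares from the worked example above.

```python
from scipy import stats

# Reproduce the last step of the worked ANOVA example
ms_between = 60.67       # mean square between the groups (SS_between / 2)
ms_within = 36 / 21      # mean square within the groups  (SS_within / 21)

F = ms_between / ms_within
p = stats.f.sf(F, dfn=2, dfd=21)   # survival function of the F-distribution = p-value

print(round(F, 2), p)    # F is about 35.4 and p is far below 0.01 -> reject the null hypothesis
```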
between the three groups of course you can simply calculate an anova online
can simply calculate an anova online with data tab just copy your data into
with data tab just copy your data into this table and click on Hy hypothesis
this table and click on Hy hypothesis test then simply select your desired
test then simply select your desired variables for example blood pressure and
variables for example blood pressure and medication and an NOA will now be
medication and an NOA will now be calculated automatically if you don't
calculated automatically if you don't know exactly how to interpret the
know exactly how to interpret the results simply click on summary inverts
results simply click on summary inverts or you can simply click on AI
or you can simply click on AI interpretation for most tables in this
interpretation for most tables in this table we now get exactly the values that
table we now get exactly the values that we just calculated by hand and we can
we just calculated by hand and we can also see here that we get a significant
also see here that we get a significant result as I said you can simply click on
result as I said you can simply click on AI interpretation to help you understand
AI interpretation to help you understand the results now there's one more topic
the results now there's one more topic to cover the post talk test a post talk
to cover the post talk test a post talk test becomes necessary when an anova
test becomes necessary when an anova indicates a significant difference
indicates a significant difference between groups while a NOA can confirm
between groups while a NOA can confirm that differences exist it doesn't
that differences exist it doesn't specify which groups are different from
specify which groups are different from each other a post talk test helps to
each other a post talk test helps to pinpoint exactly which groups differ
pinpoint exactly which groups differ from each other it performs a pairwise
from each other it performs a pairwise comparison between the groups indicating
comparison between the groups indicating for example whether group a differs from
for example whether group a differs from group b or group C in our case the post
group b or group C in our case the post talk test reveals that all groups differ
talk test reveals that all groups differ significantly from one another however
significantly from one another however this isn't always the outcome it's
this isn't always the outcome it's possible that only group one differs
possible that only group one differs from group two or that no individual
from group two or that no individual groups shows a significant difference
groups shows a significant difference even if the unov result is significant
even if the unov result is significant overall but what if you have not just
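The video does not name a specific post-hoc procedure at this point, so as one common choice here is a sketch of Tukey's HSD test with statsmodels; the column names and numbers are illustrative placeholders, not the data from the video.

```python
# A sketch of Tukey's HSD post-hoc test with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "medication": np.repeat(["A", "B", "C"], 8),      # three groups, 8 patients each
    "blood_pressure": np.concatenate([
        rng.normal(120, 5, 8),                        # group A (illustrative values)
        rng.normal(125, 5, 8),                        # group B
        rng.normal(133, 5, 8),                        # group C
    ]),
})

# Pairwise comparison of all group means at a 5% family-wise significance level.
tukey = pairwise_tukeyhsd(endog=df["blood_pressure"], groups=df["medication"], alpha=0.05)
print(tukey.summary())
```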
But what if you have not just one factor but two? In this case you will need to perform a two-way ANOVA, so let's explore the two-way ANOVA. What is a two-way ANOVA? A two-way ANOVA is a statistical method used to test the effect of two categorical variables on a continuous variable. The categorical variables are the independent variables, for example the variable drug type with drugs A and B, and gender with female and male, and the continuous variable is the dependent variable, for example the reduction in blood pressure. So the two-way ANOVA is the extension of the one-way ANOVA: while a one-way ANOVA tests the effect of a single independent variable on a dependent variable, a two-way ANOVA tests the effects of two independent variables. The independent variables are called factors.

But what is a factor? A factor is, for example, the gender of a person with the levels male and female, the type of therapy with therapies A, B and C, or the field of study with medicine, business administration, psychology and mathematics. In an analysis of variance, a factor is therefore a nominal variable. We use an ANOVA whenever we want to test whether these levels have an influence on the so-called dependent variable. You might want to test whether gender has an effect on salary, whether therapy has an effect on blood pressure, or whether the field of study has an effect on the length of study. Salary, blood pressure and length of study would then be the dependent variables. In each of these cases you test whether the factor has an effect on the dependent variable, and since you only have one factor in these cases, you would use a one-way ANOVA. Okay, you're right: in the first case we have a variable with only two categories, so of course we would use the independent samples t-test.

But when do we use a two-way ANOVA? We use a two-factor analysis of variance when we have a second factor and we want to know whether this factor also has an effect on the dependent variable. We would, for example, also like to know whether, in addition to gender, the highest level of education has an impact on salary; or we would like to include gender in addition to the type of therapy; or, in the third case, we would also like to know whether the university attended, in addition to the field of study, has an influence on the length of study. Now we don't have one factor in each of the three cases but two factors, and since we now have two factors, we use a two-way analysis of variance.

So in a one-way ANOVA we have one factor from which we create the groups. If the factor we're looking at has three levels, for example three different types of drug, we will have three groups to compare. In the case of a two-way analysis of variance, the groups result from the combination of the levels of the two factors: if we have one factor with three levels and one with two levels, we have a total of six groups to compare.

But what kind of statements can we make with a two-way ANOVA? With the help of a two-way ANOVA we can answer three things: whether the first factor has an effect on the dependent variable, whether the second factor has an effect on the dependent variable, and whether there is an interaction effect between the two factors.

But what about the hypotheses? In a two-way ANOVA there are three null hypotheses and therefore also three alternative hypotheses. The first null hypothesis is that there is no significant difference between the groups of the first factor, and the alternative hypothesis is that there is a significant difference between the groups of the first factor. The second null hypothesis is that there is no significant difference between the groups of the second factor, and the alternative hypothesis is that there is a significant difference between the groups of the second factor. The third null hypothesis reflects the interaction effect: one factor has no influence on the effect of the other factor, and the alternative hypothesis is that at least one factor has an influence on the effect of the other factor.

And what about the assumptions? For the test results to be valid, several assumptions must be met. Number one, normality: the data within the groups should be normally distributed, or alternatively the residuals should be normally distributed; this can be checked with a quantile-quantile plot. Number two, homogeneity of variances: the variance of the data in the groups should be equal; this can be checked with the Levene test. Number three, independence: the measurements should be independent, i.e. the measured value of one group should not be influenced by the measured value of another group. Number four, measurement level: the dependent variable should have a metric scale level.
a metric scale level but how to calculate a two-way Anova let's look at
calculate a two-way Anova let's look at the example from the beginning we would
the example from the beginning we would like to know if Dr type and gender have
like to know if Dr type and gender have an influence on the reduction in blood
an influence on the reduction in blood pressure Dr type has the two levels drug
pressure Dr type has the two levels drug A and B and gender has the two levels
A and B and gender has the two levels male and female to answer the question
male and female to answer the question we collect data we randomly assigned
we collect data we randomly assigned patients to the treatment combinations
patients to the treatment combinations and measured the reduction in blood
and measured the reduction in blood pressure after a month for example the
pressure after a month for example the first patient receives track a is male
first patient receives track a is male and after 1 month a reduction in blood
and after 1 month a reduction in blood pressure of six was measured now let us
pressure of six was measured now let us answer the questions is there a main
answer the questions is there a main effect of drag type on the reduction in
effect of drag type on the reduction in blood pressure is there a main effect of
blood pressure is there a main effect of gender on the reduction in blood
gender on the reduction in blood pressure and is there an interaction
pressure and is there an interaction effect between drag type and gender on
effect between drag type and gender on the reduction in blood pressure for the
the reduction in blood pressure for the calculation we can use either a
calculation we can use either a statistical software like data tab or do
statistical software like data tab or do it by hand first I will show you how to
it by hand first I will show you how to calculate it with data Tab and how to
calculate it with data Tab and how to interpret the results at the end I will
interpret the results at the end I will show you how to calculate the Anova by
show you how to calculate the Anova by hand and go through all the equations to
hand and go through all the equations to calculate a two-way Anova online simply
calculate a two-way Anova online simply visit data.net and copy your data into
visit data.net and copy your data into this table then click on hypothesis test
this table then click on hypothesis test under this tab you will find a lot of
under this tab you will find a lot of hypothesis tests and depending on which
hypothesis tests and depending on which variable able you select you will get an
variable able you select you will get an appropriate hypothesis test suggested we
appropriate hypothesis test suggested we want to know if drag type and gender
want to know if drag type and gender have an influence on the reduction in
have an influence on the reduction in blood pressure so let's just click on
blood pressure so let's just click on all three variables data now
all three variables data now automatically gives us a two-way Anova
automatically gives us a two-way Anova we can read the three null and the three
we can read the three null and the three alternative hypotheses here afterwards
alternative hypotheses here afterwards we get the descriptive statistics and
we get the descriptive statistics and the LaVine test for equal variance with
the LaVine test for equal variance with the LaVine test we can check if the
the LaVine test we can check if the variances within the groups are equal
variances within the groups are equal the P value is greater than
the P value is greater than 0.05 so we assume equality of variance
0.05 so we assume equality of variance in the groups for this data and here we
in the groups for this data and here we see the results of the analysis of
see the results of the analysis of variance we'll look at these in more
variance we'll look at these in more detail in a moment but if you don't know
detail in a moment but if you don't know exactly how to interpret the results you
exactly how to interpret the results you can also just click on summary in words
can also just click on summary in words in addition you can check here if the
in addition you can check here if the requirements for the analysis of
requirements for the analysis of variance are met at all but now back to
variance are met at all but now back to the results let's take a closer look at
the results let's take a closer look at this table the first row tests the N
this table the first row tests the N hypothesis where drag type has an effect
hypothesis where drag type has an effect on the reduction in blood pressure the
on the reduction in blood pressure the second row tests whether gender has an
second row tests whether gender has an effect on the reduction in blood
effect on the reduction in blood pressure and the third row tests if the
pressure and the third row tests if the interaction has an effect you you can
interaction has an effect you you can read the P value in each case right at
read the P value in each case right at the back here let's say we set the
the back here let's say we set the significance level at 5% if our
significance level at 5% if our calculated P value is less than
calculated P value is less than 0.05 the null hypothesis is rejected and
0.05 the null hypothesis is rejected and if the calculated P value is greater
if the calculated P value is greater than
than 0.05 the N hypothesis is not rejected
0.05 the N hypothesis is not rejected thus we see that all three p values are
thus we see that all three p values are greater than
greater than 0.05 and therefore we cannot not reject
0.05 and therefore we cannot not reject any of the three null hypotheses
any of the three null hypotheses therefore neither the drug type nor
therefore neither the drug type nor gender have a significant effect on the
gender have a significant effect on the reduction in blood pressure and there's
reduction in blood pressure and there's also no significant interaction effect
also no significant interaction effect but what does an analysis of variance
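If you would rather reproduce this kind of table in Python than in DataTab, here is a minimal sketch with statsmodels; the column names and values are illustrative placeholders, not the example data set from the video.

```python
# A sketch of a two-way ANOVA table with statsmodels.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "drug":      ["A"] * 10 + ["B"] * 10,
    "gender":    (["male"] * 5 + ["female"] * 5) * 2,
    "reduction": [6, 5, 7, 5, 6, 7, 6, 5, 6, 6,      # illustrative values only
                  5, 6, 5, 5, 6, 4, 5, 4, 4, 5],
})

# Linear model with both main effects and their interaction, then the ANOVA
# table with sums of squares, F values and p-values (type II sums of squares).
model = smf.ols("reduction ~ C(drug) + C(gender) + C(drug):C(gender)", data=df).fit()
print(anova_lm(model, typ=2))
```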
but what does an analysis of variance actually do and why is the word variance
actually do and why is the word variance in analysis of variance in a two-way
in analysis of variance in a two-way analysis of variance the total variance
analysis of variance the total variance of the dependent variable is divided
of the dependent variable is divided into the variance that can be explained
into the variance that can be explained by factor a the variance that can be
by factor a the variance that can be explained by Factor B the variance of
explained by Factor B the variance of the interaction and the arrow variance
the interaction and the arrow variance actually SS is not the variance but the
actually SS is not the variance but the sum of squares we will discuss how to
sum of squares we will discuss how to calculate the variance in this case in a
calculate the variance in this case in a moment but how can I imagine that the
moment but how can I imagine that the dependent variable has some variance in
dependent variable has some variance in our example not every everyone will have
our example not every everyone will have the same reduction in blood pressure we
the same reduction in blood pressure we now want to know if we can explain some
now want to know if we can explain some of this variance by the variabl Str type
of this variance by the variabl Str type gender and their interaction the part
gender and their interaction the part that we cannot explain by these three
that we cannot explain by these three terms accumulates in the eror if the
terms accumulates in the eror if the result looked like this we would be able
result looked like this we would be able to explain almost all the variance by
to explain almost all the variance by factors A and B and their interaction
factors A and B and their interaction and we would only have a very small
and we would only have a very small proportion that could not be explained
proportion that could not be explained this means that we can make a very good
this means that we can make a very good statement about the reduction in blood
statement about the reduction in blood pressure by the variable drag type sex
pressure by the variable drag type sex and interaction in this case it would be
and interaction in this case it would be the other way around drug type gender
the other way around drug type gender and the interaction almost have no
and the interaction almost have no effect on the reduction in blood
effect on the reduction in blood pressure and it all adds up in the arrow
pressure and it all adds up in the arrow but how do we calculate the sum of
but how do we calculate the sum of squares the F value and the P value
squares the F value and the P value here we have our data one's drug type
here we have our data one's drug type with drug A and B and one's gender with
with drug A and B and one's gender with male and female so these individuals are
male and female so these individuals are for example all male and have been given
for example all male and have been given drug a first we calculate the mean
drug a first we calculate the mean values we need we calculate the mean
values we need we calculate the mean value of each group so male and drag a
value of each group so male and drag a that is 5.8 then male and Drug B that is
that is 5.8 then male and Drug B that is 5.4 and we do the same for female then
5.4 and we do the same for female then we calculate the mean value of all males
we calculate the mean value of all males and females and the mean value of drug A
and females and the mean value of drug A and B finally we need the total mean we
and B finally we need the total mean we can now start to calculate the sum of
can now start to calculate the sum of squares let's start with the total sum
squares let's start with the total sum of squares we do this by subtracting the
of squares we do this by subtracting the total mean from each individual value
total mean from each individual value squaring the result and adding up all
squaring the result and adding up all the values the total mean is 5. 4 so we
the values the total mean is 5. 4 so we calculate 6 - 5.4 2ar + 4 - 5.4 squared
calculate 6 - 5.4 2ar + 4 - 5.4 squared to finally 3 - 5.4 squared so we get a
to finally 3 - 5.4 squared so we get a sum of squares of
sum of squares of 84.8 the degrees of freedom are given by
84.8 the degrees of freedom are given by n * P * Q - 1 n is the number of people
n * P * Q - 1 n is the number of people in a group in our case five and P and Q
in a group in our case five and P and Q are the number of categories in each of
are the number of categories in each of the factors in both cases we have two
the factors in both cases we have two groups the total variance is calculated
groups the total variance is calculated by dividing the sum of squares by the
by dividing the sum of squares by the degrees of freedom so we get
degrees of freedom so we get 4.46 now we can calculate the sum of
4.46 now we can calculate the sum of squares between the groups for this we
squares between the groups for this we calculate the group mean minus the total
calculate the group mean minus the total mean so 5.8 - 5.4 sared + 5. 4 - 5.4
mean so 5.8 - 5.4 sared + 5. 4 - 5.4 sared and the same for these two values
sared and the same for these two values we get
we get 7.6 in this case the degrees of freedom
7.6 in this case the degrees of freedom are three which gives us a variance of
are three which gives us a variance of 2.53 now we can calculate the sum of
2.53 now we can calculate the sum of squares of factor a a dash is the mean
squares of factor a a dash is the mean value of the categories of factor a so
value of the categories of factor a so we calculate
we calculate 5.9 minus the total mean value and and
5.9 minus the total mean value and and 4.9 minus the total mean value this
4.9 minus the total mean value this results in five together with the
results in five together with the degrees of freedom we can now calculate
degrees of freedom we can now calculate the variance for factor a which is five
the variance for factor a which is five we do the same for Factor B in this case
we do the same for Factor B in this case we use the mean values of male and
we use the mean values of male and female and we get the variance of
female and we get the variance of 0.8 now we can calculate the sum of
0.8 now we can calculate the sum of squares for the interaction we obtain
squares for the interaction we obtain this by calculating the sum of squares
this by calculating the sum of squares minus the sum of squares of A and B the
minus the sum of squares of A and B the degrees of freedom result to one for the
degrees of freedom result to one for the interaction we get a variance of
interaction we get a variance of 1.8 finally we can calculate the sum of
1.8 finally we can calculate the sum of squares of the error we substract the
squares of the error we substract the mean value of each group from the
mean value of each group from the respective group values so in this group
respective group values so in this group we subtract 5.8 from each individual
we subtract 5.8 from each individual value in this group we subtract
value in this group we subtract 5.4 here we substract six and then we
5.4 here we substract six and then we substract 4.4 this gives us a sum of
substract 4.4 this gives us a sum of squares of
squares of 77.2 the degrees of freedom are 16 and
77.2 the degrees of freedom are 16 and we get a variance of
we get a variance of 4.83 and now we calculate the F values
4.83 and now we calculate the F values these are obtained by dividing the
these are obtained by dividing the variance of factor a b or the
variance of factor a b or the interaction by the arrow variance so we
interaction by the arrow variance so we get the F value for factor a by dividing
get the F value for factor a by dividing in the variance of factor a by the eror
in the variance of factor a by the eror variance which is equal to
variance which is equal to 1.04 we can now do exactly the same for
1.04 we can now do exactly the same for FB and faab to verify we get exactly the
FB and faab to verify we get exactly the same values with data tab
same values with data tab 1.04
1.04 0.17 and
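As a small cross-check, here is a sketch that takes the sums of squares and degrees of freedom reported in this worked example and computes the F values and the corresponding p-values with SciPy.

```python
# Reproducing the last step from the sums of squares of this worked example
# (values as reported above; rounding may differ slightly in the last digit).
from scipy import stats

df_error = 16
ms_error = 77.2 / df_error          # error variance, about 4.83

effects = {                          # effect: (sum of squares, degrees of freedom)
    "drug type (factor A)": (5.0, 1),
    "gender (factor B)":    (0.8, 1),
    "interaction":          (1.8, 1),
}

for name, (ss, dfn) in effects.items():
    f_val = (ss / dfn) / ms_error                # F = MS_effect / MS_error
    p_val = stats.f.sf(f_val, dfn, df_error)     # right tail of the F distribution
    print(f"{name}: F = {f_val:.2f}, p = {p_val:.3f}")
```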
0.17 and 0.37 for the calculation of the P value
0.37 for the calculation of the P value you need the degrees of freedom and the
you need the degrees of freedom and the F distribution so with these three
F distribution so with these three values you can either read the critical
values you can either read the critical P value in a table or as usual you just
P value in a table or as usual you just use a software to calculate the P values
use a software to calculate the P values you can find a table of critical F
you can find a table of critical F values on data tab a g for a
values on data tab a g for a significance level of 5% you can use
significance level of 5% you can use this table if the red value is greater
this table if the red value is greater than the calculated F value the null
than the calculated F value the null hypothesis is rejected otherwise not
hypothesis is rejected otherwise not we've seen how an NOA allows us to
we've seen how an NOA allows us to compare PA means across different groups
compare PA means across different groups to determine if there are significant
to determine if there are significant differences but what if our research
differences but what if our research design involves measurements taken from
design involves measurements taken from the same subjects at different time
the same subjects at different time points this is where the pendency among
points this is where the pendency among the observations comes into play Let's
the observations comes into play Let's dive into how repeated measures and NOA
dive into how repeated measures and NOA adjusts our approach to interconnected
adjusts our approach to interconnected data points this video is about the
data points this video is about the repeated measures and NOA we will go
repeated measures and NOA we will go through the following questions what is
through the following questions what is a repeated measures analysis of variance
a repeated measures analysis of variance what are the hypotheses and the
what are the hypotheses and the assumptions how is a repeated measures
assumptions how is a repeated measures and over calculated how are the results
and over calculated how are the results interpreted and what is a post talk test
interpreted and what is a post talk test and how do you interpret it we'll go
and how do you interpret it we'll go through all points using a simple
through all points using a simple example let's start with the first
example let's start with the first question what is a repeated measures
question what is a repeated measures Anova a repeated measures an is of
Anova a repeated measures an is of variance tests whether there is a
variance tests whether there is a statistically significant difference
statistically significant difference between three or more dependent samples
between three or more dependent samples what are dependent samples in a
what are dependent samples in a dependent sample the same participants
dependent sample the same participants are measured multiple times under
are measured multiple times under different conditions or at different
different conditions or at different time points we therefore have several
time points we therefore have several measurements from each person involved
measurements from each person involved let's take a look at an example let's
let's take a look at an example let's say we want to investigate the effect
say we want to investigate the effect effectiveness of a training program for
effectiveness of a training program for this we've started looking for
this we've started looking for volunteers to participate in order to
volunteers to participate in order to investigate the effectiveness of the
investigate the effectiveness of the program we measure the physical fitness
program we measure the physical fitness of the participants at several points in
of the participants at several points in time before the training program
time before the training program immediately after completion and two
immediately after completion and two months later so for each participant we
months later so for each participant we have a value for physical fitness before
have a value for physical fitness before the program a value immediately after
the program a value immediately after after completion and a value 2 months
after completion and a value 2 months later and since we are measuring the
later and since we are measuring the same participants at different points in
same participants at different points in time we are dealing with dependent
time we are dealing with dependent samples now of course it doesn't have to
samples now of course it doesn't have to be about people or points in time in a
be about people or points in time in a generalized way we can say in a
generalized way we can say in a dependent sample the same test units are
dependent sample the same test units are measured several times under different
measured several times under different conditions the test units can be people
conditions the test units can be people animals or cells for example
animals or cells for example and the conditions can be time points or
and the conditions can be time points or treatments for example but what is the
treatments for example but what is the purpose of repeated measures and over we
purpose of repeated measures and over we want to know whether the fitness program
want to know whether the fitness program has an influence on physical fitness and
has an influence on physical fitness and it is precisely this question that we
it is precisely this question that we can answer with the help of an anova
can answer with the help of an anova with repeated measures physical fitness
with repeated measures physical fitness is therefore our dependent variable and
is therefore our dependent variable and time is our independent variable with
time is our independent variable with time points as levels so the analysis of
time points as levels so the analysis of variance with repeated measures checks
variance with repeated measures checks whether there is a significant
whether there is a significant difference between the different time
difference between the different time points but isn't that what the paired
points but isn't that what the paired samples T Test does doesn't it also test
samples T Test does doesn't it also test whether there is a difference between
whether there is a difference between dependent samples that's correct the
dependent samples that's correct the paired samples T Test evaluates whether
paired samples T Test evaluates whether there is a difference between two
there is a difference between two dependent groups the repeated measures a
dependent groups the repeated measures a Nova extends this concept allow allowing
Nova extends this concept allow allowing you to examine differences among three
you to examine differences among three or more dependent groups what are the
or more dependent groups what are the hypotheses for repeated measures and NOA
hypotheses for repeated measures and NOA the null hypothesis for a repeated
the null hypothesis for a repeated measures a Nova is that there are no
measures a Nova is that there are no differences between the means of the
differences between the means of the different conditions or time points in
different conditions or time points in other words the null hypothesis assumes
other words the null hypothesis assumes that each person has the same value at
that each person has the same value at all times the values of the individual
all times the values of the individual persons themselves May differ but one
persons themselves May differ but one the same person always has the same
the same person always has the same value the alternative hypothesis on the
value the alternative hypothesis on the other hand is that there is a difference
other hand is that there is a difference between the dependent groups in our
between the dependent groups in our example the null hypothesis states that
example the null hypothesis states that the training program has no influence on
the training program has no influence on physical fitness I.E that physical
physical fitness I.E that physical fitness does not change over time and
fitness does not change over time and the alternative hypothesis assumes that
the alternative hypothesis assumes that the training program does have an
the training program does have an influence I.E that physical fitness
influence I.E that physical fitness changes over time to correctly apply
To correctly apply a repeated measures ANOVA, certain assumptions about the data must be fulfilled. Number one, normality: the dependent variable should be approximately normally distributed; this can be tested using a QQ plot or the Kolmogorov-Smirnov test. For more information, please watch my video on testing for normal distribution; you can find the link in the video description. Number two, sphericity: the variances of the differences between all combinations of factor levels or time points should be the same. This can be tested with the help of Mauchly's test for sphericity: if the resulting p-value is greater than 0.05, we can assume that these variances are equal and the assumption is not violated. In this case the p-value is greater than 0.05, and therefore this assumption is fulfilled. If the assumption is violated, adjustments such as Greenhouse-Geisser or Huynh-Feldt can be made.

Now I'll show you how to calculate and interpret an analysis of variance with repeated measures online with DataTab, and then we'll go through the formulas to explain how to calculate it by hand. To calculate an ANOVA online you simply go to datatab.net and copy your own data into this table. I use this example data set; you can find a link to load it in the video description. Make sure that your data is structured correctly, i.e. one row per participant and one column per condition or time point. Now we click on the "Hypothesis test" tab. At the bottom we see the three variables "before", "middle" and "end" from the data set. If we now click on all of them, a repeated measures ANOVA is automatically calculated. First we can check the assumptions: here we see that Mauchly's test for sphericity results in a p-value of 0.357. This value is greater than 0.05, and thus the assumption is fulfilled; if this is not the case, you can apply a sphericity correction. I will explain how to test for normal distribution in a separate video; in our example we will simply assume a normal distribution. If this assumption is not fulfilled, you can calculate the non-parametric counterpart of the repeated measures ANOVA, the Friedman test, which does not require your data to be normally distributed.
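For reference, the Friedman test is also available in SciPy; here is a minimal sketch, where the three arrays are illustrative placeholders and each position refers to the same subject.

```python
# A minimal sketch of the Friedman test with SciPy: one array per time point,
# with the same subjects in the same order in each array.
from scipy import stats

before = [5, 6, 4, 7, 5, 6, 4, 5]   # illustrative values only
middle = [6, 7, 5, 8, 6, 7, 5, 6]
end    = [5, 7, 4, 7, 6, 6, 5, 5]

stat, p = stats.friedmanchisquare(before, middle, end)
print(f"Friedman chi-squared = {stat:.2f}, p = {p:.3f}")
```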
distributed first of all if you do not know exactly how to interpret the
know exactly how to interpret the individual tables in your analysis you
individual tables in your analysis you you can simply click on summary in words
you can simply click on summary in words or on AI interpretation for the tables
or on AI interpretation for the tables but now back to the results first we see
but now back to the results first we see the null and the alternative hypothesis
the null and the alternative hypothesis the null hypothesis is that there is no
the null hypothesis is that there is no difference between the dependent
difference between the dependent variables before in the middle and end
variables before in the middle and end and the alternative hypothesis is that
and the alternative hypothesis is that there is a difference at the end of the
there is a difference at the end of the test we can say whether we reject this
test we can say whether we reject this null hypothesis or not
null hypothesis or not now we see the descriptive statistics
now we see the descriptive statistics and a box plot we then get the results
and a box plot we then get the results of the Anova with repeated measures in
of the Anova with repeated measures in this table the P value is the most
this table the P value is the most important value it is
important value it is 0.01 and indicates the probability that
0.01 and indicates the probability that a sample deviates as much or even more
a sample deviates as much or even more from the null hypothesis as our sample
from the null hypothesis as our sample with a P value of
with a P value of 0.01 the results are statistically
0.01 the results are statistically significant at the conventional
significant at the conventional significance level of
significance level of 0.05 which means that there are
0.05 which means that there are significant differences between the mean
significant differences between the mean values of the three levels before in the
values of the three levels before in the middle and end this rejects the null
middle and end this rejects the null hypothesis and we assume that there is a
hypothesis and we assume that there is a difference between the groups and that
difference between the groups and that the training program or therapy has a
the training program or therapy has a significant effect if you want an
significant effect if you want an interpretation of the other values in
interpretation of the other values in this table simply click on AI
this table simply click on AI interpretation finally here is the table
interpretation finally here is the table for the bonferoni posst Haw test since
for the bonferoni posst Haw test since the p value of the analysis of variance
the p value of the analysis of variance is smaller than
is smaller than 0.05 we know that there is a difference
0.05 we know that there is a difference between one or more groups with the POs
between one or more groups with the POs hog test we can now determine between
hog test we can now determine between which groups this different exists we
which groups this different exists we see that there is a significant
see that there is a significant difference between before and end and in
difference between before and end and in the midle middle and end both have a P
the midle middle and end both have a P value of less than
value of less than 0.05 how do you calculate an analysis of
0.05 how do you calculate an analysis of variance with repeated measures by hand
variance with repeated measures by hand let's say this is our data we have five
let's say this is our data we have five people each of whom we measured at three
people each of whom we measured at three different points in time now we can
different points in time now we can calculate the necessary mean values
calculate the necessary mean values first we calculate the mean value of all
first we calculate the mean value of all the data which is
the data which is 5.4 then we calculate the mean value Val
5.4 then we calculate the mean value Val of the three groups for the first groups
of the three groups for the first groups we get a mean value of five for the
we get a mean value of five for the second a value of 6.1 and for the third
second a value of 6.1 and for the third a value of
a value of 5.1 and finally we can calculate the
5.1 and finally we can calculate the mean value of the three measurements for
mean value of the three measurements for each person so for the first person for
each person so for the first person for example we have an average value of
example we have an average value of eight over the three measurements and
eight over the three measurements and for the last person we have an average
for the last person we have an average value of five now that we have all all
value of five now that we have all all the mean values we need to calculate the
the mean values we need to calculate the required sums of squares but note our
required sums of squares but note our goal is the so-called F value and
goal is the so-called F value and subsequently calculate a P value from it
subsequently calculate a P value from it there are different ways for getting
there are different ways for getting this F value I will demonstrate one
this F value I will demonstrate one common way how to do this depending on
common way how to do this depending on which statistics textbook you use you
which statistics textbook you use you may come across a different formula but
may come across a different formula but back to the calculation let's start with
back to the calculation let's start with the sum of squares within the subject we
the sum of squares within the subject we obtain this by calculating each
obtain this by calculating each individual value
individual value xmi minus the mean value of the
xmi minus the mean value of the respective subject squaring this and
respective subject squaring this and adding it up so we start with 7 - 8 2ar
adding it up so we start with 7 - 8 2ar + 9 - 8 2ar until finally 3 - 5 and 7 -
+ 9 - 8 2ar until finally 3 - 5 and 7 - 5 we can then calculate the sum of
5 we can then calculate the sum of squares of the treatment I.E the sum of
squares of the treatment I.E the sum of squares of the three points in time we
squares of the three points in time we obtain this by subtracting the total
obtain this by subtracting the total mean value from each group mean value
mean value from each group mean value squaring it and adding it Up N is the
squaring it and adding it Up N is the number of people in a group so we get 5
number of people in a group so we get 5 - 5.4 sared + 6.1 - 5.4 sared + 5.1 -
- 5.4 sared + 6.1 - 5.4 sared + 5.1 - 5.4 squar now we can calculate the sum
5.4 squar now we can calculate the sum of squares of the residual we get this
of squares of the residual we get this by simply calculating the sum of squares
by simply calculating the sum of squares within the subjects minus the sum of
within the subjects minus the sum of squares of the treatment alternatively
squares of the treatment alternatively we can also use this formula here xmi is
we can also use this formula here xmi is again the value of each individual
again the value of each individual person AI is the mean value of the
person AI is the mean value of the respective group PM is the mean value of
respective group PM is the mean value of the respective person of the three
the respective person of the three points in time and G is the total mean
points in time and G is the total mean value we can then calculate the mean
value we can then calculate the mean squares to do this we divide the
squares to do this we divide the respective sum of squares by the degrees
respective sum of squares by the degrees of freedom the mean square of the
of freedom the mean square of the treatment is therefore calculated by
treatment is therefore calculated by dividing the sum of squares of the
dividing the sum of squares of the treatment by the degrees of freedom of
treatment by the degrees of freedom of the treatment the degrees of freedom of
the treatment the degrees of freedom of the treatment are the number of factor
the treatment are the number of factor levels minus one so we have three time
levels minus one so we have three time points minus one which is two the mean
points minus one which is two the mean Square of the residual is obtained in
Square of the residual is obtained in the same way here the degrees of freedom
the same way here the degrees of freedom are the number of factor levels minus 1
are the number of factor levels minus 1 times the number of subjects minus 1 we
times the number of subjects minus 1 we get 2 * 7 which is equal to 14 now we
get 2 * 7 which is equal to 14 now we calculate the F value which is done by
calculate the F value which is done by dividing the mean square of the
dividing the mean square of the treatment by the mean square of the
treatment by the mean square of the residual or error finally we calculate
residual or error finally we calculate the P value using the F value and the
the P value using the F value and the degrees of freedom from the treatment
degrees of freedom from the treatment and residual to calculate the P value
and residual to calculate the P value you can simply go to this page on data
you can simply go to this page on data tab the link can be found in the video
tab the link can be found in the video description here you can enter your
description here you can enter your values our F value is
values our F value is 1.69 the numerator degree of Freedom I.E
1.69 the numerator degree of Freedom I.E that of the treatments is two and the
that of the treatments is two and the denominator degree of Freedom I.E that
denominator degree of Freedom I.E that of the error is four 14 we get a P value
of the error is four 14 we get a P value of
of 0.22 the P value is greater than
0.22 the P value is greater than 0.05 and therefore we do not have enough
0.05 and therefore we do not have enough evidence to reject the null hypothesis
evidence to reject the null hypothesis of course we can then compare the
of course we can then compare the results with data tab to do this we copy
results with data tab to do this we copy the data back into this table and click
the data back into this table and click on the
on the variables we can see that we also get a
variables we can see that we also get a P value of 0.22 here after after
P value of 0.22 here after after exploring how repeated measures Anova
exploring how repeated measures Anova can be used to analyze data we might
can be used to analyze data we might wonder how to handle even more complex
wonder how to handle even more complex designs this is where mixed model Anova
designs this is where mixed model Anova comes in let's find out how this
comes in let's find out how this powerful tool can help us what is a
powerful tool can help us what is a mixed model Anova what are the
mixed model Anova what are the hypotheses and assumptions and how to
hypotheses and assumptions and how to interpret the results of a mixed model
interpret the results of a mixed model Anova this is what we discussed in this
Anova this is what we discussed in this video Let's let's start with the first
video Let's let's start with the first question what is a mixed model Anova a
question what is a mixed model Anova a mixed model Anova is a statistical
mixed model Anova is a statistical method used to analyze data that
method used to analyze data that involves both between subject factors
involves both between subject factors and within subject factors but what are
and within subject factors but what are between subjects factors and within
between subjects factors and within subjects factors let's look at an
subjects factors let's look at an example let's say we want to test
example let's say we want to test whether different diets have an effect
whether different diets have an effect on cholesterol levels we would like to
on cholesterol levels we would like to compare the three diets a b and c so the
compare the three diets a b and c so the factor diet has the three levels a b and
factor diet has the three levels a b and c to test whether there is a difference
c to test whether there is a difference between the diets we are conducting a
between the diets we are conducting a study with 18 participants the
study with 18 participants the individual participants are called
individual participants are called subjects now we randomly assign six
subjects now we randomly assign six participants to each of the three groups
participants to each of the three groups each participant or subject is assigned
each participant or subject is assigned to only one group in this case we have a
to only one group in this case we have a between subjects Factor difference
between subjects Factor difference subjects are exposed to different levels
subjects are exposed to different levels of a factor in this analysis our
of a factor in this analysis our objective is to determine whether
objective is to determine whether significant differences exist in the
significant differences exist in the mean cholesterol levels among the
mean cholesterol levels among the various groups on the study and this is
various groups on the study and this is exactly what a One Way Anova does now of
exactly what a One Way Anova does now of course we could also examine the impact
course we could also examine the impact of one diet across multiple time points
of one diet across multiple time points we could measure the cholesterol levels
we could measure the cholesterol levels at each participant at the start of the
at each participant at the start of the diet after 2 weeks and after 4 weeks so
diet after 2 weeks and after 4 weeks so the factor time has the three level
the factor time has the three level start two weeks and four weeks and in
start two weeks and four weeks and in this case the same subjects are being
this case the same subjects are being exposed to all levels of the factor and
exposed to all levels of the factor and this is called a within subjects Factor
this is called a within subjects Factor the same subjects are exposed to all
the same subjects are exposed to all levels or conditions in this case we
levels or conditions in this case we want to know if there is a difference in
want to know if there is a difference in the mean value of the cholesterol levels
the mean value of the cholesterol levels between the different points in time and
between the different points in time and this is exactly what a repeated measures
this is exactly what a repeated measures Anova does therefore in a repeated
Anova does therefore in a repeated measures Anova we have within subject
measures Anova we have within subject factors so in a between subjects design
factors so in a between subjects design each subject or participant is only
each subject or participant is only assigned to one factor level so that the
assigned to one factor level so that the different subjects only have the
different subjects only have the influence of the respective group in
influence of the respective group in contrast in the within subject design
contrast in the within subject design the same subjects op participants are
the same subjects op participants are exposed to all Factor levels which
exposed to all Factor levels which enables a direct comparison of the
enables a direct comparison of the reactions to each factor level but what
reactions to each factor level but what if we want to test if there is a
if we want to test if there is a difference between Diet a b and c over
difference between Diet a b and c over the different points in time so we want
the different points in time so we want to test if there is a difference between
to test if there is a difference between the diets and if there is a difference
the diets and if there is a difference between the different time points then
between the different time points then we need a mixed model and NOA because we
we need a mixed model and NOA because we have both one between subject factor and
have both one between subject factor and one within subject Factor so in a mixed
one within subject Factor so in a mixed model and NOA we have at least one
model and NOA we have at least one between subjects factor and at least one
between subjects factor and at least one within subjects factor in the same
within subjects factor in the same analysis note a mixed modan NOA is also
analysis note a mixed modan NOA is also called a two Way Anova with repeated
called a two Way Anova with repeated measures because there are two factors
measures because there are two factors and one of them results from repeated
and one of them results from repeated measures therefore the mixed model and
measures therefore the mixed model and Nova tests whether there is a difference
Nova tests whether there is a difference between more than two samples which are
between more than two samples which are divided at least between two factors one
divided at least between two factors one of the factors is a result of
of the factors is a result of measurement repetition with the help of
measurement repetition with the help of a mixed model and NOA you can now answer
a mixed model and NOA you can now answer three things first whether the within
three things first whether the within subject Factor has an effect on the
subject Factor has an effect on the dependent variable second whether the
dependent variable second whether the between subject Factor has an effect on
between subject Factor has an effect on a dependent variable and third whether
a dependent variable and third whether there is a so-called interaction effect
there is a so-called interaction effect between the two factors this gives us a
between the two factors this gives us a good transition to the hypothesis the
good transition to the hypothesis the first null hypothesis is the mean values
first null hypothesis is the mean values of the different measurement times do
of the different measurement times do not differ there are no significant
not differ there are no significant differences between the groups of the
differences between the groups of the within subject Factor then of course
within subject Factor then of course there is the second the means of the
there is the second the means of the different groups of the between subject
different groups of the between subject Factor do not differ and the third Nile
Factor do not differ and the third Nile hypothesis reflects the interaction
hypothesis reflects the interaction effect one factor has no effect on the
effect one factor has no effect on the effect of the other Factor what are the
effect of the other Factor what are the assumptions of a mixed model Nova
assumptions of a mixed model Nova normality the dependent variable should
normality the dependent variable should be approximately normally distributed
be approximately normally distributed within each group of the dependent
within each group of the dependent variables this assumption is especially
variables this assumption is especially important when a sample size is small
important when a sample size is small when a sample size is large and NOA is
when a sample size is large and NOA is somewhat robust to violations of
somewhat robust to violations of normality homogeneity of variances the
normality homogeneity of variances the variances in each group should be equal
variances in each group should be equal in mixed modela NOA this needs to be
in mixed modela NOA this needs to be true for both the within subjects and
true for both the within subjects and between subjects factors the leine stask
between subjects factors the leine stask can be used to check this assumption
can be used to check this assumption homogeneity of co-variances
homogeneity of co-variances sphericity this applies to the within
sphericity this applies to the within subjects factors and assumes that the
subjects factors and assumes that the variance of the differences between all
variance of the differences between all combinations of the different groups are
combinations of the different groups are equal what does that mean let's start
equal what does that mean let's start with the differences between all
with the differences between all combinations to do this we simply need
combinations to do this we simply need to calculate the difference of the first
to calculate the difference of the first group minus the second the difference of
group minus the second the difference of the first group and the third group and
the first group and the third group and the difference of the second group and
the difference of the second group and the third group these calculated
the third group these calculated differences should now have the same
differences should now have the same variance this assumption can be tested
variance this assumption can be tested using marless test of stcity when this
using marless test of stcity when this assumption is violated adjustments to
assumption is violated adjustments to the degrees of freedom such as
the degrees of freedom such as Greenhouse Geer or hind Feld can be used
Greenhouse Geer or hind Feld can be used independence of observations this
independence of observations this assumes that the observations are
assumes that the observations are independent of each other this is a
independent of each other this is a fundamental assumption in an NOA and is
fundamental assumption in an NOA and is usually assured by the study design no
usually assured by the study design no significant outliers outliers can have a
significant outliers outliers can have a disproportionate effect on Anova
disproportionate effect on Anova potentially causing misleading results
potentially causing misleading results it's important to identify and address
it's important to identify and address outliers let let's calculate an example
outliers let let's calculate an example and I'll show you how to interpret the
and I'll show you how to interpret the results let's say this is our data that
results let's say this is our data that we want to analyze we want to know
we want to analyze we want to know whether therapy a b and c and three
whether therapy a b and c and three different time points have an effect on
different time points have an effect on cholesterol levels each row is one
cholesterol levels each row is one person the therapy is the between
person the therapy is the between subject factor and the time with the
subject factor and the time with the levels before the therapy in the middle
levels before the therapy in the middle and at the end of the therapy is the
and at the end of the therapy is the within subject Factor
within subject Factor so the first patient on therapy a had a
so the first patient on therapy a had a cholesterol level of 165 before therapy
cholesterol level of 165 before therapy a cholesterol level of 145 in the middle
a cholesterol level of 145 in the middle and a cholesterol level of 140 at the
and a cholesterol level of 140 at the end let's first calculate the example
end let's first calculate the example online with data Tab and then discuss
online with data Tab and then discuss how to interpret the results to
how to interpret the results to calculate an anova online simply go to
calculate an anova online simply go to data.net and copy your data into this
data.net and copy your data into this table table you can also load this
table table you can also load this example data set using the link in the
example data set using the link in the video description then click on
video description then click on hypothesis testing under this tab you
hypothesis testing under this tab you will find a lot of hypothesis tests and
will find a lot of hypothesis tests and depending on which variable you click on
depending on which variable you click on you will get an appropriate hypothesis
you will get an appropriate hypothesis test suggested if you copy your data up
test suggested if you copy your data up here the variables will appear down here
here the variables will appear down here if the correct scale level is not
if the correct scale level is not automatically detected you can simply
automatically detected you can simply simply change it on variables view for
simply change it on variables view for example if we click on before middle and
example if we click on before middle and end a repeated measur Anova is
end a repeated measur Anova is automatically
automatically calculated but we also want to include
calculated but we also want to include the therapy so we just click on therapy
the therapy so we just click on therapy now we get a mixed model Anova we can
now we get a mixed model Anova we can read the three null and the three
read the three null and the three alternative hypotheses here then we get
alternative hypotheses here then we get the descriptive statistics output and
the descriptive statistics output and here we see the results of the analysis
here we see the results of the analysis of variance and also the post talk test
of variance and also the post talk test we'll look at these again in detail in a
we'll look at these again in detail in a moment but if you don't know exactly how
moment but if you don't know exactly how to interpret the results you can also
to interpret the results you can also click on summary in
click on summary in wordss but now back to the results most
wordss but now back to the results most important in this table are these three
important in this table are these three rows with these three rows you can check
rows with these three rows you can check if the three n hypotheses we stated
if the three n hypotheses we stated before are rejected or not the first row
before are rejected or not the first row test
test hypothesis whether cholesterol level
hypothesis whether cholesterol level changes over time so whether the
changes over time so whether the therapies have an effect on cholesterol
therapies have an effect on cholesterol level the second row tests whether there
level the second row tests whether there is a difference between the respective
is a difference between the respective therapy forms with respect to
therapy forms with respect to cholesterol level and the last row
cholesterol level and the last row checks if there is an interaction
checks if there is an interaction between the two factors you can read the
between the two factors you can read the P value in each case right at the back
P value in each case right at the back here let's say we set the significant
here let's say we set the significant level at 5% if our calculated P value is
level at 5% if our calculated P value is less than
less than 0.05 then the respective null hypothesis
0.05 then the respective null hypothesis is rejected and if the calculated P
is rejected and if the calculated P value is greater than
value is greater than 0.05 then the null hypothesis is not
0.05 then the null hypothesis is not rejected thus we see here that the P
rejected thus we see here that the P value of before middle and end is less
value of before middle and end is less than
than 0.05 and therefore the values at before
0.05 and therefore the values at before middle and end are significantly
middle and end are significantly different in terms of cholesterol levels
different in terms of cholesterol levels the P value in the second row is greater
the P value in the second row is greater than
than 0.05 therefore the types of therapy have
0.05 therefore the types of therapy have no significant influence on the
no significant influence on the cholesterol level it is important to
cholesterol level it is important to note that the mean value over the three
note that the mean value over the three time points is considered here it could
time points is considered here it could also be that in one therapy the blood
also be that in one therapy the blood pressure increases and in the other
pressure increases and in the other therapy the blood pressure pressure
therapy the blood pressure pressure decreases but on average over the time
decreases but on average over the time points the blood pressure is the same if
points the blood pressure is the same if this was the case however we would have
this was the case however we would have an interaction between the therapies and
an interaction between the therapies and the time we test this with the last
the time we test this with the last hypothesis in this case there is no
hypothesis in this case there is no significant interaction between therapy
significant interaction between therapy and time so there is an influence over
and time so there is an influence over time but it does not matter which
time but it does not matter which therapy is used the therapy has has no
therapy is used the therapy has has no significant influence if one of the two
significant influence if one of the two factors has a significant influence the
factors has a significant influence the following two tables show which of the
following two tables show which of the combinations differ significantly so far
combinations differ significantly so far we've explored various types of Anova
we've explored various types of Anova and test now these are all so-called
and test now these are all so-called parametric tests and require certain
parametric tests and require certain assumptions about the data like
assumptions about the data like normality but what happens if our data
normality but what happens if our data doesn't meet these assumptions
doesn't meet these assumptions this is where nonparametric tests come
this is where nonparametric tests come into play let's compare these two
into play let's compare these two families of tests to understand their
families of tests to understand their differences hi in this video I explain
differences hi in this video I explain the difference between parametric and
the difference between parametric and non-parametric hypothesis testing why
non-parametric hypothesis testing why are you interested in this topic you
are you interested in this topic you want to calculate a hypothesis test but
want to calculate a hypothesis test but you don't know exactly what the
you don't know exactly what the difference is between a parametric and a
difference is between a parametric and a non-parametric metric test and you're
non-parametric metric test and you're wondering when to use which test if you
wondering when to use which test if you want to calculate a hypothesis test you
want to calculate a hypothesis test you must first check the assumptions for the
must first check the assumptions for the test one of the most common assumptions
test one of the most common assumptions is that your data is normally
is that your data is normally distributed in simple terms if your data
distributed in simple terms if your data is normally distributed parametric tests
is normally distributed parametric tests are used such as the T Test analysis of
are used such as the T Test analysis of variance or peeron correlation if your
variance or peeron correlation if your data is not not normally distributed
data is not not normally distributed nonparametric tests are used such as men
nonparametric tests are used such as men with u test or spearman's correlation
with u test or spearman's correlation what about the other assumptions of
what about the other assumptions of course you still need to check whether
course you still need to check whether there are other assumptions for the test
there are other assumptions for the test in general however nonparametric tests
in general however nonparametric tests make fewer assumptions than parametric
make fewer assumptions than parametric tests so why use parametric tests at all
tests so why use parametric tests at all parametric tests are generally more
parametric tests are generally more powerful than non-parametric tests what
powerful than non-parametric tests what does that mean here's an example you
does that mean here's an example you have formulated your null hypothesis man
have formulated your null hypothesis man and women are paid equally whether this
and women are paid equally whether this null hypothesis is rejected depends on
null hypothesis is rejected depends on the difference in salary the dispersion
the difference in salary the dispersion of the data and the sample size in a
of the data and the sample size in a parametric test a smaller difference in
parametric test a smaller difference in salary or a smaller sample is usually
salary or a smaller sample is usually sufficient to reject Al hypothesis if
sufficient to reject Al hypothesis if possible always use parametric tests
possible always use parametric tests what is the structural difference
what is the structural difference between parametric and non-parametric
between parametric and non-parametric tests let's take a look at pearon
tests let's take a look at pearon correlation and spearman's rank
correlation and spearman's rank correlation as well as a t test for
correlation as well as a t test for independent samples and the Man withney
independent samples and the Man withney U test let's start with the Pearson and
U test let's start with the Pearson and Spearman correlation the span rank
Spearman correlation the span rank correlation is the nonparametric
correlation is the nonparametric counterpart to the the Pearson
counterpart to the the Pearson correlation what is the difference
correlation what is the difference between the two correlation coefficients
between the two correlation coefficients spean correlation does not use raw data
spean correlation does not use raw data but the ranks of the data let's look at
but the ranks of the data let's look at an example we measure the reaction time
an example we measure the reaction time of eight computer players and ask their
of eight computer players and ask their age when we calculate a Pearson
age when we calculate a Pearson correlation we simply take the two
correlation we simply take the two varibles reaction time and age and
varibles reaction time and age and calculate the PE and correlation
calculate the PE and correlation coefficient
coefficient however we now want to calculate span's
however we now want to calculate span's rank correlation so first we assign a
rank correlation so first we assign a rank to each person for reaction time
rank to each person for reaction time and age the reaction time is already
and age the reaction time is already sorted by size 12 is the smallest value
sorted by size 12 is the smallest value so gets rank one 15 the second smallest
so gets rank one 15 the second smallest so gets rank two and so on and so forth
so gets rank two and so on and so forth we are now doing the same with ag here
we are now doing the same with ag here we have the smallest value there the
we have the smallest value there the second small
second small here the third smallest fourth smallest
here the third smallest fourth smallest and so on and so forth let's take a look
and so on and so forth let's take a look at this in a scatter plot here we see
at this in a scatter plot here we see the raw data of age and reaction time
the raw data of age and reaction time but now we would like to use the
but now we would like to use the rankings so we form ranks from the
rankings so we form ranks from the variables age and reaction time through
variables age and reaction time through this transformation we have now
this transformation we have now distributed our data more evenly to get
distributed our data more evenly to get span's correlation we we simply
span's correlation we we simply calculate peon correlation from the
calculate peon correlation from the ranks so Spearman correlation is equal
ranks so Spearman correlation is equal to peon correlation only that the ranks
to peon correlation only that the ranks are used instead of raw values what
are used instead of raw values what about a test for independent samples and
about a test for independent samples and the man with u test the T test for
the man with u test the T test for independent samples and the man with u
independent samples and the man with u test check whether there is a difference
test check whether there is a difference between two groups an example is the
between two groups an example is the there a difference between the reaction
there a difference between the reaction time of man and women the man with u
time of man and women the man with u test is the nonparametric counterpart to
test is the nonparametric counterpart to the T test for independent samples but
the T test for independent samples but there is an important difference between
there is an important difference between the two tests the T test for independent
the two tests the T test for independent samples tests whether there is a mean
samples tests whether there is a mean difference for both samples the mean
difference for both samples the mean value is calculated and it is tested
value is calculated and it is tested whether these mean values differ
whether these mean values differ significantly the man with u test on the
significantly the man with u test on the other hand checks whether there is a
other hand checks whether there is a rank sum difference how do we calculate
rank sum difference how do we calculate the rank sums for this purpose we sort
the rank sums for this purpose we sort all persons from the smallest to the
all persons from the smallest to the largest value this person has the
largest value this person has the smallest value so gets rank one that
smallest value so gets rank one that person has the second smallest value so
person has the second smallest value so gets rank two and this person has the
gets rank two and this person has the third smallest value and so on and so
third smallest value and so on and so forth now we have assigned a rank rank
forth now we have assigned a rank rank to each person then we can simply add up
to each person then we can simply add up the ranks of the first group and the
the ranks of the first group and the second group in the first group we get a
second group in the first group we get a rank sum of 42 and in the second group a
rank sum of 42 and in the second group a rank sum of 36 now we can investigate
rank sum of 36 now we can investigate whether there is a significant
whether there is a significant difference between these rank sums if
difference between these rank sums if you want to know more about the man with
you want to know more about the man with u test check out my related video so we
u test check out my related video so we can summarize the raw data I used to
can summarize the raw data I used to parametric test tests and the ranks of
parametric test tests and the ranks of the raw data are used for nonparametric
the raw data are used for nonparametric tests the hypothesis test you use
tests the hypothesis test you use usually depends on how many variables
usually depends on how many variables you have and whether it is an
you have and whether it is an independent or dependent sample in most
independent or dependent sample in most cases there is always a nonparametric
cases there is always a nonparametric counterpart to parametric tests so if
counterpart to parametric tests so if you do not meet the assumptions for the
you do not meet the assumptions for the parametric test you can use the
parametric test you can use the non-parametric counterpart but don't
non-parametric counterpart but don't worry data tab will do its best to help
worry data tab will do its best to help you choose the right hypothesis test of
you choose the right hypothesis test of course you can calculate the most common
course you can calculate the most common parametric and nonparametric test with
parametric and nonparametric test with data tab online simply copy your own
data tab online simply copy your own data into the table and your variables
data into the table and your variables will appear here below now click on the
will appear here below now click on the variables you want to calculate a
variables you want to calculate a hypothesis test for for example if you
hypothesis test for for example if you choose salary and gender a t test will
choose salary and gender a t test will be calculated
here you can check the assumptions if the assumptions are not
assumptions if the assumptions are not met you can simply click on
met you can simply click on nonparametric and a man with the U test
nonparametric and a man with the U test will be
will be calculated if you click on salary and
calculated if you click on salary and Company an analysis of variance is
Company an analysis of variance is calculated or in the nonparametric case
calculated or in the nonparametric case the cross Vol
the cross Vol test as we've seen parametric tests rely
test as we've seen parametric tests rely heavily on the assumption that data are
heavily on the assumption that data are normally
normally distributed this leads us to an
distributed this leads us to an essential step in data analysis checking
essential step in data analysis checking our data for
our data for normality before applying parametric
normality before applying parametric tests it's crucial to check this
tests it's crucial to check this assumption otherwise we would get
assumption otherwise we would get inaccurate results let's now look into
inaccurate results let's now look into different methods and statistical tests
different methods and statistical tests to check out data for normal
to check out data for normal distribution in this video I show you
distribution in this video I show you how to test your data for normal
how to test your data for normal distribution first of all why do you
distribution first of all why do you need normal
need normal distribution let's say you've collected
distribution let's say you've collected data and you want to analyze this data
data and you want to analyze this data with an appropriate hypothesis test for
with an appropriate hypothesis test for example a t test or an analysis of
example a t test or an analysis of variant one of the most common
variant one of the most common requirements for hypothesis testing is
requirements for hypothesis testing is that the data used must be normally
that the data used must be normally distributed data are normally
distributed data are normally distributed if the frequency
distributed if the frequency distribution of the data has this bell
distribution of the data has this bell curve now of course the big question is
curve now of course the big question is how do you know if your data is normally
how do you know if your data is normally distributed or not or how can you test
distributed or not or how can you test that there are two ways either you can
that there are two ways either you can check the normal distribution
check the normal distribution analytically or graphically we now look
analytically or graphically we now look at both in
at both in detail let's start with the analytical
detail let's start with the analytical test for normal distribution in order to
test for normal distribution in order to test your data analytically for normal
test your data analytically for normal distribution there are several test
distribution there are several test procedures the best known are the colog
procedures the best known are the colog of smov test the Shapiro wil test and
of smov test the Shapiro wil test and the Anderson darling test with all these
the Anderson darling test with all these tests you test the null hypothesis that
tests you test the null hypothesis that the data are normally distributed so the
the data are normally distributed so the null hypothesis is that the frequency
null hypothesis is that the frequency distribution of your data fits the
distribution of your data fits the normal distribution in in order to
normal distribution in in order to reject or not reject the null hypothesis
reject or not reject the null hypothesis you get a P value out of all these tests
you get a P value out of all these tests now the big question is whether this P
now the big question is whether this P value is greater or less than
value is greater or less than 0.05 if the P value is less than
0.05 if the P value is less than 0.05 this is interpreted as a
0.05 this is interpreted as a significant deviation from the normal
significant deviation from the normal distribution and you can assume that
distribution and you can assume that your data are not normally distributed
your data are not normally distributed if the P value is greater than
if the P value is greater than 0.05 and you want to be statistically
0.05 and you want to be statistically clean you cannot necessarily say that
clean you cannot necessarily say that the frequency distribution corresponds
the frequency distribution corresponds to the normal distribution you just
to the normal distribution you just cannot disprove the null hypothesis in
cannot disprove the null hypothesis in practice however values greater than
practice however values greater than 0.05 are assumed to be normally
0.05 are assumed to be normally distributed to be on a safe side you
distributed to be on a safe side you should always take a look at the graphic
should always take a look at the graphic solution which we will talk about in a
solution which we will talk about in a moment so in summary all these tests
moment so in summary all these tests give you a P value if this P value is
give you a P value if this P value is less than
less than 0.05 you assume no normal distribution
0.05 you assume no normal distribution if it is greater than 0.05 you assume
if it is greater than 0.05 you assume normal distribution for your information
normal distribution for your information with the Koger of smof test and with the
with the Koger of smof test and with the Anderson darling test you can also test
Anderson darling test you can also test distributions other than the normal
distributions other than the normal distribution now unfortunately there is
distribution now unfortunately there is a big disadvantage of the analytical
a big disadvantage of the analytical methods which is why more and more
methods which is why more and more people are switching to using the
people are switching to using the graphical methods the problem is that
graphical methods the problem is that the calculated P value is influenced by
the calculated P value is influenced by the size of the sample therefore if you
the size of the sample therefore if you have a very small sample your P value
have a very small sample your P value may be much larger than
may be much larger than 0.05 but if you if you have a very large
0.05 but if you if you have a very large sample your P value may be smaller than
sample your P value may be smaller than 0.05 let's assume the distribution in
0.05 let's assume the distribution in your population deviates very slightly
your population deviates very slightly from the normal distribution then if you
from the normal distribution then if you take a very small sample you will get a
take a very small sample you will get a very large P value and thus you will
very large P value and thus you will assume that it is normally distributed
assume that it is normally distributed data however if you take a larger sample
data however if you take a larger sample then a P value Bec becomes smaller and
then a P value Bec becomes smaller and smaller even though the samples come
smaller even though the samples come from the same population with the same
from the same population with the same distribution therefore if you have a
distribution therefore if you have a minimal deviation from the normal
minimal deviation from the normal distribution which isn't actually
distribution which isn't actually relevant the larger your sample the
relevant the larger your sample the smaller the P value becomes with a very
smaller the P value becomes with a very large sample you may even get a P value
large sample you may even get a P value smaller than
smaller than 0.05 and thus reject the hypothesis that
0.05 and thus reject the hypothesis that it is a normal distribution to get
it is a normal distribution to get around this problem graphical methods
around this problem graphical methods are being used more and more we'll come
are being used more and more we'll come to that now if the normal distribution
to that now if the normal distribution is checked graphically you either look
is checked graphically you either look at the histogram or even better at the
at the histogram or even better at the QQ plot if you use the histogram you
QQ plot if you use the histogram you plot the normal distribution in the
plot the normal distribution in the histogram of your data and then you can
histogram of your data and then you can see whether the curve of the normal
see whether the curve of the normal distribution ution roughly corresponds
distribution ution roughly corresponds to that of the normal distribution curve
to that of the normal distribution curve however it is better if you use the
however it is better if you use the so-called quantile quantile plot or QQ
so-called quantile quantile plot or QQ plot for short here the theoretical
plot for short here the theoretical quantiles that the data should have if
quantiles that the data should have if they are perfectly normally distributed
they are perfectly normally distributed and the quantiles of the measured values
and the quantiles of the measured values are compared if the data is perfectly
are compared if the data is perfectly normally distributed all points would
normally distributed all points would lie on the line the more the data
lie on the line the more the data deviates from the line the less it is
deviates from the line the less it is normally distributed in addition data de
normally distributed in addition data de plots the 95% confidence interval if all
plots the 95% confidence interval if all or almost all of your data lies within
or almost all of your data lies within this interval it is a very strong
this interval it is a very strong indication that your data is normally
indication that your data is normally distributed your data would not be
distributed your data would not be normally distributed if for example they
normally distributed if for example they form an arc and lie far away from the
form an arc and lie far away from the line in some areas if you use data tap
line in some areas if you use data tap in order to test for normal distribution
in order to test for normal distribution you get the following evaluation first
you get the following evaluation first you get the analytical test procedures
you get the analytical test procedures clearly arranged in a table then come
clearly arranged in a table then come the graphical test procedures how you
the graphical test procedures how you can test your data with data tab for
can test your data with data tab for normal distribution I will show you now
normal distribution I will show you now just copy your data into to this table
just copy your data into to this table click on descriptive statistics and then
click on descriptive statistics and then select the variable you want to test for
select the variable you want to test for normal distribution for example age
normal distribution for example age after that you can simply click on test
after that you can simply click on test for normal distribution here and you
for normal distribution here and you will get the results down here I know
will get the results down here I know the test procedures are not actually
the test procedures are not actually descriptive methods but if you want to
descriptive methods but if you want to get an overview of your data it's
get an overview of your data it's usually also relevant to look at the
usually also relevant to look at the distribution of your data furthermore if
distribution of your data furthermore if you calculate a hypothesis test for
you calculate a hypothesis test for example whether gender has an influence
example whether gender has an influence on the salary of a person then you can
on the salary of a person then you can check the precond conditions for each
check the precond conditions for each hypothesis test and you will also get
hypothesis test and you will also get the test for normal distribution if the
the test for normal distribution if the pr condition is not met you would click
pr condition is not met you would click on this and a non-parametric test the
on this and a non-parametric test the man Whitney U test would be calculated
man Whitney U test would be calculated the man with u test does not need
the man with u test does not need normally distributed data another
normally distributed data another important assumption is the equality of
important assumption is the equality of variance you can check whether two or
variance you can check whether two or more groups have the same variance using
more groups have the same variance using the LaVine test let's take a look at it
the LaVine test let's take a look at it what is a lavine's test lavine's test
what is a lavine's test lavine's test tests the hypothesis that the variances
tests the hypothesis that the variances are equal in different groups the aim is
are equal in different groups the aim is to determine whether the variances in
to determine whether the variances in different groups are significantly
different groups are significantly different from each other the hypotheses
different from each other the hypotheses for Lavin's test are as follows the null
for Lavin's test are as follows the null hypothesis is the variances of the
hypothesis is the variances of the groups are equal and the alternative
groups are equal and the alternative hypothesis is at least one of the groups
hypothesis is at least one of the groups has a different variance when is
has a different variance when is lavine's test most commonly used
lavine's test most commonly used lavine's test is most often used to test
lavine's test is most often used to test a assumptions for another hypothesis
a assumptions for another hypothesis test what does that mean let's say your
test what does that mean let's say your hypothesis is there is a difference
hypothesis is there is a difference between two medications in terms of
between two medications in terms of perceived pain relief to test this
perceived pain relief to test this hypothesis you've collected data now to
hypothesis you've collected data now to test the hypothesis based on your data
test the hypothesis based on your data you use a hypothesis test such as a T
you use a hypothesis test such as a T Test many hypothesis tests have the
Test many hypothesis tests have the assumption that the variances in each
assumption that the variances in each group are equal and this is where thein
group are equal and this is where thein test comes in it tells us whether this
test comes in it tells us whether this assumption is fulfilled or not how is a
assumption is fulfilled or not how is a lavine's test calculated here's an
lavine's test calculated here's an example we want to know if there is a
example we want to know if there is a significant difference in variance
significant difference in variance between these groups first we simply
between these groups first we simply calculate the mean of each group then we
calculate the mean of each group then we subtract the respective group mean from
subtract the respective group mean from each person
each person the amount of each value is now formed
the amount of each value is now formed so that the negative values become
so that the negative values become positive from these new values the group
positive from these new values the group mean can now be calculated again the
mean can now be calculated again the larger the group mean the greater the
larger the group mean the greater the variance within each group thus there is
variance within each group thus there is a smaller variance in this group than in
a smaller variance in this group than in that group in addition we can calculate
that group in addition we can calculate the total mean value now we can
the total mean value now we can calculate the Square deviations of the
calculate the Square deviations of the group means from the overall mean and
group means from the overall mean and sum them up and then we can calculate
sum them up and then we can calculate the square deviation of the individual
the square deviation of the individual values from the respective group mean
values from the respective group mean and add them up we can now compare the
and add them up we can now compare the two calculated sums and that is exactly
two calculated sums and that is exactly what lavine's test does the test
what lavine's test does the test statistic of lavine's test is obtained
statistic of lavine's test is obtained with this equation n is the number of
with this equation n is the number of cases and I the number of cases in the E
cases and I the number of cases in the E Group set I is the mean value of the E
Group set I is the mean value of the E Group set is the overall average set i j
Group set is the overall average set i j is the respective value in the groups
is the respective value in the groups and K is the number of groups the
and K is the number of groups the calculated test statistic L is equal to
calculated test statistic L is equal to the F statistic therefore with the F
the F statistic therefore with the F value and the degrees of freedom the p
value and the degrees of freedom the p value can be calculated the degrees of
value can be calculated the degrees of freedom result with number of groups
freedom result with number of groups minus one and number of cases minus
minus one and number of cases minus number of groups if the P value is
number of groups if the P value is greater than
greater than 0.05 the null hypothesis that the
0.05 the null hypothesis that the variances are equal is not rejected thus
variances are equal is not rejected thus equality of variance can be assumed if
equality of variance can be assumed if you use data Tab and calculate an
you use data Tab and calculate an analysis of variance you can find
analysis of variance you can find lavine's test under test
assumptions in an independent T Test you will find lavine's test at the bottom of
will find lavine's test at the bottom of the results if equality of variance is
the results if equality of variance is not given you can use the T test for
not given you can use the T test for unequal
unequal variance now that we understand the
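In Python, Levene's test is available directly in scipy. Here's a minimal, hedged sketch with invented pain-relief scores for two medication groups (using deviations from the group means, as described above):

```python
# Levene's test for equality of variances on invented example data.
from scipy import stats

medication_a = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.3]
medication_b = [2.9, 5.6, 3.4, 6.1, 2.5, 5.8, 4.0]

stat, p = stats.levene(medication_a, medication_b, center="mean")
print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")

if p > 0.05:
    print("Equal variances can be assumed; a standard t-test is fine.")
else:
    print("Variances differ; use Welch's t-test (equal_var=False).")
    print(stats.ttest_ind(medication_a, medication_b, equal_var=False))
```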
variance now that we understand the importance of testing for normal
importance of testing for normal distribution we might find ourselves in
distribution we might find ourselves in situations where these assumption are
situations where these assumption are not met in this case we turn to
not met in this case we turn to nonparametric methods that are less
nonparametric methods that are less sensitive to the distribution of data we
sensitive to the distribution of data we will discuss various kinds of tests for
will discuss various kinds of tests for this purpose like man with u test cross
this purpose like man with u test cross called Wallace test will coxen signed
called Wallace test will coxen signed rank test and fredman test let's start
rank test and fredman test let's start with the non-parametric counterpart to
with the non-parametric counterpart to the T test for independent samples The
the T test for independent samples The Man withney U test what is a man Whitney
Man withney U test what is a man Whitney U test and how is it calculated that's
U test and how is it calculated that's what we will discuss in this video let's
what we will discuss in this video let's start with the first question what is a
start with the first question what is a man Whitney UT test a man Whitney UT
man Whitney UT test a man Whitney UT test tests whether there is a difference
test tests whether there is a difference between two independent samples an
between two independent samples an example is there a difference between
example is there a difference between the reaction time of women and men but
the reaction time of women and men but the T test for independent samples does
the T test for independent samples does the same it also tests whether there is
the same it also tests whether there is the difference between two independent
the difference between two independent samples that's right the manw U test is
samples that's right the manw U test is the nonparametric counter part to the T
the nonparametric counter part to the T test for independent samples but there
test for independent samples but there is an important difference between the
is an important difference between the two tests the T test for independent
two tests the T test for independent samples tests whether there is a mean
samples tests whether there is a mean difference for both samples the mean
difference for both samples the mean value is calculated and it is tested
value is calculated and it is tested whether these mean values differ
whether these mean values differ significantly the man would need you
significantly the man would need you test on the other hand checks whether
test on the other hand checks whether there is a rank sum difference how do we
there is a rank sum difference how do we calculate the rank sum for this purpose
calculate the rank sum for this purpose we sort all persons from the smallest to
we sort all persons from the smallest to the largest value this person has the
the largest value this person has the smallest value so gets rank one this
smallest value so gets rank one this person has the second smallest value so
person has the second smallest value so gets rank two and this person has the
gets rank two and this person has the third smallest value and so on and so
third smallest value and so on and so forth now we have assigned a rank to to
forth now we have assigned a rank to to each person then we can simply add up
each person then we can simply add up the ranks of the first group and the
the ranks of the first group and the second group in the first group we get a
second group in the first group we get a rank of 42 and in the second group a
rank of 42 and in the second group a rank of 36 now we can investigate
rank of 36 now we can investigate whether there is a significant
whether there is a significant difference between these rank sums but
difference between these rank sums but more on that later the advantage of
more on that later the advantage of taking the rank sums rather than the
taking the rank sums rather than the difference in means is that the data
difference in means is that the data need not to be normally distributed so
need not to be normally distributed so in contrast to the T Test the data in
in contrast to the T Test the data in the man with u test do not have to be
the man with u test do not have to be normally distributed what are the
normally distributed what are the hypotheses of the Man witne U test the
hypotheses of the Man witne U test the null hypothesis is in the two samples
null hypothesis is in the two samples the rank sums do not differ
the rank sums do not differ significantly the alternative hypothesis
significantly the alternative hypothesis is in the two samples the rank sums do
is in the two samples the rank sums do differ significantly now let's go
Now let's go through everything with an example. First we calculate the example with DataTab, and then we see if we can get the same results by hand. If you like, you can load the sample data set to follow the example; you can find the link in the video description. We simply go to datatab.net and open the statistics calculator. I've already loaded the data from the link here, but you can also copy your own data into this table. Then all you have to do is click on the Hypothesis testing tab and simply select the desired variables. We measure the reaction time of a group of men and women and want to know if there is a difference in reaction time, so we click on response time and gender. We don't want to calculate a t-test for independent samples but a Mann-Whitney U test, so let's just click on Non-parametric test. Here we see the results of the Mann-Whitney U test. If you're not sure how to interpret the results, just click on Summary in words: for the given data, a Mann-Whitney U test showed that the difference between female and male with respect to the dependent variable response time was not statistically significant; thus, the null hypothesis is not rejected.
rejected so we now calculate the man with u test by hand for this we have
with u test by hand for this we have plotted the values in a table on one
plotted the values in a table on one side we have gender with female and male
side we have gender with female and male and on the other side the values for
and on the other side the values for reaction time unfortunately the data is
reaction time unfortunately the data is not normally distributed so we cannot
not normally distributed so we cannot use a t test and we calculate the Man
use a t test and we calculate the Man withney U test instead first we assign a
withney U test instead first we assign a rank to each value we pick the smallest
rank to each value we pick the smallest value which is 33 which gets the rank
value which is 33 which gets the rank one the second smallest value is 34
one the second smallest value is 34 which gets the rank two the third
which gets the rank two the third smallest value is 35 five which gets the
smallest value is 35 five which gets the rank three now we do the same for all
rank three now we do the same for all other values so now we have all ranks
other values so now we have all ranks assigned and we can just add up all the
assigned and we can just add up all the ranks from women and all the ranks from
ranks from women and all the ranks from Men the rank sum is abbreviated with t
Men the rank sum is abbreviated with t and we get T1 for female with 2 + 4 + 7
and we get T1 for female with 2 + 4 + 7 + 9 + 10 + 5 which is 37 now we do the
+ 9 + 10 + 5 which is 37 now we do the same for male here we get get 11 + 1 + 3
same for male here we get get 11 + 1 + 3 + 6 + 8 which is 29 again our Nal
+ 6 + 8 which is 29 again our Nal habesis is that both rank sums are equal
habesis is that both rank sums are equal now we want to calculate the P value for
now we want to calculate the P value for this we have once calculated the rank
this we have once calculated the rank sum for the female participants and we
sum for the female participants and we have the number of cases of six
have the number of cases of six therefore we have six female subjects we
therefore we have six female subjects we can now calculate the U1 that that is
can now calculate the U1 that that is the U for the female participants using
the U for the female participants using this formula here we have N1 and N2 that
this formula here we have N1 and N2 that is the number of cases of female and
is the number of cases of female and male minus the rank sum of the female
male minus the rank sum of the female participants if we insert our values we
participants if we insert our values we get a U1 of 14 we now do exactly the
get a U1 of 14 we now do exactly the same for the male participants and we
same for the male participants and we get a U2 of 16 so now we have calculated
get a U2 of 16 so now we have calculated U1 and U2 the U for the man with u test
U1 and U2 the U for the man with u test is now given by the smaller value of the
is now given by the smaller value of the two so in our case we take the minimum
two so in our case we take the minimum of 14 and 16 this is of course 14 next
of 14 and 16 this is of course 14 next we need to calculate the expected value
we need to calculate the expected value of U which we get by N1 * N2 / 2 in our
of U which we get by N1 * N2 / 2 in our case it is 6 * 5 / 2 and that is equal
case it is 6 * 5 / 2 and that is equal to 15 last but not least we need the
to 15 last but not least we need the standard error of U the standard error
standard error of U the standard error can be calculated with this formula and
can be calculated with this formula and in our case it is equal to
54772 with all these values we can now calculate Z the Z value results with u
calculate Z the Z value results with u minus mu U divided by the standard error
minus mu U divided by the standard error in our case we get 14 - 15 / 5
in our case we get 14 - 15 / 5 54772 which is equal to -
01825 so now we have the set value and with the set value we can calculate the
with the set value we can calculate the P value however it should be noted
P value however it should be noted depending on how large the sample is the
depending on how large the sample is the P value for the man with u test is
P value for the man with u test is calculated in different ways for up to
calculated in different ways for up to 25 cases the exact values are used which
25 cases the exact values are used which can be read from a table table for large
can be read from a table table for large samples the normal distribution of the U
samples the normal distribution of the U value can be used as an approximation in
value can be used as an approximation in our example we would actually use the
our example we would actually use the exact values nevertheless we assume a
exact values nevertheless we assume a normal distribution for this we can
normal distribution for this we can simply go to data and calculate the P
simply go to data and calculate the P value for a given set value the P value
value for a given set value the P value of
of 0.855 is significantly greater than the
0.855 is significantly greater than the significance level of 0.05
significance level of 0.05 and thus the null hypothesis cannot be
and thus the null hypothesis cannot be rejected based on this sample how to
rejected based on this sample how to calculate the man with you test on tide
calculate the man with you test on tide ranks you can learn in our tutorial on
ranks you can learn in our tutorial on data.net you find the link in the video
data.net you find the link in the video description but what if we want to
description but what if we want to compare to dependent samples and need a
compare to dependent samples and need a non-parametric test let's take a look at
non-parametric test let's take a look at the Willcox signed rank test in this
the Willcox signed rank test in this video I will explain the will coxen test
video I will explain the will coxen test to you with will go through what will
to you with will go through what will coxen test is what the assumptions are
coxen test is what the assumptions are and how it is calculated and at the end
and how it is calculated and at the end I will show you how you can easily
I will show you how you can easily calculate the will coxen test online
calculate the will coxen test online with data Tab and we get started right
with data Tab and we get started right now the will coxen test analyzes whether
now the will coxen test analyzes whether there's a difference between two
there's a difference between two dependent samples or not therefore if
dependent samples or not therefore if you have two dependent groups you want
you have two dependent groups you want to test whether there is a difference
to test whether there is a difference between these two groups then and you
between these two groups then and you can use the will coxen test now you
can use the will coxen test now you rightly say hey the T test for dependent
rightly say hey the T test for dependent samples does the same thing it also
samples does the same thing it also tests whether there's a difference
tests whether there's a difference between two dependent groups that's
between two dependent groups that's correct of course the will coxen test is
correct of course the will coxen test is the nonparametric counterpart of the T
the nonparametric counterpart of the T test for dependent samples the special
test for dependent samples the special thing about will coxen test is that your
thing about will coxen test is that your data do not have to be normally
data do not have to be normally distributed to put it simple if your
distributed to put it simple if your data are normally distributed you use a
data are normally distributed you use a parametric test in the case of two
parametric test in the case of two dependent samples this is the T test for
dependent samples this is the T test for dependent
dependent samples if your data is not normally
samples if your data is not normally distributed you use a nonparametric test
distributed you use a nonparametric test in the case of two dependent samples
in the case of two dependent samples this would be the wil coxen
this would be the wil coxen test now of course you could say hm well
test now of course you could say hm well then I'll just always use Willl coxen
then I'll just always use Willl coxen test and I don't even have to check the
test and I don't even have to check the normal distribution at the end of this
normal distribution at the end of this video I will show you why you should
video I will show you why you should always use the T Test if it's possible
always use the T Test if it's possible to do
to do so first I have a little reminder for
so first I have a little reminder for you what dependent samples are in
you what dependent samples are in dependent samples the measured values
dependent samples the measured values are always available in pairs the pairs
are always available in pairs the pairs result from for example repeated
result from for example repeated measures of the same same person but
measures of the same same person but what is the difference now between the T
what is the difference now between the T test for dependent samples and the will
test for dependent samples and the will coxen test the T test for dependent
coxen test the T test for dependent samples tests whether there's a
samples tests whether there's a difference in
difference in means if we have a dependent sample say
means if we have a dependent sample say we took a value from each person once in
we took a value from each person once in the morning and once in the evening then
the morning and once in the evening then we can calculate the difference from
we can calculate the difference from each pair so for example we would have
each pair so for example we would have 45 5 - 34 which equals
45 5 - 34 which equals 11 the T test for dependent sample now
11 the T test for dependent sample now tests whether these differences differ
tests whether these differences differ from zero or not in The Wil coxon test
from zero or not in The Wil coxon test we don't use the differences of means
we don't use the differences of means but we form ranks and then we compare
but we form ranks and then we compare these ranks with each other three is the
these ranks with each other three is the smallest value in terms of amount it
smallest value in terms of amount it gets rank one four four is the second
gets rank one four four is the second smallest value and it gets rank two six
smallest value and it gets rank two six gets rank three and 11 gets rank four we
gets rank three and 11 gets rank four we assign a plus to all positive values and
assign a plus to all positive values and a minus to all negative values but don't
a minus to all negative values but don't worry we will go through this slowly and
worry we will go through this slowly and we'll also look at an example now we go
Now we go to the assumptions and the hypotheses. For the Wilcoxon test, only two dependent random samples with at least ordinally scaled characteristics need to be present; the variables do not have to satisfy any distribution curve. What should be mentioned, however, is that the distribution of the differences of the two dependent samples should be approximately symmetric. The null hypothesis in the Wilcoxon test is that there is no difference in the so-called central tendency of the two samples in the population, that is, there is no difference between the dependent groups. The alternative hypothesis is that there is a difference in the central tendency in the population, so we expect that the two dependent groups are different.
So now we finally look at a quite simple example. Let's say you have measured the reaction time of a small group of people once in the morning and once in the evening, and you want to know if there is a difference between morning and evening. In order to do this you measure the reaction time of seven people in the morning and in the evening, so the measured values are available in pairs, and now you calculate the difference between morning and evening for each person. If the differences were normally distributed you would use a t-test for dependent samples; if not, you would use a Wilcoxon test. Let's just assume that there is no normal distribution and we need to calculate a Wilcoxon test.

In order to do this, the first thing we do is form ranks. We look for the smallest value in terms of amount, that is -2, which gets rank one. The second smallest value is 3, which gets rank two, and so on and so forth until we have ranked all the values. Next we look at the differences and figure out which ones are positive and which are negative; for the negative differences we simply add a minus to the rank. Then we can add up the positive ranks and the negative ranks: for the positive ranks we get 7 + 2 + 3 + 4 + 6, which is equal to 22, and for the negative ranks we get 5 + 1, which is equal to 6. If there is no difference between morning and evening, the positive and negative rank sums should be approximately equal, so the null hypothesis is that both rank sums are equal.

But how can we test this? We use the rank sums to calculate the test statistic W, which is simply the minimum of T+ and T-. In our case it is the minimum of 22 and 6, so the test statistic W is 6. Next we can calculate the value for T+ or T- that we would expect if there were no difference between morning and evening; with mu = n(n + 1) / 4 and n = 7 we get a value of 14. So if there is no difference between morning and evening, we would actually expect a value of 14 for T+ and T-, and thus W would also be 14. Further, we can calculate the standard deviation, which is given by a, to be fair, somewhat more complicated formula, sigma = sqrt(n(n + 1)(2n + 1) / 24). Once we have finished with that, we can calculate the z value: the z value is obtained by taking W minus mu and dividing that by the standard deviation. So we compare the value that would be expected if there were no difference with the value that actually occurred. Note that a normal distribution of W is only assumed if there are more than 25 cases, in which case we can calculate the z value using this formula. If there are fewer than 25 values, the critical value is read from a table of critical W values. Therefore, in our case, we would actually use the table.
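If you want to reproduce these steps in code, here is a minimal sketch in Python. It only uses the signed ranks given in the example above together with the standard formulas for the expected value and standard deviation of W; with the raw paired measurements you would normally just call scipy.stats.wilcoxon instead.

```python
import math

# Signed ranks from the example above (n = 7 paired differences):
# positive ranks 7, 2, 3, 4, 6 and negative ranks 5, 1
t_plus = 7 + 2 + 3 + 4 + 6        # = 22
t_minus = 5 + 1                   # = 6
W = min(t_plus, t_minus)          # test statistic W = 6

n = 7
mu = n * (n + 1) / 4                                  # expected value of W under H0: 14
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)     # standard deviation of W
z = (W - mu) / sigma                                  # asymptotic z value

print(f"W = {W}, mu = {mu:.0f}, sigma = {sigma:.2f}, z = {z:.2f}")
# With fewer than 25 pairs you would compare W with a table of critical values
# instead of relying on the z approximation.
```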
Now I will show you how you can easily calculate the Wilcoxon test online, and then I will go into why you should always prefer the dependent t-test to the Wilcoxon test if it is possible. In order to calculate the Wilcoxon test, simply go to datatab.net (you will also find the link in the video description) and copy your own data into this table. Then you click on this tab and you will see the names of all the variables that you copied into the table above. Underneath this tab many hypothesis tests are summarized, and DATAtab automatically suggests the appropriate hypothesis test for your data. If you now select morning and evening, DATAtab automatically recognizes that it is a dependent sample and calculates the dependent t-test. But we don't want to calculate a t-test, we want to calculate the Wilcoxon test, so we just click here. Now DATAtab automatically calculates the Wilcoxon test: here we can read the negative and positive ranks, and here we see the z value and the p value. If you don't know exactly how this is interpreted, just look at the summary in words. It says that the morning group had lower values than the evening group, and a Wilcoxon test showed that this difference was not statistically significant, p = 0.312.

And now we come, as promised, to the point of why you should always prefer a parametric test, for example the t-test, to a nonparametric test. We already discussed that the Wilcoxon test has fewer requirements than a t-test, so of course the question may be: why use parametric tests like the t-test at all? Parametric tests usually have greater test power than nonparametric tests. What does that mean? Say you have formulated your null hypothesis, for example that reaction time is the same in the morning and in the evening. Whether the null hypothesis is rejected depends, among other things, on the size of the difference in reaction time and also on the sample size. In a parametric test, a smaller difference or a smaller sample is usually sufficient to reject the null hypothesis. So, if possible, always use a parametric test.
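To illustrate this idea of test power, here is a small simulation sketch in Python. The sample size, effect size and number of runs are arbitrary illustration values, not numbers from the video; with normally distributed paired differences, the t-test should reject the null hypothesis a bit more often than the Wilcoxon test.

```python
import numpy as np
from scipy import stats

# Monte Carlo sketch of "test power": simulate paired differences with a small
# true effect and count how often each test rejects the null hypothesis.
rng = np.random.default_rng(1)
n, effect, alpha, runs = 20, 0.5, 0.05, 2000
reject_t = reject_w = 0

for _ in range(runs):
    diff = rng.normal(loc=effect, scale=1.0, size=n)   # paired differences
    if stats.ttest_1samp(diff, 0.0).pvalue < alpha:
        reject_t += 1
    if stats.wilcoxon(diff).pvalue < alpha:
        reject_w += 1

print(f"t-test rejection rate:   {reject_t / runs:.2f}")
print(f"Wilcoxon rejection rate: {reject_w / runs:.2f}")
```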
And finally, we can take a look at the nonparametric counterparts of the ANOVA. Let's start with the Kruskal-Wallis test. This tutorial is about the Kruskal-Wallis test: if you want to know what the Kruskal-Wallis test is and how it can be calculated and interpreted, you're at the right place, and at the end of this video I will show you how you can easily calculate the Kruskal-Wallis test online. Let's get started right now.

The Kruskal-Wallis test is a hypothesis test that is used when you want to test whether there is a difference between several independent groups. Now you may wonder a little bit and say: hey, if there are several independent groups, I use an analysis of variance. That's right, but if your data are not normally distributed and the assumptions for the analysis of variance are not met, the Kruskal-Wallis test is used. The Kruskal-Wallis test is the nonparametric counterpart of the single-factor analysis of variance. I will now show you what that means.

There is an important difference between the two tests. The analysis of variance tests whether there is a difference in means: when we have our groups, we calculate the mean of each group and check if all the means are equal. In the Kruskal-Wallis test, on the other hand, we don't check if the means are equal, we check if the rank sums of all the groups are equal. What does that mean? What is a rank and what is a rank sum? In the Kruskal-Wallis test we do not use the actual measured values; instead we sort all people by size, and then the person with the smallest value gets the new value, or rank, one, the person with the second smallest value gets rank two, the person with the third smallest value gets rank three, and so on and so forth until each person has been assigned a rank. Now that we have assigned a rank to each person, we can simply add up the ranks of the first group, the ranks of the second group and the ranks of the third group. In this case we get a rank sum of 42 for the first group, 70 for the second group and 47 for the third group. The big advantage is that if we do not look at the mean differences but at the rank sums, the data does not have to be normally distributed: when using the Kruskal-Wallis test our data does not have to satisfy any distributional form, and therefore it also does not need to be normally distributed. Before we discuss how the Kruskal-Wallis test is calculated (and don't worry, it's really not complicated), we first take a look at the assumptions.
When do we use the Kruskal-Wallis test? We use the Kruskal-Wallis test if we have a nominal or ordinal variable with more than two values and a metric variable. A nominal or ordinal variable with more than two values is, for example, the variable preferred newspaper with the values Washington Post, New York Times and USA Today; it could also be frequency of television viewing with daily, several times a week, rarely and never. A metric variable is, for example, salary, well-being or the weight of people.

What are the assumptions? Only several independent random samples with at least ordinally scaled characteristics must be available; the variables do not have to satisfy any distribution curve. So the null hypothesis is: the independent samples all have the same central tendency and therefore come from the same population, or in other words, there is no difference in the rank sums. And the alternative hypothesis is: at least one of the independent samples does not have the same central tendency as the other samples and therefore comes from a different population, or to say it in other words again, at least one group differs in its rank sum. So the next question is: how do we calculate a Kruskal-Wallis test? It's not difficult.
Let's say you have measured the reaction time of three groups, group A, group B and group C, and now you want to know if there is a difference between the groups in terms of reaction time. Let's say you have written down the measured reaction times in a table, and let's just assume that the data are not normally distributed, so you have to use the Kruskal-Wallis test. Our null hypothesis is that there is no difference between the groups, and we're going to test that right now.

First we assign a rank to each person: this is the smallest value, so this person gets rank one; this is the second smallest value, so this person gets rank two; and we do this for all people. If the groups have no influence on reaction time, the ranks should actually be distributed purely randomly. In the second step we calculate the rank sum and the mean rank for each group. For the first group the rank sum is 2 + 4 + 7 + 9, which is equal to 22, and we have four people in the group, so the mean rank is 22 / 4, which equals 5.5. We do the same for the second group, where we get a rank sum of 27 and a mean rank of 6.75, and for the third group we get a rank sum of 29 and a mean rank of 7.25. Now we can calculate the expected value of the ranks: if there were no difference between the groups, each group would be expected to have a mean rank of 6.5.

We've now almost got everything we need. We interviewed 12 people, so the number of cases is 12; the expected value of the ranks is 6.5; and we have calculated the mean ranks of the individual groups. The degrees of freedom in our case are two, and these are simply given by the number of groups minus one, which makes 3 - 1. Last, we need the variance: the variance of the ranks is given by (n² - 1) / 12, where n is again the number of people, so 12, and we get a variance of 11.92. Now we've got everything we need, and with these values we can calculate our test statistic H. The test statistic H corresponds to a chi-square value and is given by this formula, where n is the number of cases, R-bar is the mean rank of each individual group, E is the expected value of the ranks and sigma squared is the variance of the ranks. In our case the number of cases is 12, and since we always have four people per group we can pull the group size out of the sum; 5.5 is the mean rank of group A, 6.75 is the mean rank of group B and 7.25 is the mean rank of group C. This gives us a rounded H value of 0.5.

As we just said, this value corresponds to a chi-square value, so now we can simply read the critical chi-square value from the table of critical chi-square values (you can find this table on datatab.net). We have two degrees of freedom, and if we assume a significance level of 0.05 we get a critical chi-square value of 5.991. Our value is of course smaller than the critical chi-square value, and so, based on our example data, the null hypothesis is retained.
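As a quick cross-check, the H statistic and p value can also be computed in Python from the rank sums given above. This sketch uses the standard rank-sum form of the Kruskal-Wallis statistic (the video describes a version based on the mean ranks and the variance of the ranks); with the raw measurements you could call scipy.stats.kruskal directly.

```python
import numpy as np
from scipy.stats import chi2

# Rank sums and group sizes from the example above (N = 12 people, 3 groups)
rank_sums = np.array([22, 27, 29])
n_i = np.array([4, 4, 4])
N = n_i.sum()

# Standard rank-sum form of the Kruskal-Wallis H statistic (no ties)
H = 12.0 / (N * (N + 1)) * np.sum(rank_sums**2 / n_i) - 3 * (N + 1)
df = len(rank_sums) - 1
p = chi2.sf(H, df)
print(f"H = {H:.3f}, df = {df}, p = {p:.3f}")   # H = 0.500, p = 0.779
```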
Now I will show you how you can easily calculate the Kruskal-Wallis test online with DATAtab. In order to do this you simply visit datatab.net (you will find the link in the video description), click on the statistics calculator and insert your own data into this table. Then you click on this tab; under this tab you will find many hypothesis tests, and when you select the variables you want to test, DATAtab will suggest the appropriate test. After you've copied your data into the table, you will see reaction time and group right here at the bottom. Now we simply click on reaction time and group, and DATAtab automatically calculates an analysis of variance for us. But we don't want an analysis of variance, we want the nonparametric test, so we just click here. Now DATAtab automatically calculates the Kruskal-Wallis test: we get a chi-square value of 0.5, the degrees of freedom are two and the calculated p value is 0.779. Here below you can read the interpretation in words: a Kruskal-Wallis test showed that there is no significant difference between the categories, p = 0.779. Therefore, with the data used, the null hypothesis is not rejected.

If we have three or more dependent samples, we can use the Friedman test as a nonparametric alternative to the repeated-measures ANOVA.
This video is about the Friedman test, and we start right away with the first question: what is the Friedman test? The Friedman test analyzes whether there are statistically significant differences between three or more dependent samples.

What is a dependent sample again? In a dependent sample the measured values are linked. For example, if a sample is drawn of people who have knee surgery and these people are interviewed before the surgery and one and two weeks after the surgery, it is a dependent sample, because the same person was interviewed at multiple time points. Now you might rightly say that the analysis of variance with repeated measures tests exactly the same thing, since it also tests whether there is a difference between three or more dependent samples. That is correct: the Friedman test is the nonparametric counterpart of the analysis of variance with repeated measures.

But what is the difference between the two tests? The analysis of variance tests the extent to which the measured values of the dependent samples differ. In the Friedman test, on the other hand, it is not the actual measured values that are used, but the ranks: the time point where a person has the highest value gets rank one, the time point with the second highest value gets rank two, and the time point with the smallest value gets rank three. This is done for all people, that is, for all rows. Afterwards the ranks of the individual time points are added up: at the first time point we get a sum of seven, at the second time point we get a sum of eight and at the third time point we get a sum of nine. Now we can check how much these rank sums differ from each other.

But why are rank sums used? The big advantage is that if you don't look at the mean differences but at the rank sums, the data doesn't have to be normally distributed. To put it simply: if your data are normally distributed, parametric tests are used, and for more than two dependent samples this is the analysis of variance with repeated measures; if your data are not normally distributed, nonparametric tests are used, and for more than two dependent samples this is the Friedman test. This leads us to the research question that you can answer with the Friedman test: is there a significant difference between more than two dependent groups? Let's have a look at that with an example.
You might be interested to know whether therapy after a slipped disc has an influence on the patient's perception of pain. For this purpose you measure the pain perception before the therapy, in the middle of the therapy and at the end of the therapy, and now you want to know if there is a difference between the different time points. So your independent variable is time, or the therapy progressing over time, and your dependent variable is the pain perception. You now have a history of the pain perception of each person over time, and you want to know whether the therapy has an influence on the pain perception. Simplified: in this case the therapy has an influence, and in that case the therapy has no influence on the pain perception, because over the course of time the pain perception does not change here, while in this case it does.

Now we also have a good transition to the hypotheses. In the Friedman test the null hypothesis is: there are no significant differences between the dependent groups. And the alternative hypothesis is: there is a significant difference between the dependent groups. Of course, as already mentioned, the Friedman test does not use the true values but the ranks; we will go through the formula behind the Friedman test in a moment.

This brings us to the point of how to calculate the Friedman test. For the calculation of the Friedman test you can of course simply use DATAtab, or you can calculate it by hand. To be honest, hardly anyone will calculate the Friedman test by hand, but it will help you to understand how the Friedman test works, and don't worry, it's not that complicated. First I will show you how to calculate the Friedman test with DATAtab, and then I will show you how to do it by hand.

In order to do this, simply go to datatab.net and copy your own data into this table. Let's say you want to investigate whether there is a difference in the response time of people in the morning, at noon and in the evening. We simply click on this tab; under this tab you will find many hypothesis tests, and DATAtab will automatically suggest an appropriate test. If we click on all three variables, morning, noon and evening, DATAtab will automatically calculate an analysis of variance with repeated measures, but in our case we want to calculate the nonparametric test, so we click here. Now we get the results for the Friedman test: up here you can read the descriptive statistics and down here you can find the p value. If you don't know exactly how to interpret the p value, you can just read the interpretation in words down here: a Friedman test showed that there is no significant difference between the variables, chi-square = 2.57, p = 0.276. If your p value is greater than your chosen significance level, your null hypothesis is retained; the null hypothesis is that there is no difference between the groups. Usually a significance level of 0.05 is used, and this p value is greater than 0.05. Additionally, DATAtab gives you post-hoc tests if your p value is smaller than 0.05; the post-hoc test helps you to examine which of the groups really differ.
So now let's look at the equations behind the Friedman test and recalculate this example by hand. Here we have the measured values of the seven people. In the first step we have to assign ranks to the values; in order to do this we look at each row separately. In the first row, which is the first person, 45 is the largest value, so it gets rank one; then comes 36 with rank two and 34 with rank three. We do the same for the second row: here 36 is the largest value and gets rank one, then comes 33 with rank two and 31 with rank three. We do this for each row, so for all people.

Afterwards we can calculate the rank sum for each time point by simply summing up all the ranks at that time point: in the morning we get 17, at noon 11 and in the evening 14. If there were no differences between the time points in terms of reaction time, we would expect the same rank sum at all time points. This expected value is obtained with the formula n(k + 1) / 2, and in this case it is 14. So if there is no difference between morning, noon and evening, we would actually expect a rank sum of 14 at all three time points.

Next we can calculate the chi-square value. We get it with this formula, where n is the number of people, which is seven, k is the number of time points, so three, and the sum of the squared rank sums is 17² + 11² + 14². This gives us a chi-square value of 2.57. Now we need the number of degrees of freedom, which is given by the number of time points minus one, so in our case two. Finally, we can read the critical chi-square value from the table of critical chi-square values: for this we take the predefined significance level, let's say 0.05, and the number of degrees of freedom. Here we can read that the critical chi-square value is 5.99. This is greater than our calculated value, therefore the null hypothesis is not rejected, and based on these data there is no difference between the responsiveness at the different time points. If the calculated chi-square value were greater than the critical one, we would reject the null hypothesis.
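Here too, the hand calculation can be checked with a few lines of Python. The sketch below uses the rank sums from the example and the standard Friedman chi-square formula; if you have the full table of raw values, scipy.stats.friedmanchisquare(morning, noon, evening) does the same job.

```python
import numpy as np
from scipy.stats import chi2

# Rank sums per time point and sample size from the example above
R = np.array([17, 11, 14])   # morning, noon, evening
n, k = 7, 3                  # people, time points

# Standard Friedman chi-square statistic (no ties)
chi2_stat = 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3 * n * (k + 1)
df = k - 1
p = chi2.sf(chi2_stat, df)
print(f"chi-square = {chi2_stat:.2f}, df = {df}, p = {p:.3f}")   # 2.57, p = 0.276
```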
Having explored various nonparametric tests for ordinal or non-normally distributed metric data, let's now shift our focus to nominal data. When our variables are nominal, such as gender or color preferences, we require a different kind of statistical test: the chi-square test. The chi-square test is a powerful tool for analyzing nominal data, so let's get started. What is a chi-square test, and how is the chi-square test calculated? That's what we will discuss in this video.

Let's start with the first question: what is a chi-square test? The chi-square test is a hypothesis test that is used when you want to determine whether there is a relationship between two categorical variables. What are categorical variables again? Categorical variables are, for example, gender with the categories male and female, the preferred newspaper with the categories USA Today, The Wall Street Journal, The New York Times and so on, or the highest educational level with the categories without graduation, college, bachelor's degree and master's degree. So gender, preferred newspaper and highest educational level are all categorical variables. Examples of non-categorical variables are the weight of a person, the salary of a person or the power consumption.

If we now have two categorical variables and we want to test whether there is a relationship between them, we use a chi-square test. For example: is there a relationship between gender and the preferred newspaper? We have two categorical variables, so we use a chi-square test. Another example: is there a relationship between preferred newspaper and highest educational level? Here again we have two categorical variables, so we use a chi-square test. However, there are two things to note. First, the assumption for the chi-square test is that the expected frequencies per cell are greater than five; we'll go over what that means in a moment. Second, the chi-square test uses only the categories and not the rankings, but in the case of the highest educational level there is a ranking of categories. If you want to account for rankings, check out our tutorials on the Spearman correlation, the Mann-Whitney U test or the Kruskal-Wallis test.
But how do we calculate the chi-square test? Let's go through that with an example. We would like to investigate whether gender has an influence on the preferred newspaper, so our question is: is there a relationship between gender and the preferred newspaper? Our null hypothesis is that there is no relationship between gender and the preferred newspaper, and our alternative hypothesis is that there is a relationship between gender and the preferred newspaper. So first we create a questionnaire that asks about gender and the preferred newspaper, and we send out the questionnaire. The results of the survey are displayed in a table in which we see one respondent in each row: the first respondent is male and stated New York Post, the second respondent is female and stated USA Today. We can now copy this table into a statistics software like DATAtab. DATAtab then gives us the so-called contingency table. In this table you can see the variable newspaper and the variable gender, and the number of times each combination occurs is plotted in the cells: for example, in this survey there are 16 people who stated New York Post and male, or 13 people who stated female and New York Post. Now we want to know if gender has an influence on the preferred newspaper, or put another way, is there a relationship between gender and the preferred newspaper? To answer this question we use the chi-square test.

There are two ways we can calculate the chi-square test: either we use a statistical software like DATAtab, or we calculate the chi-square test by hand. We start with the uncomplicated variant and use DATAtab. If you like, you can load the sample data set for the calculation; you can find the link in the video description. To calculate a chi-square test online, simply copy your own data into this table or use the link to load this data set; the variables gender and newspaper then appear here below. Now we click on hypothesis tests. Here you will find a variety of tests, and DATAtab will help you to choose the right one: for example, if we click on gender and newspaper, the chi-square test is calculated automatically.

Now we get the results for the chi-square test. Above we see the contingency table for the variables gender and newspaper; it shows us how often the respective combinations occur in our survey, for example female and USA Today occurs six times. In the second table we can see what the contingency table should look like if the two variables were perfectly independent, that is, if gender had no influence on the preferred newspaper. Here it is important to note that all of the expected frequencies should be larger than five so that the assumptions of the chi-square test are fulfilled, which is the case here. The chi-square test now compares this table with that table, and here we see the results: the p value is 0.918, which is much higher than our significance level of 0.05, and therefore we keep the null hypothesis. If you don't know exactly how to interpret the results, just click on the summary in words: a chi-square test was performed between gender and newspaper. All expected cell frequencies were greater than five, thus the assumptions for the chi-square test were met. There was no statistically significant relationship between gender and newspaper; this results in a p value of 0.918, which is above the defined significance level of 5%. The chi-square test is therefore not significant and the null hypothesis is not rejected. If you're unsure what exactly the p value means, just watch our video about the p value.
value and now we come to the question how to calculate the kai Square test by
how to calculate the kai Square test by hand and we go through the formulas
hand and we go through the formulas needed don't worry it's not difficult we
needed don't worry it's not difficult we need the contingency table with the
need the contingency table with the observed frequencies and the contingency table with the expected frequencies, that is, those frequencies that would occur if the variables were perfectly independent. You can find how to calculate the expected frequencies on DATAtab in the tutorial on the chi-square test. We can now calculate chi-square with this formula: χ² = Σₖ (Oₖ - Eₖ)² / Eₖ, where the index k stands for the respective cell, Oₖ is the observed frequency and Eₖ is the expected frequency. So we get (6 - 6.08)² divided by 6.08, plus, for the next cell, (7 - 6.92)² divided by 6.92. If we do this for all cells and sum them up, we get a chi-square value of 0.504.

Now we would like to calculate the critical chi-square value. What do we need it for? If we use statistical software, we simply get a p-value displayed: if that value is smaller than the significance level, for example 0.05, the null hypothesis is rejected, otherwise not. In our example, the null hypothesis is not rejected. By hand, however, you can't really calculate the p-value, so instead you read off in a table which chi-square value you would get at a p-value of 0.05. This chi-square value is called the critical chi-square value. In order to find the critical chi-square value, we need the degrees of freedom. These are obtained by taking the number of rows minus one times the number of columns minus one. We have four rows and two columns, therefore we get 3 × 1 and thus three degrees of freedom. Now let's take a look at the table of critical chi-square values; you can find this table on DATAtab, the link is in the video description. We select a significance level of 0.05 and have three degrees of freedom, therefore our critical chi-square value is 7.815. The critical chi-square value of 7.815 is larger than our calculated chi-square value of 0.504, thus the null hypothesis is retained.
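The same comparison can be reproduced in a few lines of Python. This is only a sketch assuming SciPy is available; the statistic of 0.504 and the three degrees of freedom are taken from the example above.

```python
from scipy.stats import chi2

chi_square = 0.504          # chi-square statistic computed from the contingency table above
df = (4 - 1) * (2 - 1)      # (rows - 1) * (columns - 1) = 3 degrees of freedom
alpha = 0.05

# Critical value: the chi-square value that corresponds to p = 0.05 with df = 3
critical_value = chi2.ppf(1 - alpha, df)   # about 7.815

# p-value of the observed statistic
p_value = chi2.sf(chi_square, df)

print(critical_value, p_value)
# Since 0.504 < 7.815 (and p > 0.05), the null hypothesis is retained.
```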
Up until now we focused on statistical tests designed to compare groups or categories. Another fundamental aspect of data analysis is understanding the relationship between variables. This is where correlation analysis comes into play. Let's transition from tests of differences to measures of association in the following video. This video is about correlation analysis. We start by asking what a correlation analysis is. We will then look at the most important correlation analyses: the Pearson correlation, the Spearman correlation, Kendall's tau and the point-biserial correlation. And finally we will discuss the difference between correlation and causation.

Let's start with the first question: what is a correlation analysis? Correlation analysis is a statistical method used to measure the relationship between two variables. For example, is there a relationship between a person's salary and age? In this scatter plot, every single point is a person. In correlation analysis we usually want to know two things: number one, how strong the correlation is, and number two, in which direction the correlation goes. We can read both from the correlation coefficient, which lies between -1 and 1. The strength of the correlation can be read from a table: if r is between 0 and 0.1 we speak of no correlation, and if r is between 0.7 and 1 we speak of a very strong correlation. A positive correlation exists when high values of one variable go along with high values of the other variable, or when small values of one variable go along with small values of the other variable. A positive correlation is found, for example, for body size and shoe size; the result is a positive correlation coefficient. A negative correlation exists when high values of one variable go along with low values of the other variable and vice versa. A negative correlation usually exists between product price and sales volume; the result is a negative correlation coefficient. Now, we have different correlation coefficients. The most popular are the Pearson correlation coefficient r, the Spearman correlation coefficient rs, Kendall's tau, and the point-biserial correlation coefficient rpb.

Let's start with the first one, the Pearson correlation coefficient. What is a Pearson correlation? Like all correlation coefficients, the Pearson correlation r is a statistical measure that quantifies the relationship between two variables. In the case of the Pearson correlation, the linear relationship of metric variables is measured (more about metric variables later). So with the help of the Pearson correlation we can measure the linear relationship between two variables, and of course the Pearson correlation coefficient r tells us how strong the correlation is and in which direction it goes.

How is the Pearson correlation calculated? The Pearson correlation coefficient is obtained via this equation: r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²), where r is the Pearson correlation coefficient, xᵢ are the individual values of one variable (for example age), yᵢ are the individual values of the other variable (for example salary), and x̄ and ȳ are respectively the mean values of the two variables. In the equation we can see that the respective mean value is first subtracted from both values. So in our example we calculate the mean values of age and salary, we then subtract the mean values from each person's age and salary, then we multiply both values and sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between -1 and 1. If we now multiply two positive values, we get a positive value, so all points that lie in this area have a positive influence on the correlation coefficient. If we multiply two negative values, we also get a positive value (minus times minus is plus), so all points that lie in this area also have a positive influence on the correlation coefficient. If we multiply a positive value and a negative value, we get a negative value (minus times plus is minus), so all points that lie in these ranges have a negative influence on the correlation coefficient. Therefore, if our values are predominantly in these two areas, we get a positive correlation coefficient and thus a positive relationship; if our values are predominantly in these two areas, we get a negative correlation coefficient and thus a negative relationship. If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out and we get a very small or no correlation.
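To make the formula concrete, here is a minimal numerical sketch. The age and salary values are invented for illustration (they are not the data from the video); the coefficient is computed once directly from the definition and once with NumPy, and both agree.

```python
import numpy as np

# Hypothetical example data (not the data shown in the video)
age = np.array([23, 31, 38, 45, 52, 60])
salary = np.array([32000, 41000, 45000, 52000, 58000, 61000])

# Pearson r straight from the definition:
# subtract the means, multiply, sum, and scale by the denominator
x_dev = age - age.mean()
y_dev = salary - salary.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

# The same value via NumPy's built-in correlation matrix
r_numpy = np.corrcoef(age, salary)[0, 1]

print(r_manual, r_numpy)  # both values are identical
```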
But now there's one more thing to consider: the correlation coefficient is usually calculated with data taken from a sample. However, we often want to test a hypothesis about a population. In the case of correlation analysis we then want to know if there is a correlation in the population. For this we check whether the correlation coefficient in the sample is statistically significantly different from zero. The null hypothesis in the Pearson correlation is: the correlation coefficient does not differ significantly from zero (there is no linear relationship). And the alternative hypothesis is: the correlation coefficient differs significantly from zero (there is a linear relationship). Attention: it is always tested whether the null hypothesis is rejected or not. In our example the research question is: is there a correlation between age and salary in the British population? To find out, we draw a sample and test whether in this sample the correlation coefficient is significantly different from zero. The null hypothesis then is: there is no correlation between salary and age in the British population. And the alternative hypothesis: there is a correlation between salary and age in the British population. Whether the correlation coefficient is significantly different from zero based on the sample collected can be checked using a t-test, t = r · √(n - 2) / √(1 - r²), where r is the correlation coefficient and n is the sample size. A p-value can then be calculated from the test statistic t. If the p-value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected, otherwise it is not.
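The same significance test can be run in one call. A small sketch, assuming SciPy and again using made-up age and salary arrays rather than the video's data set:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical sample (not the video's data)
age = np.array([23, 31, 38, 45, 52, 60, 27, 49])
salary = np.array([32000, 41000, 45000, 52000, 58000, 61000, 35000, 50000])

# pearsonr returns the correlation coefficient r and the two-sided p-value
# of the t-test described above (H0: the correlation in the population is zero)
r, p_value = pearsonr(age, salary)

# Reproduce the test statistic by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(age)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(r, p_value, t_stat)
# If p_value < 0.05, the null hypothesis of "no linear relationship" is rejected.
```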
But what about the assumptions for a Pearson correlation? Here we must distinguish whether we just want to calculate the Pearson correlation or whether we want to test a hypothesis. To calculate the Pearson correlation coefficient, only two metric variables need to be present. Metric variables are, for example, a person's weight, a person's salary or electricity consumption. The Pearson correlation coefficient then tells us how large the linear relationship is; if there is a nonlinear relationship, we cannot tell this from the Pearson correlation coefficient. However, if we want to test whether the Pearson correlation coefficient is significantly different from zero, the two variables must be normally distributed. If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably.
value cannot be interpreted reliably let's continue with the Spearman
let's continue with the Spearman correlation the Spearman rank
correlation the Spearman rank correlation is the nonparametric
correlation is the nonparametric counterpart of the pearon correlation
counterpart of the pearon correlation but there is an important difference
but there is an important difference between both correlation coefficients
between both correlation coefficients Spearman correlation does not use the
Spearman correlation does not use the raw data but the ranks of the data let's
raw data but the ranks of the data let's look at this with an example we measure
look at this with an example we measure the reaction time of eight computer
the reaction time of eight computer players and ask their age when we
players and ask their age when we calculate a peon correlation we simply
calculate a peon correlation we simply take the two variables reaction time and
take the two variables reaction time and age and calculate the peering
age and calculate the peering correlation coefficient however we now
correlation coefficient however we now want to calculate the spe and rank
want to calculate the spe and rank correlation so first we assign a rank to
correlation so first we assign a rank to each person for reaction time and age
each person for reaction time and age the reaction time is already sorted by
the reaction time is already sorted by size 12 is the smallest value so gets
size 12 is the smallest value so gets rank one 15 the second smallest value so
rank one 15 the second smallest value so gets rank two and so on and so forth we
gets rank two and so on and so forth we are now doing the same with h here we
are now doing the same with h here we have the smallest value there the second
have the smallest value there the second smallest there the third smallest fourth
smallest there the third smallest fourth smallest and so on and so forth let's
smallest and so on and so forth let's take a look at this in the skatter plot
take a look at this in the skatter plot here we see the raw data of age and
here we see the raw data of age and reaction time but but now we would like
reaction time but but now we would like to use the rankings so we form ranks
to use the rankings so we form ranks from the variables age and reaction
from the variables age and reaction time through this transformation we have
time through this transformation we have now distributed the data more evenly to
now distributed the data more evenly to calculate the pearon correlation we
calculate the pearon correlation we simply calculate the peon correlation
simply calculate the peon correlation from the ranks so the Spearman
from the ranks so the Spearman correlation is equal to the pon
correlation is equal to the pon correlation only that the ranks are used
correlation only that the ranks are used instead of the raw values let's have a
instead of the raw values let's have a quick look at that in data tab here we
quick look at that in data tab here we have the reaction time and age and there
have the reaction time and age and there we have the just created ranks of
we have the just created ranks of reaction time and age now we can either
reaction time and age now we can either calculate spean correlation of reaction
calculate spean correlation of reaction time and AG where we get a correlation
time and AG where we get a correlation of
of 0.9 or we can calculate pieron
0.9 or we can calculate pieron correlation from the ranks where we also
correlation from the ranks where we also get at 0.9 so exactly the same as before
get at 0.9 so exactly the same as before if you like you can download the data
if you like you can download the data set you can find the link in the video
set you can find the link in the video description if there are no rank ties we
description if there are no rank ties we can also use this equation to calculate
can also use this equation to calculate the pon correlation RS is the spean
the pon correlation RS is the spean correlation n is the number of cases and
correlation n is the number of cases and D is the difference in ranks between the
D is the difference in ranks between the two variables referring to our example
two variables referring to our example we get a different D's with this
we get a different D's with this 1 - 1 is = to 0 2 - 3 is -1 3 - 2 is 1
1 - 1 is = to 0 2 - 3 is -1 3 - 2 is 1 and so on now we Square the individual
and so on now we Square the individual D's and add them all up so the sum of d
D's and add them all up so the sum of d i squ is 8 n which is the number of
i squ is 8 n which is the number of people is eight if you put everything in
people is eight if you put everything in we get a correlation coefficient of
we get a correlation coefficient of 0.9 just like the pieron correlation
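This equivalence (Spearman equals Pearson on the ranks, and the shortcut formula when there are no ties) is easy to verify. A sketch assuming SciPy, with invented reaction-time and age values rather than the exact ones from the video:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

# Hypothetical data for eight players (not the video's exact values)
reaction_time = np.array([12, 15, 18, 21, 25, 28, 33, 40])
age = np.array([14, 19, 17, 22, 30, 33, 37, 45])

# 1) Spearman correlation computed directly
rs, _ = spearmanr(reaction_time, age)

# 2) Pearson correlation of the ranks gives the same value
r_of_ranks, _ = pearsonr(rankdata(reaction_time), rankdata(age))

# 3) Shortcut formula (valid without rank ties): rs = 1 - 6*sum(d^2) / (n*(n^2 - 1))
d = rankdata(reaction_time) - rankdata(age)
n = len(d)
rs_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(rs, r_of_ranks, rs_formula)  # all three agree
```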
Just like the Pearson correlation coefficient r, the Spearman correlation coefficient rs also varies between minus one and one.

Let's continue with Kendall's tau. Kendall's tau is a correlation coefficient and is thus a measure of the relationship between two variables. But what is the difference between the Pearson correlation and Kendall's rank correlation? In contrast to the Pearson correlation, Kendall's rank correlation is a nonparametric test procedure. Thus, for the calculation of Kendall's tau the data need not be normally distributed and the variables need only have an ordinal scale level. Exactly the same is true for the Spearman rank correlation, right? That's right, Kendall's tau is very similar to Spearman's rank correlation coefficient; however, Kendall's tau should be preferred over the Spearman correlation if very few data with many rank ties are available. But how is Kendall's tau calculated? We can calculate Kendall's tau with this formula: τ = (C - D) / (C + D), where C is the number of concordant pairs and D is the number of discordant pairs. What are concordant and discordant pairs? We will now go through this with an example. Suppose two doctors are asked to rank six patients according to their physical health. One of the two doctors is now defined as the reference, and the patients are sorted from 1 to 6. Now the sorted ranks are matched with the ranks of the other doctor, e.g. the patient who is in third place with the reference doctor is in fourth place with the other doctor. Now, using Kendall's tau, we want to know if there is a correlation between the two rankings. For the calculation of Kendall's tau we only need these ranks. We now look at each individual rank and note whether the values below it are smaller or greater than the rank itself. So we start at the first rank, three: one is smaller than three, so it gets a minus; four is greater, so it gets a plus; two is smaller, so it gets a minus; six is greater, so it gets a plus; and five is also greater, so it also gets a plus. Now we do the same for one: here, of course, each subsequent rank is greater than one, so we have a plus everywhere. At rank four, two is smaller, and six and five are greater. Now we do this for rank two and rank six, and then we can easily calculate the number of concordant and discordant pairs. We get the number of concordant pairs by counting all the pluses; in our example we have 11 pluses in total. We get the number of discordant pairs by counting all the minuses; in our example we have a total of 4 minuses. C is thus 11 and D is 4. Kendall's tau now is (11 - 4) divided by (11 + 4), and we get a Kendall's tau of 0.47. There is an alternative formula for Kendall's tau, τ = 2S / (n · (n - 1)), where S is C minus D, therefore 7, and n is the number of cases, i.e. 6. If we insert everything, we also get 7 divided by 15. Just like the Pearson correlation coefficient r, Kendall's tau also varies between -1 and +1.
also varies between minus1 and + one we have again calculated the correlation
have again calculated the correlation coefficient using data from a sample now
coefficient using data from a sample now we can test if the correlation
we can test if the correlation coefficient is significantly different
coefficient is significantly different from zero thus the null hypothesis is
from zero thus the null hypothesis is the correlation coefficient too is equal
the correlation coefficient too is equal to zero there is no relationship and the
to zero there is no relationship and the alternative hypothesis is the
alternative hypothesis is the correlation coefficient to is unequal to
correlation coefficient to is unequal to zero there is a relationship therefore
zero there is a relationship therefore we want to know if the correlation
we want to know if the correlation coefficient is significantly different
coefficient is significantly different from zero you can analyze this either by
from zero you can analyze this either by hand or with a software like data tab
hand or with a software like data tab for the calculation by hand we can use
for the calculation by hand we can use use the set distribution as an
use the set distribution as an approximation however for this we should
approximation however for this we should at least have 40 cases so the six cases
at least have 40 cases so the six cases from our example are actually too few we
from our example are actually too few we get the set value with this formula here
get the set value with this formula here we have Tow and N is the number of cases
we have Tow and N is the number of cases this brings us to the last correlation
this brings us to the last correlation analysis the point by zerial correlation
analysis the point by zerial correlation Point by zal correlation is a special
Point by zal correlation is a special case of Pon correlation and examines the
case of Pon correlation and examines the relationship between a dichotomous
relationship between a dichotomous variable and the metric variable what is
variable and the metric variable what is a dichotomus variable and what is a
a dichotomus variable and what is a matric variable a damous variable is a
matric variable a damous variable is a variable with two values for example
variable with two values for example gender with male and female or smoking
gender with male and female or smoking status with smoker and non-smoker a
status with smoker and non-smoker a matric variable is for example the
matric variable is for example the weight of a person the salary of a
weight of a person the salary of a person or the electricity consumption
person or the electricity consumption so if we have a dichotomus variable and
so if we have a dichotomus variable and a matric variable and we want to know if
a matric variable and we want to know if there is a relationship we can use a
there is a relationship we can use a Point by Zer correlation of course we
Point by Zer correlation of course we need to check the assumptions beforehand
need to check the assumptions beforehand but more about that later how is the
but more about that later how is the point by zerial correlation calculated
point by zerial correlation calculated as stated at the beginning the point by
as stated at the beginning the point by zerial correlation is a special case of
zerial correlation is a special case of the peon correlation but how can we
the peon correlation but how can we calculate the p correlation when a
calculate the p correlation when a variable is nominal let's look at this
variable is nominal let's look at this with an example let's say we are
with an example let's say we are interested in investigating the
interested in investigating the relationship between the number of ours
relationship between the number of ours studied for a test and the test result
studied for a test and the test result pass failed we've calculated data from a
pass failed we've calculated data from a sample of 20 students where 12 students
sample of 20 students where 12 students passed the test and eight students
passed the test and eight students failed we have recorded the number of
failed we have recorded the number of hours each student studed for the test
hours each student studed for the test to calculate the point by Zer
to calculate the point by Zer correlation we first need to convert the
correlation we first need to convert the test result into numbers we can assign a
test result into numbers we can assign a score of one to students who passed the
score of one to students who passed the test and a score of zero to students who
test and a score of zero to students who failed the test now we can either
failed the test now we can either calculate the pon correlation of time
calculate the pon correlation of time and test result or we use the equation
and test result or we use the equation for the point by zero correlation X1
for the point by zero correlation X1 Dash is the mean value of the people who
Dash is the mean value of the people who have passed pass and X2 Dash is the mean
have passed pass and X2 Dash is the mean value of the people who failed N1 is the
value of the people who failed N1 is the number of people who passed and N2 the
number of people who passed and N2 the number of people who failed and N is the
number of people who failed and N is the total number but whether we calculate
total number but whether we calculate the peon correlation or we use the
the peon correlation or we use the equation for the point by zerial
equation for the point by zerial correlation we get the same result both
correlation we get the same result both times let's take a quick look at this in
times let's take a quick look at this in data tab here we have the learning hours
data tab here we have the learning hours the test result was passed and fail fail
the test result was passed and fail fail and there the test result with 0er and
and there the test result with 0er and one we Define the test result with Zer
one we Define the test result with Zer and one as
and one as metric if we now go to correlation and
metric if we now go to correlation and calculate the peeron correlation for
calculate the peeron correlation for these two variables we get a correlation
these two variables we get a correlation coefficient of
coefficient of 0.31 if we calculate the point by zal
0.31 if we calculate the point by zal correlation for learning hours and the
correlation for learning hours and the exam result was passed and failed we
exam result was passed and failed we also get a correlation of
also get a correlation of 0.31 just like the pon correlation
0.31 just like the pon correlation coefficient R the point by zal
coefficient R the point by zal correlation coefficient rpb also varies
correlation coefficient rpb also varies between minus1 and 1 if we have a
between minus1 and 1 if we have a coefficient between minus1 and less than
coefficient between minus1 and less than one there is a negative correlation thus
one there is a negative correlation thus a negative relationship between the
a negative relationship between the variables if we have a coefficient
variables if we have a coefficient between greater than zero and one there
between greater than zero and one there is a positive correlation that is is a
is a positive correlation that is is a positive relationship between the two
positive relationship between the two variables if the result is zero we have
variables if the result is zero we have no correlation as always with the point
no correlation as always with the point by Zer correlation we can also check
by Zer correlation we can also check whether the correlation coefficient is
whether the correlation coefficient is significantly different from zero thus
significantly different from zero thus the null hypothesis is the correlation
the null hypothesis is the correlation coefficient R is equal to zero there is
coefficient R is equal to zero there is no relationship and the alternative
no relationship and the alternative hypothesis is the correlation
hypothesis is the correlation coefficient R is unequal to Z there is a
coefficient R is unequal to Z there is a relationship before we get to the
relationship before we get to the assumptions here's an interesting note
assumptions here's an interesting note when we compute a point by zerial
when we compute a point by zerial correlation we get the same P value as
correlation we get the same P value as when we compute a t test for independent
when we compute a t test for independent samples for the same data so whether we
samples for the same data so whether we test a correlation hypothesis with the
test a correlation hypothesis with the point by zero correlation or a
point by zero correlation or a difference hypothesis with the T Test we
difference hypothesis with the T Test we get the same P value now if we compute a
get the same P value now if we compute a test in data tab with these data and we
test in data tab with these data and we have the null hypothesis there is no
have the null hypothesis there is no difference between the groups failed and
difference between the groups failed and passed in terms of the variable learning
passed in terms of the variable learning hours we get a P value of
hours we get a P value of 0.179 and also if we calculate a point
0.179 and also if we calculate a point by zero correlation and have the null
by zero correlation and have the null hypothesis there is no correlation
hypothesis there is no correlation between learning hours and test results
between learning hours and test results we get a P value of 0.1
we get a P value of 0.1 179 in our example the P value is
179 in our example the P value is greater than
greater than 0.05 which is most often used as a
0.05 which is most often used as a significance level and thus the null
significance level and thus the null hypothesis is not rejected but what
hypothesis is not rejected but what about the assumptions for a point by zal
about the assumptions for a point by zal correlation here we must distinguish
correlation here we must distinguish whether we just want to calculate the
whether we just want to calculate the correlation coefficient or whether we
correlation coefficient or whether we want to test a hypothesis to calculate a
want to test a hypothesis to calculate a correlation coefficient only one matric
correlation coefficient only one matric variable and one damus variable must be
variable and one damus variable must be present however if we want to test
present however if we want to test whether the correlation coefficient is
whether the correlation coefficient is significantly different from zero the
significantly different from zero the one matric variable must also be
one matric variable must also be normally distributed if this is not
normally distributed if this is not given the calculated test statistic t or
given the calculated test statistic t or the P value cannot be interpreted
the P value cannot be interpreted reliably this brings us to the last
reliably this brings us to the last question what is causality and what is
question what is causality and what is the difference between causality and
the difference between causality and correlation causality is the
correlation causality is the relationship between a cause and an
relationship between a cause and an effect in a causal relationship we have
effect in a causal relationship we have a cause and a resultant effect an
a cause and a resultant effect an example coffee contains caffeine a
example coffee contains caffeine a stimulating substance when you drink
stimulating substance when you drink coffee the caffeine enters the body
coffee the caffeine enters the body affects the central nervous system and
affects the central nervous system and leads to increased alerted drinking
leads to increased alerted drinking coffee is the cause of the feeling of
coffee is the cause of the feeling of alerted that comes afterwards without
alerted that comes afterwards without drinking coffee the effect I.E the
drinking coffee the effect I.E the feeling of alerted would not occur but
feeling of alerted would not occur but causality is not always so easy to
causality is not always so easy to determine clear requirements must be met
determine clear requirements must be met in order to speak of a causal
in order to speak of a causal relationship but more about that later
relationship but more about that later so what is the difference between
so what is the difference between correlation and causality a correlation
correlation and causality a correlation tells us that there is a relationship
tells us that there is a relationship between two variables
between two variables example there is a positive correlation
example there is a positive correlation between ice cream sales and the number
between ice cream sales and the number of sunburns however an existing
of sunburns however an existing correlation cannot tell us which
correlation cannot tell us which variable influences which or whether a
variable influences which or whether a third variable is responsible for the
third variable is responsible for the correlation in our example both
correlation in our example both variables are influenced by a common
variables are influenced by a common cause namely sunny weather on sunny days
cause namely sunny weather on sunny days people buy more ice cream and and spend
people buy more ice cream and and spend more time Outdoors this can lead to an
more time Outdoors this can lead to an increased risk of sunburns causality
increased risk of sunburns causality means that there is a clear cause effect
means that there is a clear cause effect relationship between two variables
relationship between two variables causality exists when you can say with
causality exists when you can say with certainity which variable influences
certainity which variable influences which however a common mistake in the
which however a common mistake in the interpretation of Statistics is that a
interpretation of Statistics is that a correlation is immediately assumed to be
correlation is immediately assumed to be a causal relationship here is an example
a causal relationship here is an example example the American statistician Daryl
example the American statistician Daryl Huff found a negative correlation
Huff found a negative correlation between the number of headli and the
between the number of headli and the body temperature of the inhabitants of
body temperature of the inhabitants of an island a negative correlation means
an island a negative correlation means that people with many head lies
that people with many head lies generally have a lower body temperature
generally have a lower body temperature and people with few head lies generally
and people with few head lies generally have a higher body temperature the
have a higher body temperature the Islanders concluded that head lies were
Islanders concluded that head lies were good for health because they reduced
good for health because they reduced fever so their assumption was that
fever so their assumption was that headlights have an effect on the
headlights have an effect on the temperature of the body in reality the
temperature of the body in reality the correct conclusion is the other way
correct conclusion is the other way around in an experiment it was possible
around in an experiment it was possible to prove that high fever drives away the
to prove that high fever drives away the lies so the high body temperature is the
lies so the high body temperature is the cause not the effect what are the
cause not the effect what are the conditions for talking about causality
conditions for talking about causality there are two conditions for causality
there are two conditions for causality number one there is a significant
number one there is a significant correlation between the variables this
correlation between the variables this is easy to check we simply check whether
is easy to check we simply check whether the correlation coefficient is
the correlation coefficient is significantly different from zero number
significantly different from zero number two the second condition can be met in
two the second condition can be met in three ways first chronological sequence
three ways first chronological sequence there is a chronological sequence and
there is a chronological sequence and the results of one variable occurred
the results of one variable occurred before the results of the other variable
before the results of the other variable second experiment a controlled
second experiment a controlled experiment was conducted in which the
experiment was conducted in which the two variables can be specifically
two variables can be specifically influenced and number three Theory there
influenced and number three Theory there is a well-founded and plausible theory
is a well-founded and plausible theory in which direction the causal
in which direction the causal relationship goes if there is only a
relationship goes if there is only a significant correlation but none of the
significant correlation but none of the other three conditions are met we can
other three conditions are met we can only speak of correlation never of
only speak of correlation never of causality after examining how
causality after examining how correlation analysis helps us to
correlation analysis helps us to determine the extent to which variable
determine the extent to which variable are related we now move on into the
are related we now move on into the field of
field of regression first we'll start with an
regression first we'll start with an overview of regression analysis where we
overview of regression analysis where we will break down the fundamentals and
will break down the fundamentals and explore Its Real World application next
explore Its Real World application next we'll dive into simple linear regression
we'll dive into simple linear regression where you will learn how to model the
where you will learn how to model the relationship between two variables then
relationship between two variables then we will move on to the multiple linear
we will move on to the multiple linear regression where we extend the model to
regression where we extend the model to include multiple predictors making our
include multiple predictors making our predictions more powerful and finally
predictions more powerful and finally we'll cover logistic regression which is
we'll cover logistic regression which is essential when working with categorical
essential when working with categorical variables like predicting whether
variables like predicting whether something will happen or not so let's
something will happen or not so let's get started with the first question what
get started with the first question what is a regression analysis A regression
is a regression analysis A regression analysis is a method for modeling
analysis is a method for modeling relationships between variables it makes
relationships between variables it makes it possible to infer or predict a
it possible to infer or predict a variable based on one or more other
variable based on one or more other variables let's say you want to find out
variables let's say you want to find out what influences a person's salary for
what influences a person's salary for example you could take the highest level
example you could take the highest level of Education the weekly working hours
of Education the weekly working hours and the age of a person you could now
and the age of a person you could now investigate whether these three
investigate whether these three variables have an influence on the
variables have an influence on the salary of a person if they do you can
salary of a person if they do you can predict a person's salary by taking the
predict a person's salary by taking the highest level of Education weekly
highest level of Education weekly working hours and a person's age the
working hours and a person's age the variable we want to inere or predict is
variable we want to inere or predict is called the dependent variable the
called the dependent variable the variables used for prediction are called
variables used for prediction are called independent variables depending on your
independent variables depending on your field independent variables may also be
field independent variables may also be called predictor variables or input
called predictor variables or input variables while the dependent variable
variables while the dependent variable might be referred to as the response
might be referred to as the response output or Target variable okay but when
output or Target variable okay but when do we use a regression analysis
do we use a regression analysis regression analysis can be used to
regression analysis can be used to achieve two goals you can measure the
achieve two goals you can measure the influence of one variable or several
influence of one variable or several variables on another variable or you can
variables on another variable or you can predict a variable based on other
predict a variable based on other variables let's go through some examples
variables let's go through some examples let's start by measuring the influence
let's start by measuring the influence of one or more variables on another in
of one or more variables on another in the context of your research you may be
the context of your research you may be interested in understanding the factors
interested in understanding the factors that influence children 's ability to
that influence children 's ability to concentrate specifically you aim to
concentrate specifically you aim to determine whether certain parameters
determine whether certain parameters have a positive or negative impact on
have a positive or negative impact on their concentration but in this case
their concentration but in this case you're not interested in predicting
you're not interested in predicting children's ability to concentrate or you
children's ability to concentrate or you could investigate whether the
could investigate whether the educational level of the parents and the
educational level of the parents and the place of residence have an influence on
place of residence have an influence on the future educational level of children
the future educational level of children this area is therefore very research
this area is therefore very research based and has many applications in
based and has many applications in Social and economic Sciences the second
Social and economic Sciences the second area using regression for predictions is
area using regression for predictions is more application oriented to get the
more application oriented to get the most out of Hospital occupancy you might
most out of Hospital occupancy you might be interested in how long a patient will
be interested in how long a patient will stay in the hospital so based on the
stay in the hospital so based on the characteristics of the prospective
characteristics of the prospective patient such as age reason for stay and
patient such as age reason for stay and pre-existing conditions you want to know
pre-existing conditions you want to know how long that person is like likely to
how long that person is like likely to stay in the hospital based on this
stay in the hospital based on this prediction bath planning can then be
prediction bath planning can then be optimized or as an operator of an online
optimized or as an operator of an online store you are very interested in which
store you are very interested in which product a person is most likely to buy
product a person is most likely to buy you want to suggest this product to the
you want to suggest this product to the visitor in order to increase the sales
visitor in order to increase the sales of the online store second point is
of the online store second point is highly application oriented focusing on
highly application oriented focusing on making predictions to enhance efficiency
making predictions to enhance efficiency okay great now there are different types
okay great now there are different types of regression analysis there is the
of regression analysis there is the simple linear multiple linear and
simple linear multiple linear and logistic regression in simple linear
logistic regression in simple linear regression we use just one independent
regression we use just one independent variable to predict the dependent
variable to predict the dependent variable for example if we want to
variable for example if we want to predict a person's salary we use only
predict a person's salary we use only one variable either if a person has
one variable either if a person has studied or not the weekly working hours
studied or not the weekly working hours or the age of a person multiple linear
or the age of a person multiple linear regret on the other hand uses several
regret on the other hand uses several independent variables to predict or
independent variables to predict or inere the dependent variable I.E the
inere the dependent variable I.E the highest level of edication the number of
highest level of edication the number of hours worked per week and the age of the
hours worked per week and the age of the person therefore the difference between
person therefore the difference between a simple and the multiple regression is
a simple and the multiple regression is that in one case only one independent
that in one case only one independent variable is used and in the other case
variable is used and in the other case several both have in common that the
several both have in common that the dependent variable is matric matric
dependent variable is matric matric variable are for example the salary of a
variable are for example the salary of a person the body size or the electricity
person the body size or the electricity consumption in contrast logistic
consumption in contrast logistic regression is used when you have a
regression is used when you have a categorical dependent variable
categorical dependent variable categorical variables are for example if
categorical variables are for example if a person is at risk of burnout or not if
a person is at risk of burnout or not if a person is diseased or not or type of
a person is diseased or not or type of animal however the most common form of
animal however the most common form of logistic regression is the so-called
logistic regression is the so-called binary logistic regression
binary logistic regression in this case the outcome variable is
in this case the outcome variable is binary meaning it has two possible
binary meaning it has two possible values like yes or no success of failure
values like yes or no success of failure or diseased and not diseased therefore
or diseased and not diseased therefore in linear regression the dependent
in linear regression the dependent variable is a metric variable while in
variable is a metric variable while in logistic regression it is a categorical
logistic regression it is a categorical variable also known as nominal variable
variable also known as nominal variable but what about the independent variables
but what about the independent variables in all cases the level of measure M of
in all cases the level of measure M of the independent variables can be nominal
the independent variables can be nominal ordinal or metric okay actually in
ordinal or metric okay actually in regression you can only use categorical
regression you can only use categorical variables with two categories or levels
variables with two categories or levels such as gender with male and female in
such as gender with male and female in this case we can code one category with
this case we can code one category with zero and the other with one however if a
zero and the other with one however if a variable has more than two categories
variable has more than two categories like vehicle type there's an easy
like vehicle type there's an easy solution we create dummy variables don't
solution we create dummy variables don't worry we'll explain more about dummy
worry we'll explain more about dummy variables later in this playlist okay a
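As a sketch of what that dummy coding looks like in practice (assuming pandas; the vehicle-type values are made up for illustration):

```python
import pandas as pd

# Hypothetical data with a categorical variable that has more than two levels
df = pd.DataFrame({
    "vehicle_type": ["car", "truck", "motorbike", "car", "truck"],
    "price": [20000, 35000, 12000, 22000, 40000],
})

# get_dummies creates one 0/1 column per category;
# drop_first=True keeps k-1 dummy variables for k categories,
# which is the usual coding for regression models
dummies = pd.get_dummies(df, columns=["vehicle_type"], drop_first=True)

print(dummies)
```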
Okay, a quick recap. There is the simple linear regression; a question could be: does the weekly working time have an impact on the hourly wage of employees? Here we only have one independent variable. There is the multiple linear regression: do the weekly working hours and the age of employees have an influence on the hourly wage? Here we have at least two independent variables, in this case weekly working hours and age. And the last case, logistic regression: do the weekly working hours and the age of employees have an influence on the probability that they are at risk of burnout, where "burnout at risk" has the categories yes or no?

Now that we've covered the basics of regression analysis and its applications, let's take a closer look at the simple linear regression. This technique helps us to understand the relationship between two variables, allowing us to predict one based on the other. In the next section we'll break down how it works, go through a real example and show you how to interpret the results. Let's get started. What is a simple linear regression? Simple linear regression is a method to understand the relationship between two variables; you can infer or predict a variable based on another variable. For example, you can predict the annual salary of a person based on the years of work experience, so a simple linear regression can help us understand how salary changes with an increase in years of experience. The variable we want to infer or predict is called the dependent variable; the variable we use for prediction is called the independent variable. Let's look at an example. Imagine we want to predict house prices. The dependent variable, the one we want to predict, is of course the price of the house. The independent variable, the one we use to make the prediction, could be the size of the house in square feet. Of course, you can also use more than one independent variable, for example the construction year or features like whether the house has a swimming pool, the number of bathrooms and the house size, but in this case it would be a multiple linear regression because you have more than one independent variable; more on that in my video about multiple linear regression. Okay, but how do we calculate a simple linear regression? First of all we need data, so we collect information from 10 houses, including their size in square feet and the price they were sold for. Now we can use this data to calculate our regression model, where y is the dependent variable (house price) and x is the independent variable (house size), and we want to use our data to determine the coefficients b and a. But how do we do that? Let's visualize our data using a scatter plot: on the x-axis we plot our independent variable, the house size, and on the y-axis we plot the dependent variable, the house price. Each point is therefore one house with the respective house size and house price. Okay, now we want to summarize all this data using a simple linear regression. To do this, we draw a straight line through the points on the scatter plot. But the line we draw isn't just any random line: it's the line that tries to minimize the error, the distance between the actual data points and the line itself. If we add up the lengths of all the red lines, we get the total error. Our goal is to find the regression line that minimizes this error.
regression line that minimizes this error but how do we actually calculate
error but how do we actually calculate this line This is where the equation of
this line This is where the equation of the linear regression comes into play in
the linear regression comes into play in the equation B is the slope of the line
the equation B is the slope of the line the slope shows how much the house price
the slope shows how much the house price changes if the house size increases by
changes if the house size increases by one square squ food a is the Y intercept
one square squ food a is the Y intercept telling us where the line crosses the Y
telling us where the line crosses the Y AIS so if we have a house with a size of
AIS so if we have a house with a size of zero the model will predict a house
zero the model will predict a house price of a of course predicting the
price of a of course predicting the price of a house with zero size doesn't
price of a house with zero size doesn't make sense however every model is a
make sense however every model is a simplification of the real world and in
simplification of the real world and in the case of simple linear regression our
the case of simple linear regression our model is defined by a regression line
model is defined by a regression line with a specific slope B B and an
with a specific slope B B and an intercept a let's look at this example
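In standard notation, the regression line just described can be written as a short formula, with a and b playing exactly the roles explained above:

```latex
% Simple linear regression line: predicted house price from house size
\hat{y} = a + b \cdot x
% a ... intercept: predicted value of y when x = 0
% b ... slope: change in the predicted y for a one-unit increase in x
```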
Let's look at this example: in this case our intercept is 100, so we enter 100 for a. But how do we read the slope? For this, we take a one-unit step in the independent variable, for example if we move from 1 to 2, and then we observe how much the dependent variable changes with this one-unit increase. In this case, if the independent variable increases by one unit, the dependent variable increases by 50 units, so our b is 50.

Okay, but how do we calculate b and a? There are two ways to do this: we can calculate them by hand or with statistical software like DATAtab. Let's look at this example: how can we calculate b and a by hand? To calculate the slope b we use this formula, where r is the correlation coefficient between X and Y, so in our case the correlation between house size and house price; we get a correlation coefficient of 0.92. sy is the standard deviation of the dependent variable, house price, and sx is the standard deviation of the independent variable, house size. So in this case our b is 108.35.

Here's a quick side note for anyone who wants to follow along and recalculate the example: there are two slightly different formulas for the standard deviation, one divides by n and the other by n minus 1. Without diving into the details now, almost all statistical software uses the formula with n minus 1 to calculate the standard deviation; however, for calculating the regression coefficient b we use the formula that divides by n. If you'd like a more detailed explanation of the standard deviation, feel free to check out my video on that topic.

All right, once we've calculated b, we can find the intercept a using this formula. Here, y bar represents the mean of the house prices, b is the slope we just calculated, and x bar is the mean of the house sizes. Substituting these values, the intercept a comes out to be 6,919.44. So, based on this data, we have now calculated the coefficients b and a. If we insert the numbers for b and a, we get this equation. If we enter zero for X, the house size, we get 6,919.44, which is our intercept. If we increase the house size by one square foot each time, we get a house price that is $108.35 higher each time.
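If you want to reproduce this kind of calculation yourself, here is a minimal Python sketch. The ten data points below are placeholders, not the actual houses from the video, so the resulting numbers will differ from 0.92, 108.35, and 6,919.44; what matters is the formula, including the divide-by-n standard deviation mentioned in the side note.

```python
import numpy as np

# Placeholder data: sizes (square feet) and sale prices of ten hypothetical houses.
size  = np.array([1100, 1300, 1500, 1650, 1800, 2000, 2150, 2300, 2500, 2700])
price = np.array([125_000, 150_000, 160_000, 182_000, 195_000,
                  210_000, 240_000, 255_000, 270_000, 295_000])

r  = np.corrcoef(size, price)[0, 1]   # correlation coefficient between x and y
sy = np.std(price, ddof=0)            # standard deviation dividing by n (not n - 1)
sx = np.std(size,  ddof=0)

b = r * sy / sx                       # slope:     b = r * s_y / s_x
a = price.mean() - b * size.mean()    # intercept: a = y_bar - b * x_bar

print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"predicted price for a 2000 sq ft house: {a + b * 2000:.2f}")
```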
Okay, before we start with the last important topic, the assumptions of a simple linear regression, let's check the results with DATAtab. If you like, you can load this sample data set; the link is in the video description. We want to calculate a regression, so we click on Regression. Here we can now select our dependent variable, house price, and the independent variable, house size. Now let's look at the results. We will focus on this table, which shows the key information we need; if you're curious about the other tables, you can click on AI interpretation for the corresponding table or check out our video on multiple linear regression, where we explain these details in depth.

Okay, in this table we can see the calculated regression coefficients for the constant (we called it the intercept) and for house size. The values match exactly with the ones we calculated by hand, the intercept and the slope. In the results table for a linear regression you'll also see the p-value. What does the p-value tell us? The p-value helps determine whether the relationship between the independent variable and the dependent variable is statistically significant. To test whether the relationship we observe is meaningful or just due to random chance, we start by stating the null hypothesis: there is no relationship between the independent variable and the dependent variable. If the p-value is small, typically smaller than 0.05, we reject the null hypothesis, suggesting a significant relationship between the variables. If the p-value is large, typically greater than 0.05, we fail to reject the null hypothesis, indicating that the observed data may have occurred by chance, with no strong evidence for a relationship. So in our case the p-value is highly significant, indicating strong evidence of a relationship between house price and size. All right, in this example it's pretty obvious that a bigger house typically costs more; however, there are cases where the relationship isn't that clear.
clear and what about the assumptions here are the key assumptions number one
here are the key assumptions number one linear
linear relationship in linear regression a
relationship in linear regression a straight line is drawn through the data
straight line is drawn through the data this straight line should represent All
this straight line should represent All Points as good as possible if the
Points as good as possible if the relation is nonlinear the straight line
relation is nonlinear the straight line cannot fulfill this requirement number
cannot fulfill this requirement number two independence of Errors the error so
two independence of Errors the error so the differences between actual and
the differences between actual and predicted values should be independent
predicted values should be independent of each other this means that the error
of each other this means that the error of one point doesn't affect another
of one point doesn't affect another number three
number three homoscedasticity or equal variance of
homoscedasticity or equal variance of Errors if we plot the errors on the Y
Errors if we plot the errors on the Y AIS and the dependent variable on the
AIS and the dependent variable on the xaxis their spread should be roughly the
xaxis their spread should be roughly the same across all values of X in other
same across all values of X in other words the variance of the error should
words the variance of the error should remain constant in this case the
remain constant in this case the assumption is fulfilled but what about
assumption is fulfilled but what about that case here we observe unequal
that case here we observe unequal variant at low values of X the errors
variant at low values of X the errors are small while at high values the
are small while at high values the variance of the errors becomes much
variance of the errors becomes much larger number four normally distributed
larger number four normally distributed errors the arrows should be normally
errors the arrows should be normally distributed the normality of the arrow
distributed the normality of the arrow can be tested both analytically and
can be tested both analytically and graphically however be cautious with
graphically however be cautious with analytical tests for small samples they
analytical tests for small samples they often indicate normality and with large
often indicate normality and with large samples they quickly become significant
samples they quickly become significant because of these limitations graphical
because of these limitations graphical methods such as the QQ plot are more
methods such as the QQ plot are more commonly used today if you use data tab
commonly used today if you use data tab you just need to click here to check the
you just need to click here to check the assumptions so let's just go through how
assumptions so let's just go through how the assumptions are checked in practice
the assumptions are checked in practice to check for a linear relationship you
to check for a linear relationship you can use a scatter plot plot the
can use a scatter plot plot the independent variable against the
independent variable against the dependent variable if the points form a
dependent variable if the points form a clear straight line pattern a linear
clear straight line pattern a linear relationship exists if not the
relationship exists if not the relationship is likely nonlinear in our
relationship is likely nonlinear in our case we observe a clear linear
case we observe a clear linear relationship to test if the errors are
relationship to test if the errors are normally distributed you can use a QQ
normally distributed you can use a QQ plot or one of the several analytical
plot or one of the several analytical tests with a QQ plot the residuals
tests with a QQ plot the residuals should fall roughly along a straight
should fall roughly along a straight line indicating normality if you use an
line indicating normality if you use an analytical test check whether the
analytical test check whether the calculated P value is greater than
calculated P value is greater than 0.05 if it is not there is evidence that
0.05 if it is not there is evidence that the data are not normally distributed
the data are not normally distributed the choice of test often depends on your
the choice of test often depends on your field of research however as mentioned
field of research however as mentioned the QQ plot is increasingly preferred as
the QQ plot is increasingly preferred as a visual and intuitive way to assess
a visual and intuitive way to assess normality independence of erors can be
normality independence of erors can be tested using the Durban Watson test
tested using the Durban Watson test which checks for autocorrelation in the
which checks for autocorrelation in the residuals if the calculated P value is
residuals if the calculated P value is greater than
greater than 0.05 it indicates that there is no
0.05 it indicates that there is no significant autocorrelation in the
significant autocorrelation in the residuals and the independence
residuals and the independence assumption is likely satisfied
assumption is likely satisfied homoscedasticity can be checked using a
homoscedasticity can be checked using a residual plot where the predicted values
residual plot where the predicted values are plotted on the xais and the
are plotted on the xais and the residuals or arrow on the Y AIS the
residuals or arrow on the Y AIS the residuals should show a consistent
residuals should show a consistent spread across the plot a funnel shape
spread across the plot a funnel shape indicates hat Tois causticity meaning
indicates hat Tois causticity meaning the variance is not constant in our case
the variance is not constant in our case the plot looks acceptable though not
the plot looks acceptable though not perfect if these assumptions are
perfect if these assumptions are violated the regression results might
violated the regression results might not be reliable or meaningful and the
not be reliable or meaningful and the predictions could be inaccurate so
predictions could be inaccurate so always check these assumptions before
always check these assumptions before drawing conclusions from a regression
drawing conclusions from a regression model so far we've seen how simple
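Here is one possible way to run these checks in Python with statsmodels, again on the placeholder house data. Note that statsmodels reports the Durbin-Watson statistic (values near 2 suggest independent errors) rather than a p-value.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

size  = np.array([1100, 1300, 1500, 1650, 1800, 2000, 2150, 2300, 2500, 2700])
price = np.array([125_000, 150_000, 160_000, 182_000, 195_000,
                  210_000, 240_000, 255_000, 270_000, 295_000])

X = sm.add_constant(size)                  # adds the intercept column
model = sm.OLS(price, X).fit()
residuals = model.resid

# Normally distributed errors: the residuals should follow the reference line.
sm.qqplot(residuals, line="45", fit=True)
plt.title("QQ plot of residuals")

# Independence of errors: Durbin-Watson statistic (roughly 2 means no autocorrelation).
print("Durbin-Watson:", round(durbin_watson(residuals), 2))

# Homoscedasticity: residuals vs. predicted values, looking for a constant spread.
plt.figure()
plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```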
So far we've seen how simple linear regression helps us model the relationship between two variables, one dependent and one independent. But what if we have more than one factor influencing our outcome? That's where multiple linear regression comes in. In the next section we'll extend our regression model to include multiple predictors, making our predictions more accurate and realistic. Let's dive in.

What is a multiple linear regression? A multiple linear regression is a method for modeling relationships between variables; it makes it possible to infer or predict a variable based on other variables. An example: let's say you want to find out what influences a person's salary. You take the highest level of education, the weekly working hours, and the age of a person, and you now investigate whether these three variables have an influence on the salary of a person. If they do, you can predict a person's salary from the highest educational level, the weekly working hours, and the person's age. The variables we use for prediction are called independent variables; the variable we want to infer or predict is called the dependent variable.

But what is the difference between a simple linear and a multiple linear regression? As we know from the previous video, in simple linear regression we use just one independent variable to predict the dependent variable; for example, if we want to predict a person's salary, we use either whether the person has studied or not, the weekly working hours, or the person's age. Multiple linear regression, on the other hand, uses several independent variables to predict or infer the dependent variable. Therefore, the difference between a simple and a multiple linear regression is that in one case only one independent variable is used and in the other case several are. Both have in common that the dependent variable is metric; metric variables are, for example, the salary of a person, body size, or electricity consumption. So, unlike simple linear regression, multiple linear regression can include two or more independent variables.

But what impact does that have on the regression equation? In the case of simple linear regression this was our equation: we had one dependent variable Y and one independent variable X. Now, in multiple linear regression, we have more than one independent variable. But don't worry, the coefficients b and a are interpreted similarly to those in a simple linear regression: if all independent variables are zero, the value a is obtained, so we get a value of a for the dependent variable y; furthermore, if an independent variable increases by one unit, the associated coefficient b indicates the corresponding change in the dependent variable.

Okay, let's make one small adjustment going forward: instead of y we'll use y hat. But why? In the previous video we learned that regression aims to model the dependent variable as accurately as possible; however, when working with real-world data there's always some error. In other words, the true values often differ from the predictions. Now y hat represents the predicted values from the regression model, while y denotes the observed, actual values.
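In standard notation, the model just described, with k independent variables, looks like this:

```latex
% Multiple linear regression: prediction from k independent variables
\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
% Each observed value is the prediction plus an error term:
% y = \hat{y} + \varepsilon
```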
Great, you're gradually becoming an expert. Now there are four topics to cover: the assumptions of regression, how to calculate a regression with DATAtab, how to interpret the results, and finally how to handle categorical variables in regression by creating dummy variables. Let's start with the first topic.

So what are the assumptions? The first four assumptions of multiple linear regression are similar to those of simple linear regression, but there is an additional fifth assumption. Let's briefly recap the first four assumptions and then go into more detail on the fifth one. Let's start with the first one, linear relationship. In the case of simple linear regression we were able to test this assumption easily: a straight line is drawn through the data, and this straight line should represent all points as well as possible; if the relationship is nonlinear, the straight line cannot fulfill this requirement. In simple linear regression we have one independent variable and one dependent variable, making it straightforward to plot the data points and the regression line. In contrast, multiple linear regression involves multiple independent variables, which complicates the visualization; however, you can still plot each independent variable separately against the dependent variable to gain an initial sense of whether a linear relationship might exist. Number two, independence of errors: the errors, so the differences between actual and predicted values, should be independent of each other; this means that the error of one point doesn't affect another one. We can test this with the Durbin-Watson test. Number three, homoscedasticity, or equal variance of errors: if we plot the errors on the y-axis and the predicted values from the regression model on the x-axis, their spread should be roughly the same across all values; in other words, the variance of the errors should remain constant. In this case the assumption is fulfilled, but what about that case? Here we observe unequal variance: at low values the errors are small, while at high values the variance of the errors becomes much larger. Number four, normally distributed errors: the errors should be normally distributed. We can test this with a QQ plot or with analytical tests; if you like, you can check out my video on tests for normal distribution for a deeper dive.

And what about the fifth assumption, no multicollinearity? First of all, what is multicollinearity in regression? Multicollinearity means that two or more independent variables are highly correlated with each other; as a result, the effect of the individual variables cannot be clearly separated. Why is that a problem? Let's look at the regression equation again: we have here the dependent variable and there the independent variables with their respective coefficients. For example, if there is a high correlation between X1 and X2, or if these two variables are almost equal, then it is difficult to determine B1 and B2; if both are completely equal, the regression model cannot determine how large B1 and how large B2 should be. This means that one independent variable can be predicted from the others with a high degree of accuracy. An example: imagine you're trying to predict the price of a house, and to do this you use the size of the house, the number of rooms, and some other variables. Usually the size of the house is related to the number of rooms; large houses tend to have more rooms, so these two variables are correlated. If we now include both in our regression model, the model will struggle to decide how much of the price is influenced by size and how much by the number of rooms, because they overlap in the information they provide. And this is multicollinearity: in this case it becomes impossible to reliably determine the regression coefficients. If you just want to use the regression model for prediction, the presence of multicollinearity is less critical; in this context the focus is on how accurate the prediction is rather than on understanding the influence of the individual variables. However, if the regression model is used to assess the influence of the independent variables on the dependent variable, there should be no multicollinearity.

Okay, but how do we detect multicollinearity? If we look at the regression equation again, we have the variables X1, X2, and so on up to XK. We now want to determine whether X1 is nearly identical to any other variable or to a combination of the other variables. For this we simply set up a new regression model in which we take X1 as the new dependent variable and keep the others as independent variables. If we can predict X1 accurately using the other independent variables, X1 becomes unnecessary; its information is already captured by the other variables. We can now do this for all other variables: we estimate X1 using the other independent variables, we estimate X2 using the other variables, and we estimate XK using the other independent variables. Okay, but what method do we use to detect multicollinearity? For all K regression models we calculate R squared, the so-called coefficient of determination. What is the coefficient of determination R squared? If X1 is the dependent variable and the other independent variables are used as predictors, R squared tells us how well the independent variables explain the variability of the dependent variable. Therefore, a high R squared in this context suggests that X1 is highly correlated with the other independent variables, and this is a sign of multicollinearity. Using R squared we can calculate the tolerance and the variance inflation factor, short VIF; basically, the VIF is one divided by the tolerance. If the tolerance is less than 0.1, it indicates potential multicollinearity and caution is required; on the other hand, a VIF value greater than 10 is a warning sign of multicollinearity, requiring further investigation. Typically, statistical programs like DATAtab provide the tolerance and variance inflation factor (VIF) values for each independent variable.
independent variable okay but how to address multicolinearity there are two
address multicolinearity there are two common ways to address multicolinearity
common ways to address multicolinearity number one remove one of the correlated
number one remove one of the correlated variables so choose the variable that is
variables so choose the variable that is less significant and remove it or number
less significant and remove it or number two combine variables create a single
two combine variables create a single variable by combining the correlated
variable by combining the correlated variables e taking an average if you're
variables e taking an average if you're using data Tab and calculate the
using data Tab and calculate the regression you just need to click on
regression you just need to click on test
test assumptions here you can see the table
assumptions here you can see the table with the tolerance and the
with the tolerance and the vif all right let's work through an
vif all right let's work through an example on how to calculate a multip
example on how to calculate a multip linear regression and then look at how
linear regression and then look at how to interpret the results our goal is to
to interpret the results our goal is to analyze the influence of age weight and
analyze the influence of age weight and cholesterol level on blood pressure so
cholesterol level on blood pressure so blood pressure is our dependent variable
blood pressure is our dependent variable while age weight and cholesterol level
while age weight and cholesterol level are our independent variables to
are our independent variables to calculate the regression we just go to
calculate the regression we just go to data.net and copy our data into the
data.net and copy our data into the table if you like you can load the
table if you like you can load the sample data set using the link in the
sample data set using the link in the video description we want to calculate a
video description we want to calculate a regression so we click on regression now
regression so we click on regression now we simply click on BL pressure under
we simply click on BL pressure under dependent variable and age and weight
dependent variable and age and weight and cholesterol level under independent
and cholesterol level under independent variable afterwards we automatically get
variable afterwards we automatically get the results of the regression we will
the results of the regression we will now discuss how to interpret the
now discuss how to interpret the individual tables if you need an
individual tables if you need an interpretation of your individual data
interpretation of your individual data you can just click on AI interpretation
you can just click on AI interpretation at each table and you will get a
at each table and you will get a detailed explanation of your results and
detailed explanation of your results and if you want to test assumptions just
if you want to test assumptions just click here but back to the results let's
click here but back to the results let's start with the most important table the
start with the most important table the table with the regression coefficients
table with the regression coefficients and then take a closer look at the model
and then take a closer look at the model summary table we will focus on these
summary table we will focus on these three columns here we can see the three
three columns here we can see the three independent variables age weight and
independent variables age weight and cholesterol the first row represents the
cholesterol the first row represents the constant so in our regression equation
constant so in our regression equation let's replace X1 X2 and X3 with their
let's replace X1 X2 and X3 with their corresponding names so we want to
corresponding names so we want to predict blood pressure based on a
predict blood pressure based on a person's age weight and cholesterol
person's age weight and cholesterol levels okay in the First Column we see
levels okay in the First Column we see the unstandardized regression
the unstandardized regression coefficients these are our coefficients
coefficients these are our coefficients from the regression equation now we can
from the regression equation now we can calculate the blood pressure for a given
calculate the blood pressure for a given person let's say a person is 55 years
person let's say a person is 55 years old has a weight of 95 kg and a
old has a weight of 95 kg and a cholesterol level of
cholesterol level of 180 then our model would predict a blood
180 then our model would predict a blood pressure of 91 so for example if we look
pressure of 91 so for example if we look at the variable H
at the variable H 0.26 means that for each additional year
0.26 means that for each additional year of age the blood pressure increases by
of age the blood pressure increases by 0.26 units assuming other variables
0.26 units assuming other variables remain constant and what about the
remain constant and what about the standardized coefficients the the
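If you prefer to reproduce such a table in code, here is a sketch with statsmodels. The rows below are invented stand-ins for the video's sample data set, so the coefficients and the predicted value will not match 0.26 or 91 exactly; the workflow, though, is the same.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical observations: blood pressure with age, weight (kg) and cholesterol level.
df = pd.DataFrame({
    "blood_pressure": [128, 135, 118, 142, 130, 125, 150, 138, 122, 145],
    "age":            [45, 52, 38, 61, 49, 41, 66, 55, 36, 58],
    "weight":         [80, 92, 70, 98, 85, 74, 101, 95, 68, 90],
    "cholesterol":    [190, 210, 170, 230, 200, 180, 245, 220, 165, 215],
})

model = smf.ols("blood_pressure ~ age + weight + cholesterol", data=df).fit()
print(model.params)    # constant (intercept) and unstandardized coefficients
print(model.pvalues)   # p-value for each coefficient

# Prediction for a 55-year-old weighing 95 kg with a cholesterol level of 180.
new_person = pd.DataFrame({"age": [55], "weight": [95], "cholesterol": [180]})
print(model.predict(new_person))
```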
And what about the standardized coefficients? The standardized coefficient tells us the relative importance of each independent variable after standardizing the variables to the same scale. Why is this useful? Our model includes variables measured in different units, such as age in years and weight in kilograms. Comparing their unstandardized coefficients can be misleading, because these coefficients are influenced by the units of measurement: for instance, if weight is measured in tons, the coefficient would be larger; if we measured it in grams, it would be smaller. Additionally, the values for age in years are generally smaller than the values for cholesterol, so you cannot directly compare their unstandardized coefficients with each other. In contrast, the standardized coefficients remain consistent regardless of the units; this allows a direct comparison of the relative effects of the different variables. For example, we can see that cholesterol level has the largest standardized coefficient, indicating that it has the strongest influence on blood pressure.
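One way to obtain standardized (beta) coefficients yourself is to z-score every variable before fitting; this sketch reuses the hypothetical blood-pressure data from the previous example, so the relative ordering it prints is illustrative only.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "blood_pressure": [128, 135, 118, 142, 130, 125, 150, 138, 122, 145],
    "age":            [45, 52, 38, 61, 49, 41, 66, 55, 36, 58],
    "weight":         [80, 92, 70, 98, 85, 74, 101, 95, 68, 90],
    "cholesterol":    [190, 210, 170, 230, 200, 180, 245, 220, 165, 215],
})

# z-score every column so all variables are on the same, unit-free scale.
df_z = (df - df.mean()) / df.std()
model_z = smf.ols("blood_pressure ~ age + weight + cholesterol", data=df_z).fit()

# The intercept is essentially zero after standardizing; the remaining coefficients
# can be compared directly to judge the relative importance of each predictor.
print(model_z.params.drop("Intercept"))
```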
In our video about simple linear regression we explained the p-value in detail; the interpretation is similar in multiple linear regression. To summarize: the p-value shows whether the corresponding coefficient is significantly different from zero. In other words, it tells us if a variable has a real influence or if the result could just be due to chance. If the p-value is smaller than 0.05, it means the difference is significant. In our case all p-values are smaller than 0.05, so all variables have a significant influence. Perfect, let's move on to the next table, the model summary table.

First we get the multiple correlation coefficient R. R measures the correlation between the dependent variable and the combination of the independent variables. What does that mean? Here we have the equation for the linear regression; once the coefficients are determined, we can add everything up and calculate the predicted values, the y hats, of the dependent variable. So if we use our example data, we have the real blood pressure data and we can predict the blood pressure data with the regression model. The multiple correlation coefficient R is now the correlation between the predicted values y hat and the actual values y. In other words, the multiple correlation coefficient R indicates the strength of the correlation between the actual dependent variable and its estimated values; therefore, the greater the correlation, the better the regression model. In our case, an R value of 0.72 indicates a strong positive relationship.

Okay, and what about R squared? R squared is called the coefficient of determination. R squared indicates the proportion of the variance in the dependent variable that is explained by the independent variables; the greater the explained variance, the better the model's performance. For example, an R squared value of one would mean that the entire variation in blood pressure can be perfectly explained by the variables age, weight, and cholesterol level; however, in reality this is rarely the case. An R squared of 0.52 means that 52% of the variation in blood pressure is explained by the model. What is the adjusted R squared? The adjusted R squared accounts for the number of independent variables in the model; this provides a more accurate measure of explanatory power. When a model includes many independent variables, the regular R squared can overestimate how well the model explains the data; in such cases it is recommended to consider the adjusted R squared to avoid overestimation.

Okay, and what about the standard error of the estimate? The standard error of the estimate measures the average distance between the observed data points and the regression line. A standard error of the estimate of 6.6 indicates that, on average, the model's predictions deviate from the actual values by 6.6 units. So if we predict a person's blood pressure using their age, weight, and cholesterol level, our prediction will on average deviate by 6.6 units from the person's actual blood pressure. Okay, if you want an interpretation of the other tables, simply click on AI interpretation.
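These summary statistics can also be read straight off a fitted statsmodels model; again, the rows are the hypothetical data from the earlier sketches, so the printed values will not equal the video's R = 0.72, R squared = 0.52, or standard error of 6.6.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "blood_pressure": [128, 135, 118, 142, 130, 125, 150, 138, 122, 145],
    "age":            [45, 52, 38, 61, 49, 41, 66, 55, 36, 58],
    "weight":         [80, 92, 70, 98, 85, 74, 101, 95, 68, 90],
    "cholesterol":    [190, 210, 170, 230, 200, 180, 245, 220, 165, 215],
})
model = smf.ols("blood_pressure ~ age + weight + cholesterol", data=df).fit()

r_squared     = model.rsquared            # proportion of explained variance
multiple_r    = np.sqrt(r_squared)        # correlation between y and y hat
adj_r_squared = model.rsquared_adj        # penalized for the number of predictors
se_estimate   = np.sqrt(model.mse_resid)  # standard error of the estimate

print(f"R = {multiple_r:.2f}, R² = {r_squared:.2f}, "
      f"adjusted R² = {adj_r_squared:.2f}, SE of estimate = {se_estimate:.2f}")
```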
Earlier in this video I mentioned that independent variables in regression analysis can be nominal. Okay, but what are nominal variables? Nominal variables are variables with different categories, like gender with male and female, or vehicle type. But how do we use nominal variables in a regression model as independent variables? Let's keep things simple and start with variables with two categories. Imagine we have the variable gender with the categories male and female. Now we can code female as zero and male as one; the category coded with zero is our so-called reference category. All right, let's take a look at the regression equation. Suppose the variable X1 represents gender; then B1 is the regression coefficient for gender. But how do we interpret B1? We said zero is female and one is male, so let's just insert this for X1: for a female individual we have zero multiplied by B1, and for a male individual we have one multiplied by B1. Accordingly, B1 represents the difference between males and females.

Now that we've discussed how to handle variables with two values, let's explore how to approach variables with more than two values. Let's say we want to predict the fuel consumption of a car based on its horsepower and vehicle type. To keep it simple, let's say there are only three vehicle types: sedan, sports car, and family van. Thus we have a variable, vehicle type, with more than two categories; however, as we know, in a regression we can only include variables with two categories. So what's the solution? This is where dummy variables come into play. Dummy variables are artificial variables that make it possible to handle variables with more than two categories. For the variable vehicle type we create a total of three dummy variables: is-sedan, is-sports-car, and is-family-van. Each of these dummy variables has only two possible values, zero or one: a value of one indicates the presence of the specific category, while a value of zero indicates its absence. Instead of having one variable with three categories, we now have three variables with two categories each, and these newly created dummy variables can be included in the regression model.

Okay, but what does this mean for our data preparation? Initially we have one column labeled vehicle type, where the individual vehicle types from our sample are listed: the first entry is a sedan, the second is also a sedan, the third is a sports car, and so on. From this column we create three new variables. For the first vehicle, which is a sedan, we assign a one under is-sedan and a zero under the others, as it's neither a sports car nor a family van. Similarly, the second vehicle is also a sedan. The third vehicle, however, is a sports car, so we assign a one under is-sports-car and a zero under the others. By doing this, we've successfully created our dummy variables. One important thing to note: the number of dummy variables you create will always be the number of categories minus one. So in our case we have three categories, which means we actually only need two dummy variables. Why is that the case? If we know a vehicle is a sedan, we automatically know it is neither a sports car nor a family van; similarly, if we know it's a sports car, we can infer that it's not a sedan or a family van; and finally, if it's neither a sedan nor a sports car, we know it must be a family van. This means we can express the same information with just two variables instead of three; including all three variables would make the regression model overdetermined.
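In Python, pandas can create these dummy variables in one line; the tiny data frame below is made up to mirror the example, and listing "sedan" first makes it the reference category, so dummies are only created for sports car and family van (categories minus one).

```python
import pandas as pd

cars = pd.DataFrame({
    "vehicle_type":     ["sedan", "sedan", "sports car", "family van", "sports car"],
    "horsepower":       [140, 120, 310, 170, 280],
    "fuel_consumption": [7.1, 6.5, 11.8, 8.9, 10.9],
})

# Put 'sedan' first so it becomes the reference category when drop_first=True.
cars["vehicle_type"] = pd.Categorical(
    cars["vehicle_type"], categories=["sedan", "sports car", "family van"]
)
dummies = pd.get_dummies(cars["vehicle_type"], prefix="is", drop_first=True)

# The dummy columns (is_sports car, is_family van) can now enter the regression model.
print(pd.concat([cars.drop(columns="vehicle_type"), dummies], axis=1))
```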
overdetermined okay but don't worry if you're using data tab it will
you're using data tab it will automatically create the dummy variables
automatically create the dummy variables for you for example if we select fuel
for you for example if we select fuel consumption as the dependent variable
consumption as the dependent variable and horsepower and vehicle type as the
and horsepower and vehicle type as the independent variables we can then see
independent variables we can then see the three categories here and and can
the three categories here and and can select which one to use as the reference
select which one to use as the reference category for example if we choose sedan
category for example if we choose sedan as the reference dummy variables will be
as the reference dummy variables will be created for sports car and family van
created for sports car and family van when we examine the results we will see
when we examine the results we will see the two variables vehicle type sports
the two variables vehicle type sports car and vehicle type family van along
car and vehicle type family van along with
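If you want to do this data preparation yourself outside DATAtab, a minimal sketch with pandas could look like the following; the column names and values are made up for illustration, and sedan is dropped as the reference category:

```python
import pandas as pd

# Hypothetical data mirroring the lecture's vehicle example (values invented)
df = pd.DataFrame({
    "vehicle_type": ["sedan", "sedan", "sports car", "family van", "sports car"],
    "horsepower": [120, 150, 300, 170, 280],
    "fuel_consumption": [6.5, 7.2, 11.8, 8.9, 10.9],
})

# One column per category, then drop the reference category ("sedan"),
# which leaves the required number of dummies: categories minus one.
dummies = pd.get_dummies(df["vehicle_type"], prefix="vehicle_type")
dummies = dummies.drop(columns=["vehicle_type_sedan"])

X = pd.concat([df[["horsepower"]], dummies], axis=1)
print(X)
```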
So far we focused on linear regression, where our goal is to predict a continuous variable like sales or house prices. But what if we need to predict categories instead, like whether a customer will buy a product or not, or whether an email is spam? That's where logistic regression comes into play. So in the next section we'll explore how logistic regression helps us model binary outcomes and make probability-based predictions. Let's jump in.

What is a logistic regression, how is it calculated, and most importantly, how are the results interpreted? Let's start with the first question: what is a regression? In a regression analysis you want to infer or predict an outcome variable based on one or more other variables. Okay, so what about logistic regression? A binary logistic regression is a type of regression analysis used when the outcome variable is binary, meaning it has two possible values, like yes or no, success or failure. Let's look at an example. Let's say we are researchers and we want to know whether a particular medication and a person's age have an influence on whether a person gets a certain disease or not. So the outcome we're interested in is whether the patients developed the disease or did not develop it, and our independent variables are medication and age. Now, with the help of a logistic regression, we want to infer or predict the outcome variable based on the independent variables.

Okay, but what is the difference between a linear and a logistic regression? In a linear regression the dependent variable is a metric variable, like salary or electricity consumption; in a logistic regression the dependent variable is a binary variable. So with the help of logistic regression we can determine what has an influence on whether a certain disease is present or not. For example, we could study the influence of age, gender, and smoking status on that particular disease; in this case one stands for diseased and zero for not diseased. We now want to estimate the probability that a person is diseased. So our data set might look like this: here we have the independent variables and there the dependent variable with zero and one. We could now investigate what influence the independent variables have on the disease, and if there is an influence, we can predict how likely someone is to have the disease.

Okay, but why do we need logistic regression in this case, why can't we just use linear regression? A quick recap: in linear regression this is our regression equation, with the dependent variable, the independent variables, and the regression coefficients. However, our dependent variable is now binary, taking on the value of either zero or one; regardless of the values of the independent variables, the outcome will always be zero or one. A linear regression would now simply put a straight line through the points, and we can see that in the case of linear regression, values between minus and plus infinity can occur. However, the goal of logistic regression is to estimate a probability of occurrence, so the value range for the prediction should be between zero and one. We therefore need a function that only takes values between zero and one, and that is exactly what the logistic function does: no matter where we are on the x-axis between minus and plus infinity, only values between zero and one result, and that is exactly what we want. The equation for the logistic function looks like this. Logistic regression now uses the logistic function; for that, the equation of the linear regression is simply plugged in, and this gives us that equation. This equation gives us the probability of the dependent variable being equal to one given specific values of the independent variables. Hm, what does this look like for our example? In our example the probability of having a certain disease is a function of age, gender, and smoking status.
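For reference, the logistic function is commonly written as p = 1 / (1 + e^(-z)), where z is the linear predictor b0 + b1·x1 + b2·x2 + …. Here is a small sketch of that idea in Python; the coefficient values are placeholders, not the ones from this example:

```python
import numpy as np

def logistic(z):
    # Maps any real number to a value strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder coefficients: intercept, age, gender (1 = male), smoker (1 = yes)
b0, b_age, b_gender, b_smoker = -3.0, 0.04, 0.5, 0.8

def p_disease(age, gender, smoker):
    z = b0 + b_age * age + b_gender * gender + b_smoker * smoker
    return logistic(z)

print(logistic(-100), logistic(0), logistic(100))  # ~0.0, 0.5, ~1.0
print(p_disease(age=55, gender=1, smoker=0))       # a probability between 0 and 1
```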
Next we need to determine the coefficients that help our model best fit the given data. This is done using the maximum likelihood method; for this there are numerical methods that can solve the problem effectively, and a statistics program such as DATAtab therefore calculates the coefficients.
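As a rough illustration of such a maximum likelihood fit in code, here is a sketch using statsmodels' Logit on a simulated data set with the same structure (disease, age, gender, smoker); the data and results are made up and will not reproduce the DATAtab output:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data with the same structure as the lecture's example:
# disease (0/1), age in years, gender (1 = male), smoker (1 = yes)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n),
    "gender": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
})
linear_predictor = -4 + 0.05 * df["age"] + 0.4 * df["gender"] + 0.9 * df["smoker"]
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-linear_predictor)))

X = sm.add_constant(df[["age", "gender", "smoker"]])
result = sm.Logit(df["disease"], X).fit()   # coefficients found by maximum likelihood

print(result.params)          # the coefficients b
print(result.pvalues)         # p values for each coefficient
print(np.exp(result.params))  # odds ratios
```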
All right, let's work through this example of how to calculate a logistic regression and then look at how to interpret the results. To calculate the regression we just go to datatab.net and copy our data into this table; if you like, you can load the sample data using the link in the video description. We want to calculate a logistic regression, so we just click on regression. We choose disease as the dependent variable and age, gender, and smoking status as the independent variables. DATAtab now calculates a logistic regression for us; depending on how our dependent variable is scaled, DATAtab will calculate either a logistic or a linear regression under the tab regression. Since we have two categorical variables, we can set the reference category; we will just use female and nonsmoker as reference. Now we can choose for which category we want to build the regression model, so we can decide if we want to predict whether a person is diseased or not diseased. Instead of diseased and not diseased we could of course also have one and zero.

Okay, before we go into detail about the different results, a little tip: if you don't know how to interpret the results, you can also just click on summary in words. "A logistic regression analysis was performed to examine the influence of age, gender (female), and smoking status (smoker) on the variable disease. To predict the value diseased, logistic regression analysis showed that the model as a whole was significant," and then comes the interpretation of the different independent variables. Further, you can click on AI interpretation at the different tables. We will now carefully go through each table step by step to ensure everything is clear to you.

Let's begin at the top. First we get the result table. Here we can see that a total of 36 people were examined with the help of the regression model; of these 36 persons, 26 could be correctly assigned, that is 72.22%. Next is the classification table. This table shows how often the categories not diseased and diseased were observed and how frequently they were predicted. In total, not diseased was observed 16 times; among these 16 individuals, the regression model correctly classified 11 as not diseased while misclassifying five as diseased. Of the 20 diseased individuals, the regression model misclassified five as not diseased and correctly classified 15 as diseased. But how do we determine whether a person is classified as diseased or not? As mentioned earlier, logistic regression provides the probability of a person being diseased, so we obtain values ranging from 0 to 100%. Now we simply set a threshold of 50%: if a value exceeds 50%, the person is classified as diseased; otherwise they are classified as not diseased. Of course you can choose a threshold other than 50%; to learn more about this, check out our video on the ROC curve.
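A small sketch of how that 50% threshold turns predicted probabilities into a classification table; the observed outcomes and probabilities below are invented and won't match the 36-person example:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and observed outcomes (0 = not diseased, 1 = diseased)
observed = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
pred_prob = np.array([0.20, 0.65, 0.80, 0.40, 0.90, 0.10, 0.55, 0.45, 0.70, 0.30])

# Apply the 50% threshold: above 0.5 -> classified as diseased
predicted = (pred_prob > 0.5).astype(int)

# Classification table: how often each observed class was predicted as each class
print(pd.crosstab(observed, predicted, rownames=["observed"], colnames=["predicted"]))

# Overall share of correctly classified cases
print("accuracy:", (observed == predicted).mean())
```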
So let's have a look at the next table. The chi-square test evaluates whether the model as a whole is statistically significant. For this, two models are compared: in one model all independent variables are used, and in the other model the independent variables are not used. Now we can compare how good the prediction is when the independent variables are used and how good it is when they are not used. The chi-square test tells us if there is a significant difference between these two results. The null hypothesis is that both models are the same; if the p value is less than 0.05, this null hypothesis is rejected. In our example the p value is less than 0.05, and we assume that there is a significant difference between the models. Thus the model as a whole is significant.
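Conceptually this is a likelihood-ratio test: the chi-square value is the difference between the two minus 2 log-likelihood values, with as many degrees of freedom as there are independent variables. A sketch with invented -2LL values (not the ones from this example):

```python
from scipy import stats

# Hypothetical -2 log-likelihood values for the null model (no predictors)
# and the full model (age, gender, smoker)
minus2ll_null = 49.5
minus2ll_full = 38.2
n_predictors = 3

# Likelihood-ratio chi-square: difference of the two -2LL values
chi2 = minus2ll_null - minus2ll_full
p_value = stats.chi2.sf(chi2, df=n_predictors)

print(f"chi-square = {chi2:.2f}, df = {n_predictors}, p = {p_value:.4f}")
```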
Next comes the model summary. In this table we can see on the one hand the minus 2 log-likelihood value, and on the other hand we are given different coefficients of determination, R squared. R squared is used to find out how well the regression model explains the dependent variable. In a linear regression, R squared indicates the proportion of the variance that can be explained by the independent variables; the more variance can be explained, the better the regression model. In a logistic regression, however, its interpretation differs, and multiple methods exist to calculate R squared; unfortunately there's no consensus yet on which method is considered the best. DATAtab gives you the R squared according to Cox and Snell, according to Nagelkerke, and according to McFadden.

And now comes the most important table: the table with the model coefficients. The most important parameters are the coefficient B, the p value, and the odds ratio; we'll now discuss all three columns. In the first column we can read the calculated coefficients from our model. We can insert these into the regression equation, so we get the coefficients for age, gender, smoker, and the constant. For example, for a person who is 55 years old, male, and a nonsmoker, we get a probability of 36%; thus it is 36% likely that a 55-year-old male nonsmoker is diseased. In reality there would certainly be many other and different independent variables.
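To make that plug-in arithmetic concrete, here is the same calculation with placeholder coefficients; the actual fitted values are only visible in the DATAtab table, so the numbers below are illustrative rather than the lecture's:

```python
import math

# Placeholder coefficients (illustrative only): constant, age, gender (1 = male), smoker (1 = yes)
b0, b_age, b_gender, b_smoker = -3.6, 0.05, 0.3, 0.9

# A 55-year-old male nonsmoker
z = b0 + b_age * 55 + b_gender * 1 + b_smoker * 0
p = 1 / (1 + math.exp(-z))
print(round(p, 2))  # probability that this person is diseased
```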
Okay, but what about the p value? The p value shows whether the corresponding coefficient is significantly different from zero; in other words, it tells us if a variable has a real influence or if the result could just be due to chance. If the p value is smaller than 0.05, it means the difference is significant. In our case all p values are greater than 0.05, indicating that none of the variables have a significant influence.

And finally, the odds ratio. But what are odds, and what is the odds ratio? Let's start with the odds. Let's say we have two possible outcomes of something, success and failure, for example whether a therapy is successful or not. Let's say that the probability that the therapy is successful is 0.7, so 70%, and thus the probability of failure is 1 minus 0.7, so 0.3. Okay, but what about the odds? Odds are defined as the ratio of the probability of success and the probability of failure, or in other words, odds represent the ratio of the probability of an event happening to the probability of it not happening. If we look at our example, the odds are 0.7 divided by 0.3, which equals 2.33. This means the event success is 2.33 times more likely to happen than not. So odds give us a measure of the likelihood of an event happening versus it not happening. In this case we've calculated the odds of success; of course we can also calculate the odds of failure.
we can also calculate the odds of failure all right now that we understand
failure all right now that we understand odds let's talk about odds ratios so
odds let's talk about odds ratios so what are odds ratios let's look at the
what are odds ratios let's look at the example from the beginning we're
example from the beginning we're studying a new medication to reduce the
studying a new medication to reduce the risk of a certain disease so we have a
risk of a certain disease so we have a group a patients with medication and the
group a patients with medication and the group b patients without medication
group b patients without medication let's say in group a we calculated a
let's say in group a we calculated a probability of 60% or 0.6 of getting
probability of 60% or 0.6 of getting diseased so the odds of getting diseased
diseased so the odds of getting diseased is
is 0.6 ided 0.4 which is 1.5 five again
0.6 ided 0.4 which is 1.5 five again odds just represent the ratio of the
odds just represent the ratio of the probability of an event happening to the
probability of an event happening to the probability of it not happening in our
probability of it not happening in our case in group a the likelihood of being
case in group a the likelihood of being diseased is 1.5 time higher than the
diseased is 1.5 time higher than the likelihood of not being diseased let's
likelihood of not being diseased let's say in group b where the patients didn't
say in group b where the patients didn't get the medication the probability of
get the medication the probability of getting deceased is 80% or
getting deceased is 80% or 0.8 so the odds in group b of getting
0.8 so the odds in group b of getting diseased are 0.8 divided by 0.2 so four
diseased are 0.8 divided by 0.2 so four therefore in group b the likelihood of
therefore in group b the likelihood of being diseased is four times higher than
being diseased is four times higher than the likelihood of not being diseased
the likelihood of not being diseased what about the odds ratio with the odds
what about the odds ratio with the odds ratio we can now compare the two groups
ratio we can now compare the two groups to do this we can compare the odds of
to do this we can compare the odds of getting the disease in group a relative
getting the disease in group a relative to the odds of getting the disease in
to the odds of getting the disease in group b so the odds ratio is simply
group b so the odds ratio is simply calculated by dividing the odds in group
calculated by dividing the odds in group a by the odds in group b this results in
a by the odds in group b this results in an odds ratio of
an odds ratio of 0.38 the odds ratio of 0.38 means that
0.38 the odds ratio of 0.38 means that the odds of being deceased in group a
the odds of being deceased in group a are
are 0.38 times the odds of being deceased in
0.38 times the odds of being deceased in group b of course we can also switch the
group b of course we can also switch the order then the odds ratio would be the
order then the odds ratio would be the odds in Group B divided by the odds in
odds in Group B divided by the odds in group a in this case the odds ratio of
group a in this case the odds ratio of approximately 2.67 means that the odds
approximately 2.67 means that the odds of being deceased in group b are 2.67
of being deceased in group b are 2.67 times higher than the odds of being
times higher than the odds of being deceased in group a so an odds ratio is
deceased in group a so an odds ratio is simply a comparison of the odds of an
simply a comparison of the odds of an event occurring in two different groups
event occurring in two different groups the odds ratio indicates how much more
the odds ratio indicates how much more likely the event is to occur in one
likely the event is to occur in one group group compared to the other group
group group compared to the other group if the odds ratio is greater than one
if the odds ratio is greater than one the event is more likely to occur in the
the event is more likely to occur in the first group if it is less than one the
first group if it is less than one the event is less likely in the first group
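And the group comparison from this example as code:

```python
# Group A: patients with medication, group B: patients without medication
p_a = 0.6   # probability of getting diseased in group A
p_b = 0.8   # probability of getting diseased in group B

odds_a = p_a / (1 - p_a)   # 1.5
odds_b = p_b / (1 - p_b)   # 4.0

odds_ratio_a_vs_b = odds_a / odds_b   # ≈ 0.38
odds_ratio_b_vs_a = odds_b / odds_a   # ≈ 2.67

print(round(odds_ratio_a_vs_b, 2), round(odds_ratio_b_vs_a, 2))
```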
Okay, now let's put it all together and look at how to interpret the odds ratio in logistic regression. Let's get started. First of all, to calculate a logistic regression we need data. Let's say we have data from 50 patients; our outcome variable is disease, which is coded as zero for not diseased and one for diseased, and we have two independent variables, medication and age. Now we can use this data to calculate a logistic regression. You can find a link to the data set in the video description. In the first column we can see the coefficients that define our model; these coefficients can be entered into the logistic regression formula. Here we can see the coefficients from the table: the constant, the coefficient for medication, and the coefficient for age. Now we just need to enter a value for medication, such as one, indicating the patient received the medication, and a value for age, for example 50. Then we can calculate the probability; in this case the probability of being diseased is 0.55, or 55%. So for a patient who took the medication and is 50 years old, the probability of being diseased is 55%. Of course we can simply use DATAtab to calculate this probability; to do this, just enter one here and 50 there, and we will then also get a probability of 0.55. DATAtab further gives us the odds. As we know, the odds are calculated as the probability that a certain event will happen divided by the probability that the event will not happen; we therefore get 0.55 divided by (1 minus 0.55), which equals 1.22.
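In code, that conversion from probability to odds is just:

```python
# Probability of being diseased for a 50-year-old patient who took the medication
# (value taken from the lecture's example output)
p = 0.55

odds = p / (1 - p)
print(round(odds, 2))  # ≈ 1.22
```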
Okay, but we are not interested in the odds alone; we're interested in the odds ratio. Again, the odds ratio is simply a comparison of the odds of an event occurring in two different groups. The two groups could be persons who took the medication and persons who did not take the medication. Therefore, going back to DATAtab, we just need to compare the odds of a person who took the medication with the odds of a person who did not take the medication. So to get the odds ratio, we divide the odds of a person who took the medication by the odds of a person who did not take the medication. This results in an odds ratio of 0.64, and surprise: the calculated value matches the odds ratio listed for the variable medication. The odds ratio of 0.64 for medication indicates that for individuals who took the medication, the odds of the outcome diseased are 0.64 times the odds of those who did not take the medication.

All right, with medication we have two groups to compare, but what about a continuous variable like age? In this case we simply look at what happens when we increase age by one unit; for example, we might compare the odds of the outcome for someone aged 50 versus someone aged 51. This allows us to calculate the odds ratio by comparing the two odds, and in this case we get an odds ratio of 1.04. So for each one-year increase in age, the odds of the outcome diseased increase by a factor of 1.04. But there's one thing I haven't told you yet: the odds ratio can actually be calculated simply by exponentiating each coefficient. So e to the power of minus 0.45 is 0.64, which is the odds ratio of medication, and e to the power of 0.04 is 1.04, which is the odds ratio for age.
0.04 is 1.04 which is the odds ratio for H to sum it
which is the odds ratio for H to sum it up odds are simply the ratio of the
up odds are simply the ratio of the probability of an event happening to the
probability of an event happening to the probability of it not happening the odds
probability of it not happening the odds ratio is now the ratio of the odds that
ratio is now the ratio of the odds that an event occurs in two different groups
an event occurs in two different groups we just talked about regression where we
we just talked about regression where we predict values based on known
predict values based on known relationships let's now shift our Focus
relationships let's now shift our Focus to discovering hidden patterns in data
to discovering hidden patterns in data with no apparent relationship this
with no apparent relationship this brings us to class analysis specifically
brings us to class analysis specifically the K means clustering technique K means
the K means clustering technique K means clustering is a powerful method used to
clustering is a powerful method used to identify hidden groups or clusters
identify hidden groups or clusters within our data let's explore how that
within our data let's explore how that works and how it can enhance our
works and how it can enhance our understanding of complex data sets in
understanding of complex data sets in this video I will explain to you
this video I will explain to you everything you need to know about clust
everything you need to know about clust analysis I will start with the question
analysis I will start with the question what is the K mean claster and then I
what is the K mean claster and then I will show you how it can easily
will show you how it can easily calculated online with data Tab and now
calculated online with data Tab and now let's start with the question what is
let's start with the question what is the C's clust analysis the K means class
the C's clust analysis the K means class analysis is one of the simplest and most
analysis is one of the simplest and most common methods for class analysis by
common methods for class analysis by using the K means method you can cluster
using the K means method you can cluster your data by a given number of clusters
your data by a given number of clusters so you already need to Define beforehand
so you already need to Define beforehand the number of clusters for example you
the number of clusters for example you have a data set and you want to Cluster
have a data set and you want to Cluster your cases into three clusters this can
your cases into three clusters this can be done with the C's clust analysis for
be done with the C's clust analysis for example you could have a data set with
example you could have a data set with 15 European countries and you want to
15 European countries and you want to Cluster them into three country groups
Cluster them into three country groups so now the question is how does the
so now the question is how does the kin's clust analysis work there are five
kin's clust analysis work there are five simple steps required let's start with
simple steps required let's start with the first step first you have to define
the first step first you have to define the number of clusters to find the
the number of clusters to find the groups or clusters the number of
groups or clusters the number of clusters is the K in K means in our case
clusters is the K in K means in our case we simply select three classs so in this
we simply select three classs so in this example K was selected equal to three
example K was selected equal to three the second step now is to set the
the second step now is to set the cluster centers random each of the
cluster centers random each of the centes now represent
centes now represent one cluster let's come to step three now
one cluster let's come to step three now we have selected the number of clusters
we have selected the number of clusters and we set the claster centers randomly
and we set the claster centers randomly now we assign each element to one
now we assign each element to one claster so for example we assign each
claster so for example we assign each country to one cluster let's start with
country to one cluster let's start with one element and now the distance from
one element and now the distance from the first element to each of the claster
the first element to each of the claster centroids is calculated so for example
centroids is calculated so for example we calc calate the distance from this
we calc calate the distance from this element to each cluster centroid
element to each cluster centroid afterwards each element is assigned to
afterwards each element is assigned to the claster to which it has the smallest
the claster to which it has the smallest distance in our example the distance
distance in our example the distance between this element and the claster
between this element and the claster centroid is the smallest so we will
centroid is the smallest so we will assign this element to the yellow
assign this element to the yellow cluster now this step is repeated for
cluster now this step is repeated for all further elements so at the end we
all further elements so at the end we have one yellow cluster one red cluster
have one yellow cluster one red cluster and one green cluster and then all
and one green cluster and then all points are initially assigned to a
points are initially assigned to a cluster so let's summarize it again we
cluster so let's summarize it again we first have defined the number of
first have defined the number of clusters we have then assigned the
clusters we have then assigned the cluster centroid randomly and we have
cluster centroid randomly and we have assigned each element in step four we
assigned each element in step four we now calculate the center of each cluster
now calculate the center of each cluster so for the green elements for the yellow
so for the green elements for the yellow elements and for the red elements the
elements and for the red elements the center of each clust is calculated these
center of each clust is calculated these centers are the new claster centroids
centers are the new claster centroids now this means that we simply shift the
now this means that we simply shift the centroids into the cluster Center so the
centroids into the cluster Center so the cluster centroids are moved to the
cluster centroids are moved to the cluster centers now in Step number five
cluster centers now in Step number five we assign the elements to the new
we assign the elements to the new clusters since the centroids now can be
clusters since the centroids now can be located at a different element each
located at a different element each element is assigned to the claster that
element is assigned to the claster that is closest to it
is closest to it now we have finished all steps and from
now we have finished all steps and from now on step four and five are repeated
now on step four and five are repeated until the claster solution does not
until the claster solution does not change anymore then the classing
change anymore then the classing procedure is over one disadvantage of
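To make the five steps tangible, here is a bare-bones sketch of the algorithm in NumPy; the data are invented, and a real analysis would normally use a library implementation or DATAtab:

```python
import numpy as np

def kmeans(points, k=3, n_iter=100, seed=0):
    """Minimal sketch of the five steps described above (NumPy only)."""
    rng = np.random.default_rng(seed)

    # Steps 1 and 2: choose k and set the cluster centers randomly
    # (here: k randomly picked data points serve as the initial centroids)
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Steps 3 / 5: assign each element to the cluster with the smallest distance
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: move each centroid to the center (mean) of its cluster
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Repeat steps 4 and 5 until the cluster solution no longer changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Hypothetical 2D data, e.g. two standardized features per case
rng_data = np.random.default_rng(1)
data = np.vstack([rng_data.normal(loc=c, scale=0.5, size=(20, 2))
                  for c in ([0, 0], [4, 0], [2, 3])])
labels, centroids = kmeans(data, k=3)
print(centroids)
```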
One disadvantage of the k-means method is that the final result depends very much on which initial clusters were used. To take this into account, the whole procedure is carried out several times, and different randomly chosen starting points are used for each of the calculations. Each time we use different starting points, the outcome could be different, so we run the whole cluster analysis several times in order to get the best possible result. If you use DATAtab to calculate the cluster analysis, the analysis is for example done 10 times with 10 different randomly chosen starting points, and at the end the best cluster solution is chosen.

So the next question is: what is the optimal number of clusters? With each new cluster, the summed distance within the clusters gets smaller and smaller. If we look at this picture, where we have two clusters, and at that picture, where we have three clusters, the three clusters certainly fit the data better than the two clusters: the distance between the elements and the cluster centers is higher in the two-cluster case than in the three-cluster case. So the question now is how many clusters should be used. In order to answer this question we use the elbow method. With each additional cluster, the summed distance between the elements and the cluster centers becomes smaller and smaller; however, there is a cluster number beyond which each additional cluster reduces the summed distance only slightly. This point is used as the number of clusters. So if we look at this plot, we can see that there is a big gap between cluster numbers one and two, and also a big gap between cluster numbers two and three, but only a small gap between cluster numbers three and four. In this case we therefore select three clusters.
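As a sketch, the elbow curve can be produced with scikit-learn's KMeans, which also performs the repeated random starts (n_init) mentioned above; the data below are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two features per person (e.g. standardized salary and age)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                  for c in ([0, 0], [3, 0], [1.5, 2.5])])

# For each candidate k, fit k-means (10 random restarts, best solution kept)
# and record the summed within-cluster distance (inertia)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 1))

# Plotting inertia against k gives the elbow plot: pick the k after which
# each additional cluster reduces the summed distance only slightly.
```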
Now I'd like to show you how you can easily calculate a k-means cluster analysis online with DATAtab. To do this, please visit datatab.net and click on the statistics calculator. In order to calculate a cluster analysis you simply choose the tab cluster. If you want to use your own data, you can clear the table and copy your own data into this table; I will simply use the example data now. So I want to calculate a k-means cluster analysis; let's say we want to cluster a group of people by salary and age. First we can define the number of clusters, and we enter the number of clusters. DATAtab will now calculate everything for you, and you get your results right away.

Here you can see the three clusters with their centroids, and there you can see the elbow method; in this case the results indicate that we should use a solution with two clusters. We selected three clusters before, so we can change this in order to get the most suitable number of clusters. But let's go through the results now step by step. Here we can see how many elements are assigned to the different clusters, and here we get the plot with the different clusters: we can see one cluster here, one cluster there, and another cluster here. Further, we get a table where each element is allocated to a cluster. If we now choose two as the number of clusters, we get new results, and we can see them here: in this plot we have one cluster here and one cluster there. Moreover, we can see that the two clusters fit the data quite well.
There's one more important concept we need to discuss: confidence intervals. In the next video we'll break down what confidence intervals are, why they matter in statistical analysis, and how to interpret them properly. Let's dive in. In this video we'll uncover the true definition of a confidence interval, clear up some common misconceptions, and explain the difference between the incorrect and the correct interpretation. So let's get started.

First of all, why do we need confidence intervals? In statistics, parameters of the population are often estimated based on a sample. So on the one hand you have the population, but since in most cases you cannot survey the entire population, you draw a sample. Now we want to use this sample to estimate a parameter of the population; parameters that can be estimated are, for example, the mean or the variance. Let's look at an example: you want to know the height of all professional basketball players in the US. In order to figure this out, you draw a sample. The mean of the sample is most likely different from the mean of the population. Let's assume that we draw not just one but several samples, which of course you don't actually do in practice. Each sample is likely to show a different mean: in the first sample we have one mean, in the second sample we most likely have another mean, and again in another sample we have yet another mean. Of course it is also possible that, purely by chance, two or more samples have means that are exactly the same, but this is very unlikely. Now, it would be extremely valuable to have a range that we expect to capture the true parameter with a certain level of confidence, and this is precisely where the misconception about confidence intervals comes in. In fact, published studies have shown that scientists frequently misinterpret confidence intervals.

Let's dive in and break down exactly what a confidence interval means and, just as importantly, what it does not mean. There are two common ways to explain the confidence interval. On the one hand, there is a simpler explanation, but it's not correct when viewed from a frequentist statistics perspective. On the other hand, there's a slightly more complex explanation that is actually true. To make the difference clear, we'll start with the simple but wrong interpretation, then explain why it falls short, and finally arrive at a clearer understanding of the correct interpretation. To keep things simple, let's focus on the 95% confidence interval, but the same goes for the others, of course.
course so let's address the simple but incorrect interpretation this
incorrect interpretation this interpretation goes like this there is a
interpretation goes like this there is a 95% chance that the true parameter lies
95% chance that the true parameter lies within a calculated confidence interval
within a calculated confidence interval so what does this actually mean imagine
so what does this actually mean imagine we have a population with a true mean
we have a population with a true mean value this true mean value is the one we
value this true mean value is the one we want to estimate although we don't know
want to estimate although we don't know this true mean we can make an educated
this true mean we can make an educated guess by taking a sample from the
guess by taking a sample from the population from this sample we calculate
population from this sample we calculate both the sample mean and the 95%
both the sample mean and the 95% confidence interval the simplified
confidence interval the simplified interpretation is to say the confidence
interpretation is to say the confidence interval provides a range within which
interval provides a range within which the true mean lies with a certain
the true mean lies with a certain probability or in case of the 95%
probability or in case of the 95% confidence interval we would say there
confidence interval we would say there is a 95% chance that the True Value
is a 95% chance that the True Value Falls within this interval however this
Falls within this interval however this interpretation isn't accurate but why in
interpretation isn't accurate but why in frequent statistics the true parameter
frequent statistics the true parameter in our case the true mean is treated as
in our case the true mean is treated as a fixed but unknown quantity so the true
a fixed but unknown quantity so the true parameter does not move around it is
parameter does not move around it is fixed if we now draw a sample and
fixed if we now draw a sample and calculate the confidence interval the
calculate the confidence interval the True Value either lies inside the
True Value either lies inside the interval or it doesn't in this case the
interval or it doesn't in this case the confidence interval contains the True
confidence interval contains the True Value therefore there's no probability
Value therefore there's no probability associated with the parameter being
associated with the parameter being within this specific interval but why
within this specific interval but why because probabilities in frequentist
because probabilities in frequentist terms only apply to events that are
terms only apply to events that are subject to variability and again the
subject to variability and again the true parameter is fixed and cannot
change. The only thing that varies is the sample data we collect: every time we draw a new sample we have new data and consequently a new mean and confidence interval. So, for example, in this sample the true value falls within the confidence interval; if we take a second sample, maybe the confidence interval will not include the true value. Therefore there is no probability associated with the parameter being within this specific interval. But why? The reason is that probabilities in frequentist terms only apply to events that are subject to variability, and again, the true parameter is fixed and cannot change. Therefore you cannot assign a probability to the true parameter being in a given interval: the parameter is either inside the interval or it's not. The only thing that varies is the sample data we collect; every time we draw a new sample we have new data and consequently a new mean and confidence interval. So, for example, in all these samples the true value falls within the confidence interval, while in those samples it doesn't. In summary, you cannot say that there is a 95% chance that this interval contains the true parameter, because once the interval is calculated it either contains the parameter or it doesn't, and there is no probability left to assign in the frequentist sense.

But what is the correct interpretation? Let's say we took a lot of random samples and we calculated the mean value and the confidence interval of each sample. The confidence interval can now be interpreted in the following way: if we were to take an extremely large number of random samples and construct a confidence interval for each sample, 95% of those intervals would contain the true value while 5% would not. In other words, if we were to take 100 random samples, we would expect that on average 95 of the confidence intervals would contain the true value while five would not.
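To see this long-run behavior in practice, here is a minimal simulation sketch in Python; the population mean and standard deviation, the sample size, and the number of repetitions are assumed values chosen only for illustration.

```python
import numpy as np

# assumed "true" population values and simulation settings (illustrative only)
rng = np.random.default_rng(42)
true_mean, true_sd = 100.0, 15.0
n, n_repeats, z = 30, 10_000, 1.96   # z value for a 95% confidence level

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=n)
    x_bar = sample.mean()
    s = sample.std(ddof=1)               # sample standard deviation
    half_width = z * s / np.sqrt(n)      # CI = x_bar ± z * s / sqrt(n)
    if x_bar - half_width <= true_mean <= x_bar + half_width:
        covered += 1

print(f"share of intervals containing the true mean: {covered / n_repeats:.3f}")
# usually close to 0.95 (slightly below, since z rather than the t value is used)
```

Running this repeatedly with different seeds always gives a coverage rate near 95%, even though each individual interval either contains the true mean or it does not.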
You can also see it the other way around: the confidence interval can be defined in terms of probability with respect to a single theoretical sample that has yet to be realized. Therefore, if you haven't drawn the sample yet, you can be 95% sure that the interval from the next sample you draw will contain the true value; but if you have taken the sample, the true value is either in the interval or not. Therefore, confidence is about the method, not the specific interval. The 95% confidence refers to the long-run reliability of the method you use to construct the interval: it means that if you use this method repeatedly on different samples, you expect to capture the true parameter 95% of the time. But once you've applied it and obtained a specific interval, you then cannot make a probability statement about whether this interval contains the fixed true parameter or not.
A side note: in statistics there are two distinct approaches, or frameworks, the frequentist and the Bayesian. The confidence interval is a method used in the frequentist approach. In a Bayesian approach we would treat the parameter as a random variable with its own probability distribution reflecting our uncertainty about it. In that framework it would make sense to say that, given our data, there is a certain probability that the parameter falls within a certain range; but compared to the frequentist interpretation this is a fundamentally different way of thinking. In the Bayesian approach there is a concept known as the credible interval, which serves as the counterpart to the confidence interval in frequentist statistics. Unfortunately, there are also critics of the Bayesian way. In short, Bayesian statistics requires the use of a so-called prior distribution, and the main criticism is that credible intervals may not be entirely objective, as they are influenced by the choice of the prior distribution. This makes the results potentially sensitive to subjective inputs. However, this same feature can also be seen as a strength, as it allows for incorporating prior knowledge into the analysis in a principled way.
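As a rough illustration of what a credible interval looks like in code, here is a minimal sketch using a conjugate normal prior for the mean, with the data variance assumed known; the prior mean, prior standard deviation, and sample values are made-up assumptions for the example only.

```python
import numpy as np
from scipy import stats

# made-up sample and an assumed known data variance (illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=30)
sigma2 = 15.0 ** 2

# prior belief about the mean: normal with mean 90 and standard deviation 20
prior_mean, prior_var = 90.0, 20.0 ** 2

# conjugate update: with a normal prior and normal data, the posterior of the
# mean is again normal
n = len(data)
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma2)

# 95% credible interval: the central 95% of the posterior distribution
lower, upper = stats.norm.ppf([0.025, 0.975], loc=post_mean, scale=np.sqrt(post_var))
print(f"95% credible interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

Notice that the prior values directly influence the resulting interval, which is exactly the subjectivity the critics point to, and the prior knowledge the proponents value.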
Okay, but now to the easiest part: how is the confidence interval for the mean calculated? If your data are normally distributed, the confidence interval for the mean can be calculated with this formula: CI = x̄ ± z * s / √n. Here x̄ is the mean, z is the z value for the respective confidence level, n is the sample size, and s is the standard deviation. The plus/minus results from the fact that we get the upper limit with plus and the lower limit with minus. Where do we obtain the z value? The z value for a given confidence level can be found in a standard normal distribution table, which lists z values corresponding to the different confidence levels. For example, at a 95% confidence level the z value is 1.96. Using this, the confidence interval can be expressed as the sample mean plus or minus 1.96 times the standard deviation divided by the square root of the sample size. The confidence interval can of course be calculated for many statistical parameters, not only for the mean value.
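As a small worked example, here is a minimal sketch that applies CI = x̄ ± z * s / √n to a made-up sample at the 95% confidence level; the data values are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# made-up sample values (illustrative only)
data = np.array([102, 98, 110, 95, 101, 99, 104, 97, 103, 100], dtype=float)

x_bar = data.mean()
s = data.std(ddof=1)          # sample standard deviation
n = len(data)
z = stats.norm.ppf(0.975)     # ≈ 1.96, the z value for a 95% confidence level

margin = z * s / np.sqrt(n)
print(f"95% CI for the mean: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")
```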
Thanks for watching, and I hope you enjoyed the video. Bye!