Comments and opinions on NHST
What does it mean to be significant?
If the p-value is small, the effect is said to be statistically significant (hence the term “significance level”). But what is small? What should we choose for the significance level? By choosing a significance level, we draw a line in the sand. On one side of the line is a rejected hypothesis, and on the other an accepted one. I generally discourage using a bright-line p-value to deem a result statistically significant or not. You computed the p-value; it has a specific meaning; you should report it. I do not see a need to convert a computed value, the p-value, into a Boolean True/False on whether or not we attach the word “significant” to the result. That is, I do not see a need to make a decision, at least not when doing scientific inquiry.
Do we even want to calculate a p-value?
The question the NHST and its result, the p-value, address is rarely the question we want to
ask. For example, say we are doing a test of the null hypothesis that
two sets of measurements have the same mean. In most cases, what we really want is knowledge about the underlying generative distributions of the control and test data. Which of the following two questions better serves that goal?
How different are the means of the two samples?
Would we say there is a statistically significant difference of the
means of the two samples? Or, more precisely, what is the probability
of observing a difference in means of the two samples at least as
large as the observed difference in means, if the two samples in fact
have the same mean?
The second question is convoluted and often of little scientific
interest. I would say that the first question is much more relevant. To
put it in perspective, say we made trillions of measurements of two
different samples and their means differ by one part per million. This
difference, though tiny, would still give a low p-value, and therefore
often be deemed “statistically significant.” But, ultimately, it is the
size of the difference, or the effect size, we care about.
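To make this concrete, here is a minimal simulation sketch. The numbers are made up for illustration (a million measurements per sample and a true difference of one percent of a standard deviation, rather than trillions of measurements and parts per million), but the point is the same: a tiny effect yields a tiny p-value once the sample size is large enough.

import numpy as np
import scipy.stats

# Hypothetical illustration: two huge samples whose true means differ
# by only 1% of a standard deviation.
rng = np.random.default_rng(3252)
n = 1_000_000
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=0.01, scale=1.0, size=n)

# The effect size (difference of means) is tiny...
print("difference of means:", np.mean(x2) - np.mean(x1))

# ...but the p-value from a two-sample t-test is minuscule, so the
# difference would be deemed "statistically significant."
print("p-value:", scipy.stats.ttest_ind(x1, x2).pvalue)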
What is with all those names?
You have no doubt heard of many named hypothesis tests, like
the Student-t test, Welch’s t-test, the Mann-Whitney U-test, and
countless others. What is with all of those names? It helps to think
more generally about how frequentist hypothesis testing is usually done.
Unfortunately, to do a hypothesis test, people typically do not use the resampling-based approach I have laid out here, but instead use a named test. Here is an example of how that goes for a Student-t test (borrowing heavily from the treatment in Phil Gregory’s book).
1. Choose a null hypothesis. This is the hypothesis you want to test the
truth of.
2. Choose a suitable test statistic that can be computed from
measurements and has a predictable distribution. For the example of
two sets of repeated measurements where the null hypothesis is that
they come from identical Normal distributions, we can choose as
our statistic
\[\begin{split}\begin{aligned}
T &= \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{S_D\sqrt{n_1^{-1}+n_2^{-1}}},\nonumber \\[1em]
\text{where }S_D^2 &= \frac{(n_1 - 1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}, \nonumber \\[1em]
\text{with } S_1^2 &= \frac{1}{n_1-1}\sum_{i\in \text{set 1}}(x_{1,i} - \bar{x}_1)^2, \nonumber \\[1em]
S_2^2 &= \frac{1}{n_2-1}\sum_{i\in \text{set 2}}(x_{2,i} - \bar{x}_2)^2.
\end{aligned}\end{split}\]
The T statistic is the difference between the difference of the observed means and the difference of the true means, scaled by the spread in the data. This is a reasonable statistic for determining something about means from data.
This is the appropriate statistic when \(\sigma_1\) and
\(\sigma_2\) are both unknown but assumed to be equal. (When they
are assumed to be unequal, you need to adjust the statistic you use.
This test is called Welch’s t-test.) It can be derived that this
statistic has the Student-t distribution,
\[\begin{split}\begin{aligned}
P(t) &= \frac{1}{\sqrt{\pi \nu}} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}
\left(1 + \left(\frac{t^2}{\nu}\right)\right)^{-\frac{\nu + 1}{2}},\\[1em]
\text{where } \nu &= n_1+n_2-2.
\end{aligned}\end{split}\]
3. Evaluate the test statistic from measured data. In the case of the
Student-t test, we compute \(T\).
4. Plot \(P(t)\). The area under the curve where \(t > T\) (or, for a two-sided test, where \(|t| > |T|\)) is the p-value: the probability of observing a test statistic at least as extreme as the one computed from the data, under the null hypothesis. Reject the null hypothesis if this is small.
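As a sketch of how this prescription plays out in code, here is a direct computation of \(T\) and the p-value from the formulas above, checked against scipy.stats.ttest_ind. The data sets are made up for illustration, and I compute a two-sided p-value (both tails count as at least as extreme), which is what scipy.stats.ttest_ind reports.

import numpy as np
import scipy.stats

# Made-up data sets for illustration
rng = np.random.default_rng(3252)
x1 = rng.normal(loc=10.0, scale=1.0, size=25)
x2 = rng.normal(loc=10.5, scale=1.0, size=30)
n1, n2 = len(x1), len(x2)

# Sample variances S1² and S2² (ddof=1 gives the n-1 denominator)
S1_sq = np.var(x1, ddof=1)
S2_sq = np.var(x2, ddof=1)

# Pooled variance S_D²
SD_sq = ((n1 - 1) * S1_sq + (n2 - 1) * S2_sq) / (n1 + n2 - 2)

# Test statistic T under the null hypothesis (μ1 = μ2)
T = (np.mean(x1) - np.mean(x2)) / np.sqrt(SD_sq * (1 / n1 + 1 / n2))

# Two-sided p-value: area under the Student-t distribution with
# ν = n1 + n2 - 2 degrees of freedom that is at least as extreme as T
nu = n1 + n2 - 2
p_value = 2 * scipy.stats.t.sf(np.abs(T), nu)
print("T:", T, " p-value:", p_value)

# Check against the named test in scipy.stats
print(scipy.stats.ttest_ind(x1, x2))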
As you can see from the above prescription, item 2 can be tricky. Coming up with test statistics that also have a distribution that we can write down is difficult. When such a test statistic is found, the test usually gets a name. The main reason for doing things this way is that most hypothesis tests were developed before computers, so we couldn’t just bootstrap our way through hypothesis tests. (The bootstrap was invented by Brad Efron in 1979.) Conversely, in the approach we have taken, sometimes referred to as “hacker stats,” we can invent any test statistic we want, and we can test it by numerically “repeating” the experiment, in accordance with the frequentist interpretation of probability.
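For contrast, here is a minimal sketch of the resampling-based approach for the null hypothesis that two samples come from identical distributions. The test statistic is one we chose ourselves (the difference of means), and we numerically “repeat” the experiment under the null hypothesis by scrambling the pooled data. The data sets are again made up for illustration.

import numpy as np

# Made-up data sets for illustration
rng = np.random.default_rng(3252)
x1 = rng.normal(loc=10.0, scale=1.0, size=25)
x2 = rng.normal(loc=10.5, scale=1.0, size=30)

def diff_of_means(a, b):
    """Our chosen test statistic: difference of sample means."""
    return np.mean(a) - np.mean(b)

# Observed value of the test statistic
obs = diff_of_means(x1, x2)

# Numerically "repeat" the experiment under the null hypothesis by
# scrambling the pooled data and splitting it back into two samples.
pooled = np.concatenate((x1, x2))
n_reps = 100_000
perm_stats = np.empty(n_reps)
for i in range(n_reps):
    scrambled = rng.permutation(pooled)
    perm_stats[i] = diff_of_means(scrambled[:len(x1)], scrambled[len(x1):])

# p-value: fraction of "repeated experiments" with a test statistic at
# least as extreme (in magnitude) as the one we observed
p_value = np.mean(np.abs(perm_stats) >= np.abs(obs))
print("observed difference of means:", obs, " p-value:", p_value)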
So, I would encourage you not to get caught up in names. If someone
reports a p-value with a name, simply look up the things you need to
define the p-value (the hypothesis being tested, the test statistic,
and what it means to be as extreme), and that will give you an
understanding of what is going on with the test.
That said, many of the tests with names have analytical forms and can be rapidly computed. Most are included in the scipy.stats module. I have chosen to present a method of hypothesis testing that is intuitive, with the frequentist interpretation of probability front and center. It also allows you to design your own tests that fit a null hypothesis you are interested in that might not be “off-the-shelf.” Nonetheless, implementations of the named tests are almost always much faster to compute than resampling methods, so if you have large data sets, or if computational speed is a consideration for any other reason, you might want to use the functions in scipy.stats.
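For reference, the named tests mentioned in this section are one-liners in scipy.stats (shown here on the same kind of made-up data as in the sketches above):

import numpy as np
import scipy.stats

# Made-up data sets for illustration
rng = np.random.default_rng(3252)
x1 = rng.normal(loc=10.0, scale=1.0, size=25)
x2 = rng.normal(loc=10.5, scale=1.0, size=30)

# Student-t test (variances unknown but assumed equal)
print(scipy.stats.ttest_ind(x1, x2))

# Welch's t-test (variances unknown and not assumed equal)
print(scipy.stats.ttest_ind(x1, x2, equal_var=False))

# Mann-Whitney U test
print(scipy.stats.mannwhitneyu(x1, x2))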
Warnings about hypothesis tests
There are many. I will discuss them more when I talk about “statistical
pitfalls” later in the course.
1. An effect being statistically significant does not mean the effect is
significant in practice or even important. It only means exactly what
it is defined to mean: an effect is unlikely to have happened by
chance under the null hypothesis. Far more important is the effect
size.
2. The p-value is not the probability that the null hypothesis is
true. It is the probability of observing the test statistic being at
least as extreme as what was measured if the null hypothesis is true.
I.e., if \(H_0\) is the null hypothesis,
\[\begin{aligned}
\text{p-value} = P(\text{test stat at least as extreme as observed}\mid H_0).
\end{aligned}\]
It is not the probability that the null hypothesis is true given
that the test statistic was at least as extreme as the data.
\[\begin{aligned}
\text{p-value} \ne P(H_0\mid\text{test stat at least as extreme as observed}).
\end{aligned}\]
We often actually want the probability that the null hypothesis is true, and the p-value is often erroneously interpreted as this (even though it does not even make sense under the frequentist interpretation of probability), to great peril.
3. Null hypothesis significance testing does not say anything about
alternative hypotheses. Rejection of the null hypothesis does not
mean acceptance of any other hypotheses.
4. P-values are not very reproducible, as we will see when we do the “dance of the p-values.” (A small simulation preview appears after this list.)
5. Rejecting a null hypothesis is also kind of odd, considering you
computed
\[\begin{aligned}
P(\text{test stat at least as extreme as observed}\mid H_0).
\end{aligned}\]
This does not really describe the probability that the hypothesis is
true. This, along with point 4, means that the p-value better be
really low for you to reject the null hypothesis.
6. Throughout the literature, you will see null hypothesis testing when
the null hypothesis is not relevant at all. People compute p-values
because that’s what they think they are supposed to do. Again, it gets to the
point that effect size is waaaaay more important than a null
hypothesis significance test.
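As a preview of point 4, here is a minimal simulation sketch of the “dance of the p-values.” The experimental parameters are made up (15 measurements per sample and a true difference of half a standard deviation); the point is that the very same experiment, repeated, produces wildly different p-values.

import numpy as np
import scipy.stats

# Made-up experiment: n = 15 measurements per sample, true difference
# of half a standard deviation between the two distributions
rng = np.random.default_rng(3252)
n = 15

# Repeat the identical experiment 20 times and record the p-values
p_values = np.array([
    scipy.stats.ttest_ind(
        rng.normal(loc=0.0, scale=1.0, size=n),
        rng.normal(loc=0.5, scale=1.0, size=n),
    ).pvalue
    for _ in range(20)
])

# The p-values "dance" over orders of magnitude
print(np.sort(p_values))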
Given all these problems with p-values, and based on my experience, I would advocate for their abandonment. I am not the only one. They seldom answer the question scientists are asking, and they lead to great confusion.
That said, I know they are widely and effectively used in contexts I am less familiar with. It’s foolish to make strong, blanket statements (like I just did), so consider this a tempering of my advocacy for abandonment.