Test statistics
In the (not so distant) past, fast computer simulations were not accessible.
Hypothesis tests had to be based on precomputed tables of p-values for known distributions. But not every situation corresponds directly to a random variable that is easy to describe.
Because some situations are commonly encountered, statisticians developed test statistics, which:
- Are calculated directly from the observed quantities (i.e. they are statistics).
- Under certain assumptions, can be assumed to follow a specified probability distribution.
- Form the basis of a named standard hypothesis test.
As long as the situation correctly fits the assumptions of the test, we can follow the hypothesis testing procedure previously described.
The p-value is now based on the test statistic (and its probability distribution under $H_0$), not the directly observed value.
Student's t-test
The independent samples t-test is used to test for significant differences in the mean values of a variable that has been measured for two sub-populations.
The test statistic is
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$
where
$$ s_p = \sqrt{\frac{(n_1 - 1)s_{1}^2 + (n_2 - 1)s_{2}^2}{n_1 + n_2-2}} $$
is the pooled standard deviation of the two samples.
The assumptions for this test are:
- Each sample is independent and identically distributed.
- The sample means $\bar{x}_1$ and $\bar{x}_2$ should behave as if drawn from normal distributions. This is always true if the variables $X_{1}$ and $X_{2}$ are themselves normally distributed, but is also true for other distributions as long as the sample sizes are large enough (via the central limit theorem).
- The two distributions have the same variance. (There is another version of the t-test that does not require this assumption, known as Welch's t-test).
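As a brief sketch with made-up data (the variable names and values here are illustrative, not from the penguin dataset), `scipy.stats.ttest_ind` switches between the two versions of the test via its `equal_var` parameter:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
a = rng.normal(loc=10, scale=1, size=40)   # small variance
b = rng.normal(loc=10, scale=5, size=40)   # large variance

# equal_var=True: classic Student's t-test (pooled variance, as above)
student = ttest_ind(a, b, equal_var=True)

# equal_var=False: Welch's t-test (does not assume equal variances)
welch = ttest_ind(a, b, equal_var=False)

print("Student:", student.pvalue)
print("Welch:  ", welch.pvalue)
```

When the equal-variance assumption is doubtful, Welch's version is generally the safer default.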
IF
the assumptions of the test are satisfied
AND
the null hypothesis $H_{0}$ is true
THEN
the t statistic behaves as if it is sampled from a Student's t-distribution.
This distribution has a single parameter known as the degrees of freedom, which here equals $n_{1} + n_{2} - 2$.
Student's t-test: example
Do the Adelie populations on Dream and Torgersen differ significantly in their body mass?
$H_{0}$: There is no difference between the mean body mass of the two sub-populations.
$\alpha = 0.05$
Calculate the t-statistic using the observed data.
Adelie penguins on Dream have body mass statistics
- $\bar{x}_1$ = 3688 g
- $s_1$ = 455.1 g
- $n_1$ = 56
Adelie penguins on Torgersen have body mass statistics
- $\bar{x}_2$ = 3706 g
- $s_2$ = 445.1 g
- $n_2$ = 51
$t$ = -0.20624
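These summary statistics are enough to compute $t$ by hand; because the values above are rounded, the final digits differ slightly from the value obtained from the raw data:

```python
import math

# Summary statistics (rounded) for Adelie body mass
x1, s1, n1 = 3688, 455.1, 56   # Dream
x2, s2, n2 = 3706, 445.1, 51   # Torgersen

# pooled standard deviation
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

t = (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2))
print("t =", t)
```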
The t-distribution of interest has degrees of freedom $n_1 + n_2 - 2$ = 105
What does "as extreme or more extreme" mean in this case?
Clearly, values of $t$ less than -0.20624 are "more extreme".
But also, the t statistic is based on the difference between the two means, which could be positive or negative.
Because we did not a priori have a belief about which population would have bigger body mass, we should also consider the values $t$ > +0.20624 as possible results that are "more extreme" than the observed difference.
This situation is known as a two-tailed test, where deviations of the test statistic in either direction (both the upper and lower tails) are considered relevant to our original question.
In a two-tailed test, we double the p-value associated with the tail that contains our observed value - in this case, the lower tail.
This convention (doubling the smaller tail probability) also works with test statistics having asymmetrical probability distributions.
pval = stats.t.cdf(t_obs, 105) * 2
print("p =", pval)
p = 0.8370013888792112
So $p > \alpha$ and we do not reject the null hypothesis.
In practice, we can perform hypothesis tests much more easily using Python functions that accept our data and perform all of the necessary calculations.
from scipy.stats import ttest_ind
result = ttest_ind(adelie_dream['body_mass_g'],
                   adelie_torgersen['body_mass_g'],
                   nan_policy='omit',
                   alternative="two-sided")
print("observed t =", result.statistic)
print("p-value =", result.pvalue)
observed t = -0.20624183279962707
p-value = 0.8370013888792112
Chi-square test
As another example of a hypothesis test, we will look at the chi-square test for independence.
This test is used to assess whether two categorical variables are independent of each other, by comparing observed frequencies to their expectations under $H_0$ (i.e. independence of the two variables).
The test statistic is
$$ X^2=\sum^k_{i=1}{\frac{(x_i-m_i)^2}{m_i}} $$
where
$x_i$ are the observed frequencies for each of $k$ outcomes
$m_i$ are the expected frequencies under $H_0$
$m_i = n p_i$, where $n$ is the total number of observations and $p_i$ is the probability of outcome $i$ under $H_0$.
Under the null hypothesis, $X^2$ behaves as if sampled from a $\chi^2$ distribution.
The degrees of freedom for this distribution equal the number of independent frequencies.
Chi-square test: example
Is the sex ratio of all three penguin species the same?
$H_0$: sex is independent of species
$\alpha$ = 0.05
observed frequencies:
| species | FEMALE | MALE |
|---|---|---|
| Adelie | 73 | 73 |
| Chinstrap | 34 | 34 |
| Gentoo | 58 | 61 |
expected frequencies:
| species | FEMALE | MALE |
|---|---|---|
| Adelie | 72.342342 | 73.657658 |
| Chinstrap | 33.693694 | 34.306306 |
| Gentoo | 58.963964 | 60.036036 |
chisq_obs = 0.04860717014078318
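This value can be reproduced directly from the tables above: the expected frequencies under independence are the products of the row and column totals divided by the grand total.

```python
import numpy as np

# observed counts (rows: Adelie, Chinstrap, Gentoo; cols: FEMALE, MALE)
observed = np.array([[73, 73],
                     [34, 34],
                     [58, 61]])

# expected counts under independence: outer product of margins / total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chisq_obs = ((observed - expected)**2 / expected).sum()
print("X^2 =", chisq_obs)
```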
Degrees of freedom for a contingency table = (number of columns - 1) * (number of rows - 1) = (2 - 1) * (3 - 1) = 2 in this case.
"As extreme or more extreme" in this case refers only to the upper tail.
pval = 1 - stats.chi2.cdf(chisq_obs, 2)
print("pval =", pval)
pval = 0.9759893689765846
Clearly p > $\alpha$ and we do not reject the null hypothesis.
Alternatively, using the appropriate scipy.stats function:
pval = stats.chi2_contingency(pd.crosstab(penguins['species'], columns=penguins['sex'])).pvalue
print("p =", pval)
p = 0.9759893689765846
Correction for multiple testing
p-values make sense as a way to assess evidence for individual hypothesis tests.
But we encounter a problem when we start to think about a series of multiple related tests.
Suppose that we want to test every human chromosome for overrepresentation of genes associated with a particular biological process.
We will use significance level $\alpha$ = 0.05 as usual.
   chromosome  gene_count  spermatogenesis
0           1       18332               38
1           2       13869               15
2           3       10564               25
3           4        8217               15
4           5        9534               13
5           6       10600               21
6           7       10050               14
7           8        7922               12
8           9        8235               17
9          10        8735                4
10         11       10030               24
11         12        9163               14
12         13        4792                7
13         14        6480               16
14         15        6660               14
15         16        7840               10
16         17        9604               23
17         18        3661                3
18         19        8851               27
19         20        5407               12
20         21        2888                4
21         22        5012               10
22          X        5693               16
23          Y        1626               10
24         MT         110                0
We can calculate a p-value for each chromosome, in this case using the hypergeometric distribution.
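As a sketch for a single chromosome: the p-value for the Y chromosome is the probability of drawing at least 10 spermatogenesis genes in a sample of 1626 genes, with the totals taken from the table above.

```python
from scipy import stats

# per-chromosome counts, copied from the table above
gene_count = [18332, 13869, 10564, 8217, 9534, 10600, 10050, 7922, 8235,
              8735, 10030, 9163, 4792, 6480, 6660, 7840, 9604, 3661,
              8851, 5407, 2888, 5012, 5693, 1626, 110]
sperm = [38, 15, 25, 15, 13, 21, 14, 12, 17, 4, 24, 14, 7, 16, 14,
         10, 23, 3, 27, 12, 4, 10, 16, 10, 0]

N = sum(gene_count)   # total genes in the genome
K = sum(sperm)        # total spermatogenesis genes

# P(X >= 10) for the Y chromosome (1626 genes, 10 observed):
# the hypergeometric survival function evaluated at 10 - 1
pval = stats.hypergeom.sf(10 - 1, N, K, 1626)
print("p =", pval)
```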
chromosome
1     0.284597
2     0.994084
3     0.141137
4     0.580636
5     0.910373
6     0.431870
7     0.902822
8     0.812496
9     0.379510
10    0.999949
11    0.135521
12    0.818817
13    0.796613
14    0.164008
15    0.371597
16    0.924934
17    0.140757
18    0.968637
19    0.009924
20    0.319097
21    0.791726
22    0.467197
X     0.073705
Y     0.001163
MT    1.000000
dtype: float64
The problem comes when we try to reject the null hypothesis.
pvals < 0.05
chromosome
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True
20    False
21    False
22    False
X     False
Y      True
MT    False
dtype: bool
pvals.iloc[np.where(pvals < 0.05)]
chromosome
19    0.009924
Y     0.001163
dtype: float64
Remember that the meaning of $\alpha$ is the probability of rejecting the null hypothesis, if it were actually true.
Assuming that $H_0$ IS true for every chromosome, in how many of our hypothesis tests do we expect $H_0$ to be rejected?
0.05 * 25
1.25
Clearly we do not want to be generating these false positives. We need to control the false positive rate in some way.
The simplest solution is to fix the so-called family-wise error rate (FWER): this makes our chosen $\alpha$ the probability of ANY Type I error over all tests.
We simply use the stricter rejection threshold $\alpha/n$, where $n$ is the number of tests.
This is known as the Bonferroni correction.
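Assuming the 25 tests are independent and all truly null, we can compare the probability of at least one false positive before and after the correction:

```python
alpha, n = 0.05, 25

# P(at least one Type I error) = 1 - P(no Type I error in any test)
fwer_uncorrected = 1 - (1 - alpha) ** n
fwer_bonferroni = 1 - (1 - alpha / n) ** n

print("uncorrected:", fwer_uncorrected)   # roughly 0.72
print("Bonferroni: ", fwer_bonferroni)    # stays below alpha = 0.05
```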
pvals.iloc[np.where(pvals < 0.05/25)]
chromosome Y 0.001163 dtype: float64
The problem with the Bonferroni correction is that it is often too conservative, and will fail to detect some real positives (Type II errors).
A better solution is to control the false discovery rate (FDR). This means that, of the hypotheses we reject, we expect at most a proportion $\alpha$ to be Type I errors.
We convert the original p-values to so-called q-values, which we compare to the original $\alpha$.
There are several procedures to do this correction, but the most widely used is known as Benjamini-Hochberg.
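For intuition, here is a minimal sketch of the Benjamini-Hochberg computation (scipy's `false_discovery_control`, used below, does this plus additional edge-case handling):

```python
import numpy as np

def bh_qvalues(pvals):
    # Benjamini-Hochberg: q for the i-th smallest p-value is
    # min over j >= i of p_(j) * n / j
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0, 1)
    return q

# tiny made-up example
q = bh_qvalues([0.005, 0.03, 0.04, 0.2])
print(q)
```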
qvals = pd.Series(stats.false_discovery_control(pvals), index=pvals.index)
qvals
chromosome
1     0.862523
2     1.000000
3     0.585742
4     1.000000
5     1.000000
6     0.898455
7     1.000000
8     1.000000
9     0.862523
10    1.000000
11    0.585742
12    1.000000
13    1.000000
14    0.585742
15    0.862523
16    1.000000
17    0.585742
18    1.000000
19    0.124056
20    0.862523
21    1.000000
22    0.898455
X     0.585742
Y     0.029086
MT    1.000000
dtype: float64
qvals.iloc[np.where(qvals < 0.05)]
chromosome
Y    0.029086
dtype: float64
So in fact only the Y chromosome shows significant evidence of overrepresentation of genes for this biological process.