Test statistics
In the (not so distant) past, fast computer simulations were not accessible.
Hypothesis tests had to be based on precomputed tables of p-values for known distributions. But not every situation corresponds directly to a random variable that is easy to describe.
Because some situations are commonly encountered, statisticians developed test statistics, which:
- Are calculated directly from the observed quantities (i.e. they are statistics).
- Under certain assumptions, can be assumed to follow a specified probability distribution.
- Form the basis of a named standard hypothesis test.
As long as the situation correctly fits the assumptions of the test, we can follow the hypothesis testing procedure previously described.
The p-value is now based on the test statistic (and its probability distribution under $H_0$), not the directly observed value.
Student's t-test
The independent samples t-test is used to test for significant differences in the mean values of a variable that has been measured for two sub-populations.
The test statistic is
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$
where
$$ s_p = \sqrt{\frac{(n_1 - 1)s_{1}^2 + (n_2 - 1)s_{2}^2}{n_1 + n_2-2}} $$
is the pooled standard deviation of the two samples.
The assumptions for this test are:
- Each sample is independent and identically distributed.
- The sample means $\bar{x}_1$ and $\bar{x}_2$ should behave as if drawn from normal distributions. This is always true if the variables $X_{1}$ and $X_{2}$ are themselves normally distributed, but is also true for other distributions as long as the sample sizes are large enough (via the central limit theorem).
- The two distributions have the same variance. (There is another version of the t-test that does not require this assumption, known as Welch's t-test).
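As a brief sketch with made-up data (the variable names and values here are illustrative, not from the penguin dataset), `scipy.stats.ttest_ind` switches between the two versions of the test via its `equal_var` parameter:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
a = rng.normal(loc=10, scale=1, size=40)   # small variance
b = rng.normal(loc=10, scale=5, size=40)   # large variance

# equal_var=True: classic Student's t-test (pooled variance, as above)
student = ttest_ind(a, b, equal_var=True)

# equal_var=False: Welch's t-test (does not assume equal variances)
welch = ttest_ind(a, b, equal_var=False)

print("Student:", student.pvalue)
print("Welch:  ", welch.pvalue)
```

When the equal-variance assumption is doubtful, Welch's version is generally the safer default.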
IF
the assumptions of the test are satisfied
AND
the null hypothesis $H_{0}$ is true
THEN
the t statistic behaves as if it is sampled from a Student's t-distribution.
This distribution has a single parameter known as the degrees of freedom, which here equals $n_{1} + n_{2} - 2$.
Student's t-test: example
Do the Adelie populations on Dream and Torgersen differ significantly in their body mass?
$H_{0}$: There is no difference between the mean body mass of the two sub-populations.
$\alpha = 0.05$
Calculate the t-statistic using the observed data.
Adelie penguins on Dream have body mass statistics
- $\bar{x}_1$ = 3688 g
- $s_1$ = 455.1 g
- $n_1$ = 56
Adelie penguins on Torgersen have body mass statistics
- $\bar{x}_2$ = 3706 g
- $s_2$ = 445.1 g
- $n_2$ = 51
$t$ = -0.20624
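These summary statistics are enough to compute $t$ by hand; because the values above are rounded, the final digits differ slightly from the value obtained from the raw data:

```python
import math

# Summary statistics (rounded) for Adelie body mass
x1, s1, n1 = 3688, 455.1, 56   # Dream
x2, s2, n2 = 3706, 445.1, 51   # Torgersen

# pooled standard deviation
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

t = (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2))
print("t =", t)
```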
The t-distribution of interest has degrees of freedom $n_1 + n_2 - 2$ = 105
What does "as extreme or more extreme" mean in this case?
Clearly, values of $t$ less than -0.20624 are "more extreme".
But also, the t statistic is based on the difference between the two means, which could be positive or negative.
Because we did not a priori have a belief about which population would have bigger body mass, we should also consider the values $t$ > +0.20624 as possible results that are "more extreme" than the observed difference.
This situation is known as a two-tailed test, where deviations of the test statistic in either direction (both the upper and lower tails) are considered relevant to our original question.
In a two-tailed test, we double the p-value associated with the tail that contains our observed value - in this case, the lower tail.
This convention (doubling the smaller tail probability) also works with test statistics having asymmetrical probability distributions.
pval = stats.t.cdf(t_obs, 105) * 2
print("p =", pval)
p = 0.8370013888792112
So $p > \alpha$ and we do not reject the null hypothesis.
In practice, we can perform hypothesis tests much more easily using Python functions that accept our data and perform all of the necessary calculations.
from scipy.stats import ttest_ind
result = ttest_ind(adelie_dream['body_mass_g'],
                   adelie_torgersen['body_mass_g'],
                   nan_policy='omit',
                   alternative="two-sided")
print("observed t =", result.statistic)
print("p-value =", result.pvalue)
observed t = -0.20624183279962707
p-value = 0.8370013888792112
Chi-square test
As another example of a hypothesis test, we will look at the chi-square test for independence.
This test is used to assess whether two categorical variables are independent of each other, by comparing observed frequencies to their expectations under $H_0$ (i.e. independence of the two variables).
The test statistic is
$$ X^2=\sum^k_{i=1}{\frac{(x_i-m_i)^2}{m_i}} $$
where
$x_i$ are the observed frequencies for each of $k$ outcomes
$m_i$ are the expected frequencies under $H_0$
$m_i = n p_i$, where $n$ is the total number of observations and $p_i$ is the probability of outcome $i$ under $H_0$.
Under the null hypothesis, $X^2$ behaves as if sampled from a $\chi^2$ distribution.
The degrees of freedom for this distribution equal the number of independent frequencies.
Chi-square test: example
Is the sex ratio of all three penguin species the same?
$H_0$: sex is independent of species
$\alpha$ = 0.05
observed frequencies:
| species | FEMALE | MALE |
|---|---|---|
| Adelie | 73 | 73 |
| Chinstrap | 34 | 34 |
| Gentoo | 58 | 61 |
expected frequencies:
| species | FEMALE | MALE |
|---|---|---|
| Adelie | 72.342342 | 73.657658 |
| Chinstrap | 33.693694 | 34.306306 |
| Gentoo | 58.963964 | 60.036036 |
chisq_obs = 0.04860717014078318
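This value can be reproduced directly from the tables above: the expected frequencies under independence are the products of the row and column totals divided by the grand total.

```python
import numpy as np

# observed counts (rows: Adelie, Chinstrap, Gentoo; cols: FEMALE, MALE)
observed = np.array([[73, 73],
                     [34, 34],
                     [58, 61]])

# expected counts under independence: outer product of margins / total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chisq_obs = ((observed - expected)**2 / expected).sum()
print("X^2 =", chisq_obs)
```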
Degrees of freedom for a contingency table = (number of columns - 1) * (number of rows - 1) = (2 - 1) * (3 - 1) = 2 in this case.
"As extreme or more extreme" in this case refers only to the upper tail.
pval = 1 - stats.chi2.cdf(chisq_obs, 2)
print("pval =", pval)
pval = 0.9759893689765846
Clearly p > $\alpha$ and we do not reject the null hypothesis.
Alternatively, using the appropriate scipy.stats function:
pval = stats.chi2_contingency(pd.crosstab(penguins['species'], columns=penguins['sex'])).pvalue
print("p =", pval)
p = 0.9759893689765846
Correction for multiple testing
p-values make sense as a way to assess evidence for individual hypothesis tests.
But we encounter a problem when we start to think about a series of multiple related tests.
Suppose that we want to test every human chromosome for overrepresentation of genes associated with a particular biological process.
We will use significance level $\alpha$ = 0.05 as usual.
   chromosome  gene_count  spermatogenesis
0           1       18332               38
1           2       13869               15
2           3       10564               25
3           4        8217               15
4           5        9534               13
5           6       10600               21
6           7       10050               14
7           8        7922               12
8           9        8235               17
9          10        8735                4
10         11       10030               24
11         12        9163               14
12         13        4792                7
13         14        6480               16
14         15        6660               14
15         16        7840               10
16         17        9604               23
17         18        3661                3
18         19        8851               27
19         20        5407               12
20         21        2888                4
21         22        5012               10
22          X        5693               16
23          Y        1626               10
24         MT         110                0
We can calculate a p-value for each chromosome, in this case using the hypergeometric distribution.
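As a sketch for a single chromosome: the p-value for the Y chromosome is the probability of drawing at least 10 spermatogenesis genes in a sample of 1626 genes, with the totals taken from the table above.

```python
from scipy import stats

# per-chromosome counts, copied from the table above
gene_count = [18332, 13869, 10564, 8217, 9534, 10600, 10050, 7922, 8235,
              8735, 10030, 9163, 4792, 6480, 6660, 7840, 9604, 3661,
              8851, 5407, 2888, 5012, 5693, 1626, 110]
sperm = [38, 15, 25, 15, 13, 21, 14, 12, 17, 4, 24, 14, 7, 16, 14,
         10, 23, 3, 27, 12, 4, 10, 16, 10, 0]

N = sum(gene_count)   # total genes in the genome
K = sum(sperm)        # total spermatogenesis genes

# P(X >= 10) for the Y chromosome (1626 genes, 10 observed):
# the hypergeometric survival function evaluated at 10 - 1
pval = stats.hypergeom.sf(10 - 1, N, K, 1626)
print("p =", pval)
```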
chromosome
1     0.284597
2     0.994084
3     0.141137
4     0.580636
5     0.910373
6     0.431870
7     0.902822
8     0.812496
9     0.379510
10    0.999949
11    0.135521
12    0.818817
13    0.796613
14    0.164008
15    0.371597
16    0.924934
17    0.140757
18    0.968637
19    0.009924
20    0.319097
21    0.791726
22    0.467197
X     0.073705
Y     0.001163
MT    1.000000
dtype: float64
The problem comes when we try to reject the null hypothesis.
pvals < 0.05
chromosome
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True
20    False
21    False
22    False
X     False
Y      True
MT    False
dtype: bool
pvals.iloc[np.where(pvals < 0.05)]
chromosome
19    0.009924
Y     0.001163
dtype: float64
Remember that the meaning of $\alpha$ is the probability of rejecting the null hypothesis, if it were actually true.
Assuming that $H_0$ IS true for every chromosome, in how many of our hypothesis tests do we expect $H_0$ to be rejected?
0.05 * 25
1.25
Clearly we do not want to be generating these false positives. We need to control the false positive rate in some way.
The simplest solution is to fix the so-called family-wise error rate (FWER): this makes our chosen $\alpha$ the probability of ANY Type I error over all tests.
We simply use the stricter rejection threshold $\alpha/n$, where $n$ is the number of tests.
This is known as the Bonferroni correction.
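Assuming the 25 tests are independent and all truly null, we can compare the probability of at least one false positive before and after the correction:

```python
alpha, n = 0.05, 25

# P(at least one Type I error) = 1 - P(no Type I error in any test)
fwer_uncorrected = 1 - (1 - alpha) ** n
fwer_bonferroni = 1 - (1 - alpha / n) ** n

print("uncorrected:", fwer_uncorrected)   # roughly 0.72
print("Bonferroni: ", fwer_bonferroni)    # stays below alpha = 0.05
```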
pvals.iloc[np.where(pvals < 0.05/25)]
chromosome Y 0.001163 dtype: float64
The problem with the Bonferroni correction is that it is often too conservative, and will fail to detect some real positives (Type II errors).
A better solution is to control the false discovery rate (FDR). This means that, of the hypotheses we reject, we expect at most a proportion $\alpha$ to be Type I errors.
We convert the original p-values to so-called q-values, which we compare to the original $\alpha$.
There are several procedures to do this correction, but the most widely used is known as Benjamini-Hochberg.
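For intuition, here is a minimal sketch of the Benjamini-Hochberg computation (scipy's `false_discovery_control`, used below, does this plus additional edge-case handling):

```python
import numpy as np

def bh_qvalues(pvals):
    # Benjamini-Hochberg: q for the i-th smallest p-value is
    # min over j >= i of p_(j) * n / j
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0, 1)
    return q

# tiny made-up example
q = bh_qvalues([0.005, 0.03, 0.04, 0.2])
print(q)
```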
qvals = pd.Series(stats.false_discovery_control(pvals), index=pvals.index)
qvals
chromosome
1     0.862523
2     1.000000
3     0.585742
4     1.000000
5     1.000000
6     0.898455
7     1.000000
8     1.000000
9     0.862523
10    1.000000
11    0.585742
12    1.000000
13    1.000000
14    0.585742
15    0.862523
16    1.000000
17    0.585742
18    1.000000
19    0.124056
20    0.862523
21    1.000000
22    0.898455
X     0.585742
Y     0.029086
MT    1.000000
dtype: float64
qvals.iloc[np.where(qvals < 0.05)]
chromosome
Y    0.029086
dtype: float64
So in fact only the Y chromosome shows significant evidence of overrepresentation of genes for this biological process.