Commit 8cab390e authored by Thomas Schwarzl

Merge branch 'master' of git.embl.de:schwarzl/intro-to-bioinf

parents 7e902575 ea357ed3
@@ -11,6 +1,8 @@ output:
BiocStyle::pdf_document:
toc: true
highlight: tango
editor_options:
chunk_output_type: console
---
Materials adapted from Bernd Klaus
@@ -21,7 +23,6 @@ Materials adapted from Bernd Klaus
library(tidyverse)
library(psych)
library(factoextra)
library(plotly)
library(ComplexHeatmap)
library(viridis)
library(circlize)
@@ -71,7 +72,7 @@ learning algorithm based on the original image features. Each cell is classified
into one of 10 mitotic phenotype classes.
```{r}
input_data <- readRDS("./data/pca_input_data.rds")
input_data <- readRDS("../data/pca_input_data.rds")
(input_data <- as_tibble(input_data))
```
---
title: "Introduction to hypothesis testing"
author: "Florian Huber"
date: "`r Sys.Date()`"
output:
bookdown::html_document2:
toc: true
toc_float: true
highlight: tango
code_folding: show
BiocStyle::pdf_document:
toc: true
highlight: tango
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
library(tidyverse)
```
# Introduction to hypothesis testing
When we do empirical science we are interested in comparing groups or in
assessing the impact of one variable on another. Typically, we ask questions
such as "Is the average expression of those two genes different between the cell
types?", "Is growth affected by an increase in the drug concentration?" or, in
an omics experiment, "Which proteins increase in abundance after knocking out
gene X?"
Unfortunately, we cannot simply compare measurements between groups because all
measurements are associated with a certain error. Whenever we
observe a difference it could be simply due to chance. Estimating when a
difference is more likely to be due to chance and when it is more likely to
be the result of a real difference is the aim of hypothesis testing.
## "Warm-up": distributions of random variables
Random variables can follow lots of different distributions. Important
distribution types are, for example, the normal, binomial, chi-square, F
(Fisher), or the t distribution.

You can always draw random samples from these distributions using the functions
`r<dist>()` (e.g. `rnorm()`, `rbinom()`, `rt()`); `rnorm(3)`, for example,
returns 3 randomly drawn observations from a standard normal distribution.
```{r}
rnorm(3)
rbinom(n = 1, size = 3, prob = 0.5)
rt(3, df = 2)
```
When we draw a sample that is large enough, the histogram approaches the
mathematical shape of the corresponding probability density function. For many
distributions we need only one or a few parameters to completely characterise
the distribution. For example, the mean and the standard deviation are enough
to completely specify a normal distribution. Therefore, those distributions are
also called *parametric*.
```{r}
v <- data.frame(x = rnorm(10000))
mean(v$x)
sd(v$x)
ggplot(v, aes(x = x)) +
geom_histogram(aes(y = ..density..)) +
geom_density(colour = "red")
```
For all the distributions mentioned above you can obtain the density at a
point x using `dnorm()` (and `dt()`, `dbinom()`, etc.).
```{r}
dnorm(0)
```
The probability of drawing a value smaller than or equal to x, i.e.
$P(X \le x)$, is given by `pnorm()`, `pbinom()`, etc.
```{r}
pnorm(0)
```
The inverse is given by the *quantile function*: `qnorm(p)` returns the value
(quantile) for which the probability of drawing a value smaller than or equal
to it is `p`.
```{r}
qnorm(0.5)
qnorm(0.975)
```
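As a quick sanity check, the distribution function and the quantile function
are inverses of each other:
```{r}
# pnorm() and qnorm() undo each other:
pnorm(qnorm(0.975))   # recovers the probability 0.975
qnorm(pnorm(1.96))    # recovers the quantile 1.96
```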
## Hypothesis tests: basic idea
All hypothesis tests share the same basic idea. We first formulate a so-called
"null hypothesis" (H0). The null hypothesis is a statement about the
distribution of a random variable (e.g. a measurement value). For example, we
could say: "My null hypothesis is that my OD values are normally distributed
with a mean of 0.5 and a standard deviation of 0.1". Then we perform a
measurement and observe a value of, say, 0.6. The probability that this, or an
even more extreme (i.e. even less probable) value, occurs by chance, given that
H0 is true, is `r round(1-pnorm(0.6, mean = 0.5, sd = 0.1), digits = 2)`.
Hence, this event is not that unlikely. But when do we say that an event is so
unlikely that we reject H0? This is a matter of convention: typically, H0 is
rejected if the probability is lower than 0.05 or 0.01, the so-called
*significance level* $\alpha$. In a fraction $\alpha$ of the cases we will then
reject H0 even though it is correct. This type of error is known as the
*type I error*.
In the case of a standard normal distribution this means: if a measurement is
drawn that is more than 1.96 standard deviations away from 0 then, because this
event has a probability less than 5%, we would reject H0 and not believe that
it comes from a population with mean = 0 and standard deviation = 1.
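A minimal sketch of this rejection rule: the 1.96 cutoff is the 97.5% quantile
of the standard normal, so the two tails together carry 5% probability.
```{r}
# two-sided 5% rejection region under a standard normal null
alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)   # approximately 1.96
cutoff
2 * (1 - pnorm(cutoff))          # probability of a more extreme value: equals alpha
```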
## Hypothesis tests: the t test
In many experiments we are interested in a question that is very similar to
the one above. Often, we want to know whether the *mean* of a population is
different from, say, 0. Or we want to compare two groups and ask whether
their means are different. In the former case our null hypothesis would be
that the mean is 0.
When we calculate the mean from a sample, the resulting mean is again
normally distributed. The standard deviation of such a mean is of course
smaller than the standard deviation of the distribution it has been calculated
from. More precisely, its standard deviation is $\sigma/\sqrt{n}$, where
$\sigma$ is the standard deviation of the population and $n$ the sample size.
$\sigma/\sqrt{n}$ is also known as the *standard error of the mean*: it is the
standard deviation of the mean. Statisticians call the standard deviations of
parameters that we are estimating *standard errors*.
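We can convince ourselves of this with a small simulation: the standard
deviation of many sample means is close to $\sigma/\sqrt{n}$.
```{r}
# standard deviation of 10000 means of samples of size 5,
# drawn from a standard normal distribution (sigma = 1)
n <- 5
sample_means <- replicate(10000, mean(rnorm(n)))
sd(sample_means)   # close to the theoretical value below
1 / sqrt(n)        # theoretical standard error of the mean: sigma / sqrt(n)
```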
Then the standardised z-score (our "test statistic") is given by
![](zscore.png){width=20%}
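In formula form (assuming the usual definition of the standardised sample
mean), this is
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}},$$
where $\bar{X}$ is the sample mean, $\mu_0$ the mean under H0, $\sigma$ the
population standard deviation and $n$ the sample size.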
Z follows a standard normal distribution, and we could check whether the
probability of observing a z-score at least this extreme is smaller than some
significance level $\alpha$.
However, the problem is that we usually do not know the standard deviation of
the population. Instead, we have to estimate it using the standard deviation of
the sample. The resulting test statistic behaves slightly differently from the
z-score above. It is now called a *t statistic* and is not normally distributed;
instead, it is t-distributed with n - 1 degrees of freedom, n being the size of
the sample that has been drawn:
![](tstat.png){width=20%}
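In formula form (again assuming the standard definition), the t statistic
replaces the population standard deviation $\sigma$ with the sample standard
deviation $s$:
$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$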
We can actually show that this is true by simulating some data.
First, we repeatedly "perform 3 measurements", i.e. we draw samples of size 3
from a normally distributed random variable with a mean of 0 and a standard
deviation of 1.
```{r}
samplesize <- 3
my_tstats <- replicate(10000, rnorm(samplesize), simplify = FALSE)
my_tstats <- tibble(sample = my_tstats)
head(my_tstats)
```
Next, we calculate for each sample drawn its mean and the standard deviation of
the mean, i.e. the standard error of the mean (sem).
```{r}
my_tstats$mean <- map_dbl(my_tstats$sample, mean)
my_tstats$sem <- map_dbl(my_tstats$sample, function(x) {
sd(x) / sqrt(length(x))
})
```
Next, we calculate the t-statistic for each sample. This is like a standardised
version of the mean of the sample we have drawn.
```{r}
get_tstatistic <- function(x) {
  # t statistic for H0: true mean = 0, i.e. mean(x) / (sd(x) / sqrt(n))
  sqrt(length(x)) * (mean(x) / sd(x))
}
my_tstats$tstatistic <- map_dbl(my_tstats$sample, get_tstatistic)
```
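As a cross-check, this hand-rolled t statistic agrees with the statistic
reported by R's built-in `t.test()` for a one-sample test against a mean of 0:
```{r}
# compare our t statistic with t.test() for the first simulated sample
x <- my_tstats$sample[[1]]
get_tstatistic(x)
t.test(x, mu = 0)$statistic
```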
Just as a comparison we also make a column containing 10000 randomly drawn
observations of a t-distribution with n-1 degrees of freedom. n is our sample
size. Remember that the t-statistic we have calculated should be t-distributed.
We will check if this is true by plotting both distributions. The numbers
sampled from a t-distribution we will plot in red, the t-statistics derived
from our samples we will plot in grey.
```{r}
my_tstats$rt <- rt(10000, df = samplesize - 1)
my_tstats
ggplot(my_tstats) +
geom_histogram(aes(x = rt), alpha = 0.5, binwidth = 0.1, fill = "red") +
geom_histogram(aes(x = tstatistic), alpha = 0.5, binwidth = 0.1,
fill = "grey") +
coord_cartesian(xlim = c(-5, 5)) +
theme_bw()
```
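A more quantitative check than overlaying histograms is a quantile-quantile
comparison: if the simulated t statistics really follow a t distribution with
n - 1 degrees of freedom, the points should fall on the diagonal.
```{r}
# QQ-plot of simulated t statistics against the theoretical t distribution
qqplot(qt(ppoints(500), df = samplesize - 1), my_tstats$tstatistic,
       xlab = "theoretical t quantiles (df = n - 1)",
       ylab = "simulated t statistics")
abline(0, 1, col = "red")
```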
## Confidence intervals
Next, we will explore the meaning of confidence intervals. A 95% confidence
interval is constructed such that, under repeated sampling, it contains the
true parameter in 95% of cases. Since our standardised means are t-distributed,
we need the 97.5% quantile of the t distribution with n - 1 degrees of freedom
for a two-sided interval.
```{r}
t_quantile <- qt(0.975, df = samplesize-1)
```
Using this, we can construct a confidence interval for all samples.
```{r}
my_tstats$lower <- my_tstats$mean - t_quantile * my_tstats$sem
my_tstats$upper <- my_tstats$mean + t_quantile * my_tstats$sem
my_tstats
```
How many of the confidence intervals contain the true mean, which we know to be
0?
```{r}
my_tstats$contains_mean <- my_tstats$lower < 0 & my_tstats$upper > 0
sum(my_tstats$contains_mean) / nrow(my_tstats)
```
So 95% confidence intervals are defined as follows: if we construct 95%
confidence intervals an infinite number of times, then the true mean will be
contained in 95% of these confidence intervals.
In summary, we asked: given a certain null hypothesis, which in this case we
knew to be a normal distribution with mean = 0 and sd = 1 (the "null"), how does
the standardised mean (the t-statistic) of a sample of size 3 drawn from this
null distribution behave?
This kind of reasoning is common to all hypothesis tests: we make an assumption
about how our measurement values behave under a certain distribution (the null
hypothesis). If the measurement value we then obtain is very unlikely under this
null hypothesis (i.e. its probability is smaller than our significance level
$\alpha$), we reject the null hypothesis.