P-values and Statistical Tests

In this exercise session, we study basic statistical analysis tools, such as point estimates and statistical tests.

  • You will learn how to manipulate table-like data in GNU R.
  • You will learn how to compute basic descriptive aggregate statistics.
  • You will learn how to perform various statistical tests.
  • You will learn how to test whether a statistical test is applicable.

The aim of the exercise session is not to give a comprehensive theoretical overview of the underlying statistical techniques; rather, we give a practical recipe for conducting the study. Feel free to study all the intricacies by yourself.

What is GNU R?

GNU R is a programmable statistical environment similar to Matlab or Scilab. As such, it can be used in two modes: interactive mode and batch mode. In the interactive mode, GNU R behaves like a calculator with extended data types and processing routines---pretty much like an interactive Python session. In the batch mode, GNU R behaves like a program that reads in code, executes it and finally outputs the results. Throughout the exercise session, we use the interactive mode. However, we still advise you to write the commands into a separate text file; then it is much easier to keep track of what has happened and later convert the session into a script that can be run in batch mode. In most cases, such interactive programming is the best way to write scripts. In the following text, we present all code snippets in separate boxes

> print("Hello world")
[1] "Hello world"
where the text after the prompt denotes the commands to be entered and the text without the leading > denotes the corresponding output. For longer code segments, we sometimes omit the prompt signs to facilitate easier copying. Although most of the data analysis routines come with the base installation of GNU R, it is convenient to use add-on packages for specific data analysis tasks. To test whether all necessary packages are installed, type the following commands
> library(stat)
> library(ggplot2)
If you see the following type of output messages
>library(stat)
Error in library(stat) : there is no package called 'stat'
then some of the packages are not installed. To install a package, type
> install.packages("stat")
and hope for the best. On servers, the installation is a bit more involved, since you do not have all the permissions and a special install directory has to be specified. However, in most cases the command just does the right thing. To find more details about the command, type ?install.packages. Similarly, one can ask for help on any command. By typing ??term you invoke a fuzzy search of the term over all available help texts. This is useful when you do not know the exact command and do not know how to Google the term.

Basics of data manipulation

Data frames. In the context of statistical analysis, data frames are the most commonly used data structure in GNU R. Shortly put, a data frame is a table with row and column names. The entries of a data frame can be either strings or real numbers. Data frames are commonly formed from column vectors. For instance, the following commands

> names <- c("Alice", "Adam", "Charlie", "Elisa")
> genders <- c("Female", "Male", "Male", "Female")
> heights <- c(153, 178, 169, 190)
> tbl <- data.frame(Name = names , Sex = genders, Height = heights)
> tbl
Name Sex Height
1 Alice Female 153
2 Adam Male 178
3 Charlie Male 169
4 Elisa Female 190
first generate column vectors corresponding to the attributes Name, Sex and Height, which are then glued together with the command data.frame. The result can be shown by writing the name of the resulting variable tbl at the prompt. Another way to construct data frames is to read in csv files or tab-separated files as done below
> data <- read.csv("medical-data.csv", header = TRUE, as.is = TRUE)
> dim(data)
[1] 294 15
read in a csv file consisting of 294 rows and 15 columns. The command dim always displays the dimensions of a data frame. The commands head and tail allow you to see the first or last rows of a data frame. This is extremely useful for verifying whether some data manipulation commands indeed produce the desired result.
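For instance, a quick sketch using the small table constructed above; head(data) and tail(data) work the same way on the medical data.
> head(tbl, 2)
   Name    Sex Height
1 Alice Female    153
2  Adam   Male    178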

Side remark. To follow the example, you have to download the file medical-data.csv and save it to the working directory of GNU R. The commands getwd and setwd show and modify the working directory. By default, the working directory is either the home directory (for most graphical interfaces of GNU R) or the directory from which the R command was executed in the terminal.
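For example (a sketch; the directory shown below is hypothetical and should be replaced with your own path):
> getwd()
[1] "/home/student"
> setwd("/home/student/exercises")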

Data access. There are several ways to access the data stored into a data frame. First, we can access a single column by issuing one of the commands

> tbl[["Name"]]
[1] Alice Adam Charlie Elisa
Levels: Adam Alice Charlie Elisa
> tbl$Name
[1] Alice Adam Charlie Elisa
Levels: Adam Alice Charlie Elisa
The result is a vector of factor values (a set of discrete values belonging to the fixed set shown as levels above). In most cases, factors behave exactly as ordinary strings, but sometimes they are full of surprises.

Side remark. A factor is defined by the list of plausible values (levels) and the actual values. The weird behaviour occurs if you try to compare factors with different levels or assign a value that is not among the plausible values. For instance, the following code snippet triggers the two most common errors

> x <- c("Alice", "Bob")
> y <- factor(x, levels = c("Alice", "Bob", "Charlie"))
> z <- factor(x, levels = c("Alice", "Bob", "Eve"))
> y; z
[1] Alice Bob
Levels: Alice Bob Charlie
[1] Alice Bob
Levels: Alice Bob Eve
> y == z
Error in Ops.factor(y, z) : level sets of factors are different
> y[1] <- "Eve"
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "Eve") : invalid factor level, NAs generated
> y
[1] <NA>  Bob
Levels: Alice Bob Charlie
but there are several other hidden pitfalls connected to type conversions.

Second, it is possible to select several columns or even a single column as a data frame

> tbl[c("Name", "Sex")]
Name Sex
1 Alice Female
2 Adam Male
3 Charlie Male
4 Elisa Female
> tbl["Name"]
Name
1 Alice
2 Adam
3 Charlie
4 Elisa
That is, single-bracket selection always yields a data frame and double-bracket selection always produces a vector. Third, it is possible to select sub-matrices by specifying row and column indices
> tbl[1:3, c("Name", "Height")]
Name Height
1 Alice 153
2 Adam 178
3 Charlie 169
The indices can be specified in terms of row and column names or in terms of row and column numbers. These numbers start from 1 and end with the row count nrow(tbl) or the column count ncol(tbl). For instance, the following commands are identical, as Height is the third column
> tbl[1, "Height"]
[1] 153
> tbl[1, 3]
[1] 153
To alter the data, one has to select a part of the data frame and provide new values for the selection. In particular, you can change the data cell by cell
> tbl[1, "Height"] <- 160
> tbl[1, ]
Name Sex Height
1 Alice Female 160

Advanced selection operations. Often, it is necessary to select rows that satisfy a certain criterion. As an illustrative example, consider the case where you want to select all male patients that have high blood pressure and are over forty. A quick check with the commands colnames(data) and head(data) shows that the data frame data possesses the necessary columns. To do the corresponding selection, we can execute the following command

> subset(data, Sex == "Male" & Age > 40 & Blood.Pressure > 170, select = c("Age", "Blood.Pressure", "EKG"))
    Age Blood.Pressure                   EKG
85   46            180 ST-T wave abnormality
182  59            180                normal
246  54            200                normal
267  53            180 ST-T wave abnormality
The second argument allows us to specify the restrictions, formalised as a logical expression over column values, and the third argument allows us to select which columns to keep.
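The same selection can also be written with plain logical indexing; the following sketch assumes the same column names as above.
> rows <- data$Sex == "Male" & data$Age > 40 & data$Blood.Pressure > 170
> data[which(rows), c("Age", "Blood.Pressure", "EKG")]
Here which drops the rows for which a comparison evaluates to NA, mimicking the behaviour of subset.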

Basics of data aggregation

Standard statistical primitives. Standard univariate analysis is straightforward in GNU R, since many point estimates are available as commands: min, max, mean, median, sd, var, quantile and many others. Moreover, they work naturally over data frames. For instance, the following command produces means over the columns

> mean(data)
Age Sex Chest.Pain Blood.Pressure ...
47.8265306 NA NA NA ...
However, note that almost all means are invalid---represented by NA symbols. The data does not contain all measurements and thus some cells in the data frame are filled with NA values. It is possible to discard these in computations by setting the flag na.rm = TRUE. Then the computation fails only for columns consisting of strings and factors. In most cases, the simplest way to get an initial grip on the data is just to print the summary
 > summary(data)
Age            Sex             Chest.Pain      
Min.   :28.00  Length:294        Length:294         ...
1st Qu.:42.00  Class :character  Class :character   ...
Median :49.00  Mode  :character  Mode  :character   ...
Mean   :47.83                                       ...
3rd Qu.:54.00                                       ...
Max.   :66.00                                       ... 
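To illustrate the na.rm flag mentioned above, compare the following two calls on a column that contains missing values (a sketch; Blood.Pressure is one such column in this dataset):
> mean(data$Blood.Pressure)
[1] NA
> mean(data$Blood.Pressure, na.rm = TRUE)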
Another extremely powerful tool is the histogram, which is drawn by first selecting the desired column and then running the histogram routine hist on it. Unfortunately, the histogram routine cannot handle categorical data (strings and factors). Hence, only the first command succeeds
> hist(data[["Blood.Pressure"]])
Example histogram
> hist(data[["Sex"]])
Error in hist.default(data[["Sex"]]) : 'x' must be numeric
For categorical data one has to use the table command
> table(data[["Sex"]])
Female Male
81 213
> table(data[c("Sex","EKG")])
        EKG
Sex      left ventricular hypertrophy normal ST-T wave abnormality
  Female                            1     58                    22
  Male                              5    177                    30
The result is either a count vector, a count table or a count array, depending on how many columns are selected. The relation between continuous variables can be observed by drawing scatter plots
> plot(data[["Heart.Rate"]], data[["Blood.Pressure"]])
Example plot
> plot(data[["Age"]], data[["Blood.Pressure"]])
Example plot

Effective data aggregation. Although the basic analysis commands together with the data manipulation routines are sufficient for computing sub-totals over different sub-groups, doing so is inconvenient, inefficient and clumsy. A far better alternative is to use the built-in data aggregation routines. As an example, let us study whether Sex and Prognosis have a notable effect on Blood.Pressure. For that, let us compute the mean and standard deviation of blood pressure in the resulting sub-populations

> aggregate(data$Blood.Pressure, data[c("Sex","Prognosis")], mean, na.rm=TRUE)
Sex Prognosis x
1 Female Bad 138.5833
2    Male Bad 135.5000
3 Female Good 128.5147
4    Male Good 132.0000
> aggregate(data$Blood.Pressure, data[c("Sex","Prognosis")], sd, na.rm=TRUE)
Sex Prognosis x
1 Female Bad 21.54259
2    Male Bad 18.43836
3 Female Good 17.20248
4    Male Good 16.41000
The function aggregate takes three basic arguments. The first argument specifies the column of values that is split and aggregated. The second argument specifies how the data is split, that is, which columns are used to split the data. In our example, the data is split into four sub-populations according to the values of Sex and Prognosis. This argument must always be a data frame. The third argument specifies the aggregation function used to compress each sub-population into a single value. Any further arguments are passed on to the aggregation function; in our example, na.rm = TRUE is passed to the mean and sd functions. It is also possible to handle aggregation functions with many outputs, but for that one has to use the tapply function. It uses the same argument convention, but the result is much more difficult to unpack and post-process
> tmp <- tapply(data$Blood.Pressure, data[c("Sex", "Prognosis")], summary)
> tmp
        Prognosis
Sex      Bad       Good
  Female Numeric,6 Numeric,7
  Male   Numeric,6 Numeric,6
> tmp["Female", "Bad"]
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
100.0 127.5 136.5 138.6 152.5 180.0
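The same machinery also answers simpler questions; for instance, the following sketch (not part of the original example) counts the non-missing blood pressure measurements in each sub-population with an anonymous aggregation function.
> aggregate(data$Blood.Pressure, data[c("Sex", "Prognosis")], function(v) sum(!is.na(v)))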

How to Visualise Data

Good data visualisation is crucial for diagnostics and brainstorming, but it also provides an easy way to convey complex facts about the data. Although plot and hist are good enough for visualising the most important aspects of the data, more professional-looking plots and charts can be obtained with the ggplot2 package. Unfortunately, this package is somewhat difficult to understand without reading the 200-page tutorial ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham. Hence, we provide only the magical lines that are useful for visualising the most prominent aspects of sub-populations. A more detailed explanation of the commands and an introduction to other graphical primitives can be found on the website http://had.co.nz/ggplot2/.

As a first trick, note that it is very convenient to visualise the spread of a variable in different groups by drawing a jitter plot

> library(ggplot2)
> p <- ggplot(data)
> p <- p + geom_jitter(aes(x = Sex, y = Blood.Pressure), position = position_jitter(width=0.1))
> p
Example plot
The width parameter in position_jitter controls how far the values are spread along the x-axis. To visualise the spread according to both Sex and Prognosis, it is advantageous to define an auxiliary grouping factor
> data$Sex.and.Prognosis <- interaction(data$Sex, data$Prognosis)
> p <- ggplot(data)
> p <- p + geom_jitter(aes(x = Sex.and.Prognosis, y = Blood.Pressure), position=position_jitter(width=0.2))
> p
Example plot
Box plots provide a less flashy but more informative picture
> p <- ggplot(data)
> p <- p + geom_boxplot(aes(x = Sex.and.Prognosis, y = Blood.Pressure))
> p
Example plot
> p <- p + geom_jitter(aes(x = Sex.and.Prognosis, y = Blood.Pressure), position=position_jitter(width=0.2))
> p
Example plot
The resulting graph clearly shows that the group means are not very far from each other once we take into account the variation of blood pressure inside the groups. The biggest difference is between females with good and bad prognosis, and that difference seems large enough to arouse curiosity. In the following, we study how to formalise this gut feeling using statistical tests.

Hypothesis testing

Disclaimer: The following text does not intend to cover the intricacies of statistical hypothesis testing theory. In fact, we omit many details, as our main aim here is to give a practical hands-on introduction to the main concepts. The interested or puzzled reader should always consult standard statistics textbooks for further details and explanations.

In our last example, the blood pressures of women with good and bad prognosis seemed sufficiently far apart. However, this is a subjective opinion. It would look much more scientific if we could somehow quantify it. Often, we have to choose between two hypotheses H0 and H1, where H0 represents a well-formulated but uninteresting case and H1 is a somewhat vaguely stated alternative. For example, we could state as the null hypothesis H0 that the means of the two sub-populations are the same up to measurement error, while the alternative hypothesis H1 states that the means are different, but does not specify how.

In most cases, the data is generated by some sort of random process and thus it is impossible to separate the cases H0 and H1 without errors, as the number of observable samples is finite. Moreover, the question of how well the test (a particular algorithm) works on a particular dataset is meaningless: either the dataset could have been generated under both hypotheses, or we know exactly which hypothesis holds and testing makes no sense.

To bypass this conundrum, statistical tests are rated in terms of their average rate of failures. Note that there are two types of errors: we can reject hypothesis H0 when it holds or accept H0 when the alternative hypothesis H1 holds. The first type of error is known as a false positive and the second type of error is known as a false negative. For historical reasons, statisticians use more cryptic terms: Type I error and Type II error. Normally, a statistical test is calibrated so that the fraction of false positives is less than 5%. In most cases it is impossible to control the fraction of false negatives, as the alternative hypothesis is not well specified and thus we cannot quantify the corresponding probability. We return to this issue in the following examples.

Mnemonics: Whether one calls a Type I error a false negative or a false positive depends on taste. However, as H0 is uninteresting (negative), a false positive is usually associated with the event where H1 is reported although H0 holds.

If we omit all the details, then we can view a statistical test as a black box, which takes in data and some parameters that determine the desired balance between the hypotheses, and outputs a yes (H0) or no (H1) decision. Any statistical test makes some assumptions about the data. To be precise, the test usually specifies only H0, but the usefulness (e.g. optimality or asymptotic optimality) of the test is often proved by assuming something about H1. Hence, one has to verify these assumptions before applying the test, i.e., there are three conceptual inputs to a statistical test.

Black-box view on statistical test

What are p-values?

The simplest way to construct statistical tests with a guaranteed ratio of false positives is based on p-values. Consider an algorithm p that takes an input x (normally a list of real values or objects) and produces a single output value (an aggregate value) from the range [0,1]. Now let H0 be a specification of a data generation process, e.g. a procedure that generates data from the normal distribution N(0,1). The aggregate value is a p-value if its output distribution is uniform over the range [0,1] under the assumption that the data is generated according to H0. Consequently, if we reject hypothesis H0 when p(x) ≤ 0.05, then the probability of a false rejection (the approximate fraction of false rejections in a large series of independent trials) is 5%.

Small print remarks. Note that a test with a good bound on the false positives does not automatically have a good bound on the false negatives. For instance, a test that always accepts H0 makes no false positive errors but its fraction of false negatives is 100%. In most cases, it is impossible to estimate the fraction of false negatives, as we do not know how to formalise the alternative hypothesis. For instance, the fact that a coin is biased does not tell us how biased it is and thus it is impossible to estimate the probability that we falsely accept that the coin is fair: for different levels of bias, this probability will be different. Under certain assumptions, it is possible to prove certain optimality results, but these are asymptotic, i.e., they merely imply that it is reasonable to use the algorithm in general.

There are many types of statistical tests. Let us start with distribution tests. In practice, one often needs to decide whether certain measurement data follows a given distribution. There are many such tests, e.g. the Shapiro-Wilk, Kolmogorov-Smirnov and Pearson chi-square tests. As an example, consider the Shapiro-Wilk test, which is used to determine whether a set of observed values comes from a normal distribution or not. The corresponding command shapiro.test in GNU R returns a complex object consisting of many parameters, as demonstrated below

> x <- c(1:10)
> shapiro.test(x)
Shapiro-Wilk normality test

data: x
W = 0.9702, p-value = 0.8924
> shapiro.test(x)$p.value
[1] 0.8923684
where the last command shows how to choose only the p-value from the entire answer.

As a first illustrative example, we demonstrate that the Shapiro-Wilk test indeed computes a p-value. For that, we must show that p(x) follows the uniform distribution when all samples are drawn from the normal distribution. The following code snippet generates 1000 independent 10-element vectors x drawn from the normal distribution N(0, 1) and then computes the corresponding p-values p(x).

p <- rep(NA, 1000)
for(i in c(1:1000))
{
 x <- rnorm(10)
 p[i] <- shapiro.test(x)$p.value
}
To visualise the resulting distribution, we can use either histograms or quantile-quantile plots as depicted below
hist(p, breaks = 50)
lines(c(0,1), c(20,20), lty = "dashed", col = "blue")
Histogram
q <- c(1:1000)/1000
qqplot(q,p, type="l")
lines(c(0,1),c(0,1), lty = "dashed", col = "blue")
QQ plot
As expected, the histogram values vary a bit but generally follow the theoretical expectation. Note that only 20 values are expected to fall into each bin and thus the variance is large. For a smoother and more reliable result, we should run many more trials. The quantile-quantile plot contrasts the theoretical quantiles with the empirically observed quantiles, so in the ideal case you would see a straight diagonal line between (0,0) and (1,1). A strong deviation from that line indicates that the empirical distribution does not match the theoretical one.

Quantile-quantile plots. Let q be a vector of theoretical quantiles. For concreteness, assume that q_i is such that the probability of getting a value less than or equal to q_i is i/10, that is, q = (0.1, 0.2, ..., 1) for the uniform distribution. Now, let p_i be the corresponding empirical quantiles, that is, the values p_i such that the fraction i/10 of the observed values are less than or equal to p_i. Then qqplot is just a scatter plot of q versus p. In other words, we could emulate qqplot with the following code.
x <- runif(100)
probs <- c(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)
q <- qunif(probs)
p <- quantile(x, probs)
plot(q,p, type = "l", lty = "dashed", col = "blue")
points(q, p, col = "red")
QQ plot
qqplot(q, x, type = "l")
QQ plot
As the theoretical and empirical quantiles are contrasted, the plot shows in which regions the empirical quantiles are smaller and in which regions they are larger than expected. As explained above, an ideal match would be a straight line from (0,0) to (1,1). Of course, quantile-quantile plots can also be used for comparing two empirical distributions (read: for judging whether two samples are presumably generated by different processes).
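For instance, two samples can be compared directly by passing both of them to qqplot; a small sketch with made-up samples:
a <- rnorm(200)
b <- rexp(200)
# If a and b came from the same distribution, the points would lie near the diagonal
qqplot(a, b)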

The p-values produced by standard statistical tests are independent of the size of the data sample: as the test works for all sample sizes and the computed aggregate value is indeed a p-value, it must have a uniform distribution under H0 regardless of the sample size. This is really different from the likelihood of the data, which usually decreases exponentially with the sample size. As a consequence, the probability of false positives stays the same, whereas the probability of false negatives decreases with the sample size. Hence, the concept of statistical testing based on p-values is not entirely fraudulent: although we do not know the probability of false negatives, it is likely to be small if the sample size is large enough.
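As a quick sanity check of the first claim (a sketch, not one of the original experiments), we can estimate the false positive rate of the Shapiro-Wilk test for a few sample sizes when H0 holds; each estimate should stay close to 5%.
for(n in c(10, 100, 1000))
{
 fp <- rep(NA, 2000)
 for(i in c(1:2000))
 {
  # Data generated under H0: a normal sample of size n
  fp[i] <- shapiro.test(rnorm(n))$p.value
 }
 # Fraction of false rejections at the 5% level
 print(c(n, sum(fp < 0.05)/2000))
}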

As a concrete example, consider a setting where the alternative hypothesis H1 is known to be the uniform distribution. First, let us illustrate the concept of false negatives by conducting 5000 experiments with samples of size 10 and computing the average fraction of false negatives.

p <- rep(NA, 5000)
alpha <- rep(NA, 5000)
for(i in c(1:5000))
{
 x <- runif(10)
 p[i] <- shapiro.test(x)$p.value
 alpha[i] <- sum(p[1:i] > 0.05)/i
}
plot(alpha, type="l", ylim=c(0.8, 1.0), col = "blue")
lines(c(0, 5000), c(alpha[5000], alpha[5000]), col = "red", lty = "dashed")
False negatives
The figure clearly shows that the average fraction of false negatives is rather high---over 90% of the cases belonging to H1 are labelled as H0 if we want to keep the probability of false positives below 5%. Also, note how the average converges to the limiting value---the initial rapid convergence quickly slows down. For those of you who have studied statistics, the result is expected, as the variance of the average is of order 1/√N.

To show that the fraction of false negatives indeed decreases with the size of the sample, let us repeat the computation with different sample sizes N = 10, 20, ..., 100. The following code snippet shows how to visualise the results with ggplot2. The error bars are computed according to the crude 1/√N estimate.

...
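# A possible sketch of the elided computation (an assumption, not the original
# code): for each sample size, estimate the fraction of accepted H0 (false
# negatives) over 5000 uniform samples and attach crude error bars.
sizes <- c(1:10) * 10
alpha <- rep(NA, length(sizes))
for(j in c(1:length(sizes)))
{
 pvals <- rep(NA, 5000)
 for(i in c(1:5000))
 {
  pvals[i] <- shapiro.test(runif(sizes[j]))$p.value
 }
 alpha[j] <- sum(pvals > 0.05)/5000
}
alpha_min <- alpha - 1/sqrt(5000)
alpha_max <- alpha + 1/sqrt(5000)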
df <- data.frame(N = c(1:10) * 10, alpha = alpha, alpha_min = alpha_min, alpha_max = alpha_max)
p <- ggplot(df) + geom_point(aes(x = N, y = alpha))
p <- p + geom_errorbar(aes(x = N, ymin = alpha_min, ymax = alpha_max))
p <- p + geom_line(aes(x = N, y = alpha), linetype = "dotted", color = "blue")
p <- p + scale_x_continuous("Sample size") + scale_y_continuous("Fraction of false negatives")
p
False negatives
The figure clearly shows that for the uniform distribution the probability of false negatives quickly drops into an acceptable range: for sample sizes over 80, the probability of accepting H0 is below 5%. Not all distributions behave so nicely. For instance, if we truncate the normal distribution N(0,1) by replacing values above 2 with 2 and values below -2 with -2, we get a distribution that is much closer to the normal distribution. For this distribution, the probability of false negatives decreases much more slowly
False negatives
and it is not clear at all whether the probability of false negatives reaches 0 if the sample size grows to infinity.

Statistical power. Given a statistical test and a certain alternative hypothesis H1, we have conceptually fixed the probability of false negatives. The statistical power of a test is the probability that the test correctly rejects H0 given that H1 holds. Given a test, we can now ask whether it is optimal, i.e., whether for a given threshold on false positives the probability of false negatives is minimal. The complete description of such optimal tests is given by the Neyman-Pearson Theorem, which provides the necessary and sufficient conditions under which a test is optimal. Unfortunately, a direct consequence of the Neyman-Pearson Theorem is that no test can be optimal for all alternative hypotheses. Nevertheless, it is sometimes possible to prove that a certain test is optimal for a certain family of alternative hypotheses (a uniformly most powerful test). In that case, we cannot do any better if we believe that the alternative hypothesis belongs to this family of distributions. In certain cases, one can prove that a test is asymptotically optimal, that is, its fraction of false negatives decreases at the same rate as for the best test designed for the alternative hypothesis. Again, these results are commonly proved by making certain assumptions about the alternative hypotheses, but the interpretation of the guarantees is a bit different. Our test does not have to be comparable with the best test at a specific sample size; instead, its fraction of false negatives must decrease at the same rate as a function of the sample size---we could only hope to improve the behaviour by a constant given a complete description of the alternative hypothesis. In other words, a controlled increase in the failure rate allows us to cover more than one alternative hypothesis.
Example code. The code used to generate all of these figures is available as shapiro-test.R.

T-test as a way to quantify whether difference in means is significant or not

The most common question asked by biologists, and experimenters in general, is whether the results in two groups are sufficiently different, e.g., whether gender has a significant impact on the height of a person or whether a certain drug provides an effective treatment against influenza. Usually, the results are not clearly separable visually, and even if they were, it is common to quantify their distinctness. More formally, let x and y be samples collected from two different groups of individuals. For instance, let x be the vector of blood pressure measurements of healthy patients and let y be the vector of blood pressure measurements of sick patients. Then we can ask whether these samples come from the same distribution or not, i.e., whether the patient's health influences the blood pressure or not. In such a setting, the t-test is one of the most commonly used statistical tests. Under the null hypothesis, both samples are drawn from the same normal distribution; under the alternative hypothesis, the samples are drawn from normal distributions with different means and possibly different variances. Under these assumptions it is possible to estimate the probability of false negatives. The corresponding functions are t.test and power.t.test.

Let us illustrate the usage of the t-test by creating two datasets. In one, the sub-samples x and y are generated by the same distribution N(0,1); in the other, the distributions are shifted by Δ. The first code snippet visualises the result when Δ is either zero or one.

data <- data.frame(Value = rnorm(200))
data$Group <- c(rep("Case",100), rep("Control", 100))
p <- ggplot(data)
p <- p + geom_boxplot(aes(x = Group, y = Value))
p <- p + geom_jitter(aes(x = Group, y = Value), position=position_jitter(width=0.2))
p <- p + scale_x_discrete("") + scale_y_continuous("")
p
Case-control study
x <- subset(data, Group == "Case")$Value
y <- subset(data, Group == "Control")$Value
t.test(x, y)
Welch Two Sample t-test

data: x and y
t = 0.0573, df = 196.558, p-value = 0.9544
...
data <- data.frame(Value = c(rnorm(100, mean = 0), rnorm(100, mean = 1)))
data$Group <- c(rep("Case",100), rep("Control", 100))
p <- ggplot(data)
p <- p + geom_boxplot(aes(x = Group, y = Value))
p <- p + geom_jitter(aes(x = Group, y = Value), position=position_jitter(width=0.2))
p <- p + scale_x_discrete("") + scale_y_continuous("")
p
Case-control study

x <- subset(data, Group == "Case")$Value
y <- subset(data, Group == "Control")$Value
t.test(x, y)
Welch Two Sample t-test

data: x and y
t = -8.863, df = 196.832, p-value = 4.656e-16
...
The results are intuitive. Since the null hypothesis is fulfilled in the first case, the p-value is rather high and the distributions are more or less visually indistinguishable. For the second case, the visual distinction is clearly observable and the corresponding p-value is extremely low. That is, mathematical formalisation supports our intuition.

Note that the smaller the difference Δ is, the larger the number of samples needed to reliably detect H1. As an illustration, consider the minimal number of samples per group needed to detect the difference when the upper bounds on false positives and false negatives are set to 5% and both distributions are assumed to have unit variance.

delta <- c(1:100)/50
n <- rep(NA, 100)
for(i in c(1:100))
{
 n[i] <- ceiling(power.t.test(delta = delta[i], sd = 1, sig.level = 0.05,
   power = 0.95, type = "two.sample", alternative = "two.sided")$n)
}
df <- data.frame(N = n, Delta = delta)
p <- ggplot(df)
p <- p + geom_line(aes(x = Delta, y = N), linetype = "dotted", color = "blue")
p <- p + geom_point(aes(x = Delta, y = N), color = "red")
p <- p + scale_y_log10("Minimal group size") + scale_x_log10("Difference of means")
p
Minimal group size
The log-log plot shows that the sample size is polynomial in the inverse of Δ (the approximate relation is N ∝ Δ^-2), so the sample sizes do not immediately become infeasible as the difference between the means gets smaller. However, the graph also shows that reliably detecting small effects with practical sample sizes N ≤ 1000 can be problematic. The issue is particularly critical if we consider the sample sizes common in bioinformatics.
df <- c()
thr <- 0.05/c(1, 5, 10, 50, 100, 500, 1000, 5000, 10000)
for(n in c(12, 25, 50, 100))
{
 delta <- rep(NA, length(thr))
 for(i in c(1:length(thr)))
 {
  delta[i] <- power.t.test(n = n, sd = 1, sig.level = thr[i], power = 0.95,
    type="two.sample", alternative = "two.sided")$delta
 }
 df <- rbind(df, data.frame(N = n, Thr = thr, Delta = delta))
}

p <- ggplot(df)
p <- p + geom_line(aes(x = Thr, y = Delta, group = factor(N), color = factor(N)), linetype = "dotted")
p <- p + geom_point(aes(x = Thr, y = Delta, group = factor(N), color = factor(N)))
p <- p + scale_y_continuous("Difference of means", breaks = c(0.5, 1, 1.5, 2, 2.5, 3), formatter = "comma")
p <- p + scale_x_log10("Significance threshold") + scale_colour_discrete("Group size")
p
Minimal effect size
The figure clearly shows that all detectable effects must be quite large when the significance threshold is decreased by a factor of 100 or 1000. For twelve-element groups, the means must be separated by more than two standard deviations, which is a really rare and strong effect. These limitations are particularly relevant when we test many hypotheses together---in such settings one must divide the significance threshold by the number of hypotheses to be tested. Consequently, it is really important to reduce the number of hypotheses before testing, using either biological background knowledge or some other criteria. We return to this issue in later sections.

What is Paired T-test Useful for

In many cases, the values measured in two groups are linked together. For instance, if we study the effect of a drug on blood pressure, the measurements of a patient before and after the treatment form a pair, and we can ask the natural question "Does the treatment decrease the blood pressure of individual patients?" rather than pondering "Is the overall level of blood pressure lower afterwards?" When the in-group variance is large, changes that are clearly observable at the individual level may be much less clear at the group level. Hence, a paired t-test, which utilises this type of natural link, can be much more sensitive (more powerful).

x <- rnorm(20)
y <- x + rnorm(20, mean = 0.1, sd = 0.05)
df <- data.frame(X = x, Y = y)

p <- ggplot(df)
p <- p + geom_boxplot(aes(x = "Case", y = X)) + geom_boxplot(aes(x = "Control", y = Y))
p <- p + geom_point(aes(x = "Case", y = X)) + geom_point(aes(x = "Control", y = Y))
p <- p + geom_segment(aes(x = "Case", y = X, xend = "Control", yend = Y), color = "blue", linetype = "dotted")
p <- p + scale_x_discrete("") + opts(aspect.ratio = 2)
p
Paired measurements
t.test(x, y, paired = FALSE)
Welch Two Sample t-test
data: x and y
t = -0.2556, df = 37.999, p-value = 0.7997
...
t.test(x, y, paired = TRUE)
Paired t-test
data: x and y
t = -8.034, df = 19, p-value = 1.574e-07
Hence, one should always try paired tests whenever it is possible to group the data into pairs. As a technical remark, note that the test pairs x_i with y_i, so any error in the ordering of the x and y vectors can lead to gibberish. Simple pairing is only one way to group the data inside the sub-populations to be compared. In some cases, natural subgroups consist of three elements, e.g., when asking whether the height of a son is consistently larger than the heights of his mother and father. Such tests are often regarded as tests on stratified data and the way the data is split is usually defined by some block design.
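To see how fragile the pairing is, we can deliberately shuffle one of the vectors from the example above (a sketch, not part of the original text); shuffling destroys the links and the paired test loses its advantage.
# sample(y) permutes y and thus breaks the correspondence between x_i and y_i
t.test(x, sample(y), paired = TRUE)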

Although the paired t-test is often more sensitive, the standard t-test can beat the paired test in settings where the individual changes are more random than the overall average change between the case and control groups. The file data-generators.R defines three data sources: DataSource1, DataSource2 and DataSource3. Experiment with them to see which of them produces samples that differ more at the group level than at the individual level. The functions return a list of X and Y vectors, which can be accessed as in the following example.

source("data-generators.R")
df <- DataSource1(100)
t.test(df$X, df$Y)

T-tests Are not Robust

The t-test comes with certain limitations. In particular, it is not guaranteed to give sensible results when its assumptions are not satisfied, i.e., when the in-group distribution of measurements is neither normal nor "close" to a normal distribution. As an illustration, consider a setting where the measurements of both groups come from the same distribution, but this distribution differs from the normal distribution. Let us first observe the case of a normal distribution with outliers.

# Let us define the distribution
NormalDistWithOutliers <- function(n)
{
 rnorm(n) + 10000 * (runif(n) < 0.001)
}
If we draw vectors x and y of sizes 5, 25 and 50 from this distribution
df <- c()
for(k in c(5, 25, 50))
{
 p <- rep(NA, 5000)
 for(i in c(1:5000))
 {
  p[i] <- t.test(NormalDistWithOutliers(k), NormalDistWithOutliers(k))$p.value
 }
 df <- rbind(df, data.frame(N = k, Pvalue = p))
}
then the resulting distribution should contain one p-value below 2⋅10^-4, 10 p-values below 2⋅10^-3 and 100 p-values below 2⋅10^-2 when the computed value is indeed a p-value. In practice, the result is quite different for small sample sizes.
p <- ggplot(subset(df, Pvalue <= 0.05))
p <- p + geom_jitter(aes(x = Pvalue, y = 0.1, colour = factor(N)), position = position_jitter(height = 0.1))
p <- p + facet_grid(N~.) + opts(legend.position = "none")
#Ugly hack that works now and should break later when the package is fixed
p <- p + geom_vline(xintercept = log10(2e-04), linetype = "dashed", colour = "red")
p <- p + geom_vline(xintercept = log10(2e-03), linetype = "dashed", colour = "green")
p <- p + geom_vline(xintercept = log10(2e-02), linetype = "dashed", colour = "blue")
# End of hack
p <- p + scale_x_continuous("Pvalue", trans = "log10")
p <- p + scale_y_continuous("") + opts(axis.text.y = theme_blank(), axis.ticks = theme_blank())
p
T-test failure
Namely, a single outlier completely distorts the p-value calculation. Hence, the number of low p-values is quite large for small samples. The differences at higher p-value thresholds are smaller---the probability of seeing an outlier is rather small and thus its effect on the distribution of computed p-values is minuscule. To put it another way, the interesting low p-value cases in such a setting are most probably generated by outliers and not by big differences in the means of the two groups.

Similar or even bigger aberrations occur for long-tailed distributions such as the Laplace distribution

LaplaceDist <- function(n)
{
 rexp(n)*(-1)^(runif(n) < 0.5)
}
...
T-test failure
and for discrete distributions with few states, such as the Bernoulli distribution. Since a direct repetition of the experiment with a Bernoulli distribution causes explicit t-test failures (with small samples, both groups can turn out to be constant, in which case the t-statistic is undefined), we show the results for the following bimodal distribution
BimodDist <- function(n)
{
 round(runif(n))+ 0.001*runif(n, min = -1, max = 1)
}
...
T-test failure
Note those extreme p-values, which should never occur in such a small sample.
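For comparison, here is a minimal sketch (not part of the original text) of the explicit failure mentioned above: if both Bernoulli samples happen to be constant, the standard error is zero and t.test aborts.
> t.test(rep(0, 5), rep(1, 5))
Error in t.test.default(rep(0, 5), rep(1, 5)) : data are essentially constant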

The entire code of the experiment can be downloaded as the file t-test-failure.R.

Wilcoxon Test as a Robust Analogue of T-test

T-tests are fragile because the p-value is computed from the sample mean and variance, and these point estimates are known to be fragile against outliers. There are two principal ways to fix the problem. First, we can use more robust estimates of the mean and variance. This solves the issue of outliers, but the test will still fail if the measurements are not distributed according to the normal distribution. The second alternative is the Wilcoxon test, which relies on non-parametric statistics. In layman's terms, such tests order the data, throw away the actual values and then compute the p-value solely based on the ordering. As a result, outliers have limited impact on the outcome. The corresponding test is known as the Mann-Whitney-Wilcoxon or Wilcoxon rank-sum test. It has the same syntax in GNU R as the t-test and, as before, one can choose between unpaired and paired tests by setting the flag paired.

> x <- rnorm(10)
> y <- rnorm(10, mean = 1)
> wilcox.test(x, y)
Wilcoxon rank sum test
data: x and y
W = 55, p-value = 0.7394
alternative hypothesis: true location shift is not equal to 0
> wilcox.test(x, y, paired = TRUE)
Wilcoxon signed rank test
data: x and y
...

The main difference between the t-test and the Wilcoxon test lies in the null hypothesis and the assumptions. The null hypothesis states that the measurements of both groups come from the same distribution, whereas the alternative hypothesis states that the distributions are different in the following sense: if you draw one sample from each distribution, the probability that the first sample is larger than the second is not 0.5. Secondly, the Wilcoxon test is a bit less powerful, that is, if the assumptions of the t-test are satisfied, then the p-values from the Wilcoxon test are consistently larger than the p-values from the t-test, as demonstrated by the following examples.

p <- rep(NA, 1000)
q <- rep(NA, 1000)
for(i in c(1:1000))
{
 x <- rnorm(10)
 y <- rnorm(10, mean = 1)
 p[i] <- t.test(x,y)$p.value
 q[i] <- wilcox.test(x,y)$p.value
}
indx <- order(p)
plot(log(p[indx], 10), log(q[indx], 10), xlim = c(-6, 0), ylim = c(-6, 0))
lines(c(-6, 0), c(-6, 0), col = "red")
Comparison of tests
p <- rep(NA, 1000)
q <- rep(NA, 1000)
for(i in c(1:1000))
{
 x <- rnorm(100)
 y <- rnorm(100, mean = 1)
 p[i] <- t.test(x,y)$p.value
 q[i] <- wilcox.test(x,y)$p.value
}
indx <- order(p)
plot(log(p[indx], 10), log(q[indx], 10), xlim = c(-25, 0), ylim = c(-25, 0))
lines(c(-25, 0), c(-25, 0), col = "red")
Comparison of tests
As can be seen from the figures, the difference between the p-values increases with the sample size and with the p-values themselves. Nevertheless, the difference is not dramatic: if the difference is significant according to the t-test, it is likely to be significant according to the Wilcoxon test as well.

Null hypothesis. Above we simplified the definition of the null hypothesis for the Wilcoxon test. The actual null hypothesis states that instead of sorting the observed values and assigning ranks to them, we could assign the ranks randomly without looking at the data. This assumption is clearly satisfied when the data in both groups is generated by the same distribution, as all permutations of the data then have exactly the same probability or density value. It is somewhat non-trivial to construct two distributions that satisfy this assumption but are not identical; nevertheless, such distributions do exist. The additional assumption made about the alternative hypothesis is sufficient to rule out such cases.

Multiple testing and p-value correction

Although statisticians do not like to talk about it, there are some restrictions you must comply with before applying any statistical method. In particular, you should design the methodology before seeing the data; in our case, you should fix the hypothesis to be tested before you get the data. In practice, such situations rarely occur: unless you are the biologist who conducts the experiments, you are always presented with the data first. The fact that you see the data before fixing the methodology allows you to cheat. For instance, it is almost impossible to choose, in advance, 100 locations out of 300 throws of a fair coin so that all of them come up heads. However, it is trivial to do so after the experiment.

> x <- round(runif(300))
> which(x==1)[1:100]
You can never be sure that this type of blatant cheating does not occur when you apply complex statistical analysis or machine learning algorithms. Hence, there are specific techniques to limit the potential damage.

P-value correction is one such technique, used in the case of multiple hypothesis testing. Coming back to our experiment: if you choose the 100 locations before seeing the data, the probability that all of them are heads is 2^-100 ≈ 7.9⋅10^-31 and thus seeing only heads is highly significant. However, if you choose the 100 locations out of 300 afterwards, the result is not even marginally significant, as there are about 4.2⋅10^81 potential sets of locations the analyst can choose from. To put it another way, before seeing the data you have not committed to any particular hypothesis; rather, you try all 4.2⋅10^81 potential hypotheses in parallel. When the data arrives, you present only the best result and abandon all other hypotheses that are not consistent with the data. More generally, assume that a statistical test T_i falsely rejects the null hypothesis with probability p_i; then the probability that at least one of the n tests falsely rejects is at most p_1 + ... + p_n. Hence, if we want to preserve the 5% significance level even though we are allowed to choose among n hypotheses, we must make sure that the sum of the individual significance thresholds stays below 5%. Usually, this is achieved by multiplying the individual p-values by the number of performed statistical tests. This p-value correction method is known as the Bonferroni correction and is invoked as follows.

>p.adjust(pvalues, method="bonferroni")
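The Bonferroni adjustment simply multiplies each p-value by the number of tests and truncates the result at one; a quick sketch with made-up p-values shows the equivalence.
> pvalues <- c(0.001, 0.01, 0.04, 0.2)
> p.adjust(pvalues, method = "bonferroni")
[1] 0.004 0.040 0.160 0.800
> pmin(1, length(pvalues) * pvalues)
[1] 0.004 0.040 0.160 0.800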

At first glance, the Bonferroni correction seems rather crude, as it is based on a quite loose union bound. However, simulations show that the bound is rather precise when the hypotheses are independent (e.g. stated in terms of independent variables).

beta <- c()
N <- c(1, 10, 100, 1000, 10000)
for(n in N)
{
 result <- rep(NA, 10000)
 for(i in c(1:10000))
 {
  result[i] <- any(p.adjust(runif(n), method = "bonferroni") < 0.05)
 }
 beta <- c(beta, sum(result)/10000)
}
beta_min <- beta - 1/sqrt(10000)
beta_max <- beta + 1/sqrt(10000)
df <- data.frame(N = N, beta = beta, beta_min = beta_min, beta_max = beta_max)
p <- ggplot(df) + geom_bar(aes(x = factor(N), y = beta), fill = "aquamarine")
p <- p + geom_point(aes(x = factor(N), y = beta))
p <- p + geom_errorbar(aes(x = factor(N), ymin = beta_min, ymax = beta_max), width = 0.5)
p <- p + geom_line(aes(x = factor(N), y = beta), linetype = "dotted", color = "blue")
p <- p + scale_x_discrete("Number of hypotheses")
p <- p +scale_y_continuous("Fraction of false positives", limits=c(0, 0.07))
p
Simulation results
The underlying reason for this somewhat surprising result is rather simple. The Bonferroni correction reduces the significance threshold proportionally to the number of hypotheses. As a result, the probability of more than one false positive occurring at the same time remains small, although the number of trials increases. The precise theoretical bound on the false positives shows a slight decrease but remains rather close to 5%.
p <- rep(NA, 1000)
for(N in c(1:1000))
{
 p[N] <- pbinom(0, 10 * N, prob = 0.05/(10 * N), lower.tail = FALSE)
}
df <- data.frame(x = 10 * c(1:1000), y = p)
p <- ggplot(df) + geom_line(aes(x=x, y=y))
p <- p + geom_hline(yintercept = 0.05, linetype = "dotted", colour = "red")
p <- p + scale_x_continuous("Number of independent hypotheses")
p <- p + scale_y_continuous("Fraction of false positives", limits=c(0.045, 0.05))
p
Theoretical bounds
To summarise, the Bonferroni correction is rather precise in practice, as one mostly tests hypotheses that are independent or nearly independent. On the other hand, the Bonferroni correction dramatically increases the proportion of false negatives. In many cases this is not a good trade-off, especially when hypothesis testing is used as a first filter towards interesting variables or phenomena. In such cases, we can tolerate false positives as long as they form only a limited fraction of all tests that rejected H0. The corresponding p-value adjustment method is known as FDR (False Discovery Rate) and it is invoked as follows
> p.adjust(pvalues, method = "fdr")
After the correction, a 5% cutoff threshold assures that on average at most 5% of the tests that pass are false positives, a 10% cutoff that at most 10% of the passed tests are false positives, and so on.
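The practical difference between the two corrections can be seen in a small simulation; the mixture below is made up purely for illustration (900 p-values from the null and 100 very small ones standing in for true effects).
# Hypothetical mixture of p-values: 900 nulls and 100 strong signals
pvalues <- c(runif(900), runif(100, min = 0, max = 1e-4))
# Number of rejections at the 5% level after each correction
sum(p.adjust(pvalues, method = "bonferroni") < 0.05)
sum(p.adjust(pvalues, method = "fdr") < 0.05)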

Use of background information. Even with a liberal p-value correction method, the number of hypotheses to be considered directly reduces the power of the statistical tests---some effects that would be detectable individually remain undetected in multiple hypothesis testing. Hence, it is important to reduce the set of hypotheses before testing. If there is biological or statistical background information that allows one to eliminate a large portion of the plausible hypotheses, one should always use it. Of course, there are some potential drawbacks as well. Sometimes the background information is simply wrong and thus effectively blocks us from discovering new relevant facts. Sometimes the background information is tainted---it uses the same dataset to filter out some hypotheses and thus gives us a false sense of confidence. In other words, use background information whenever possible, but also check whether it is relevant and applicable in your context.

The dangling threat of spurious results

Missing: Write about the difference between correlation and causation and why eating ice cream is not the main cause of drownings.