Suppose we’re measuring some quantity, like the height of a person. Before we take the measurement, we can only make a reasonable guess about its value - that it will probably be somewhere between 140 and 200 cm. We model this uncertainty by assuming that the height is a random variable \(X: \Omega \rightarrow \mathbb{R}_+\) with some probability distribution, e.g. normal.
When we take this measurement, we perform a realization of \(X\) - that is, we “draw” \(\omega \in \Omega\) and substitute it to get \(X(\omega)\), which is now a number, e.g. 178.2 cm. Performing a statistical experiment - a measurement, or a simulation - is nothing other than fixing \(\omega\) in our random variables.
We usually assume that each measurement is a realization of an independent random variable. Therefore, our data consists of a set of realizations of random variables, \(X_1 (\omega), X_2 (\omega), \dots, X_n(\omega)\). We usually assume that we know something about the probability distributions of \(X_1, \dots, X_n\), and that the distributions depend on some unknown parameters like the mean or variance. We say that the distributions of \(X_i\) belong to some family of distributions (for example, the family of all normal distributions with variance equal to 0.1). These assumptions are the basis of all statistical reasoning, such as parameter estimation or statistical tests.
The distributions of our random variables can be either assumed by looking at the data, or derived from some assumptions about the random process which generates them. In the case of height, we can use both approaches. First, if we measure a lot of people, we will notice that the histogram resembles the normal distribution. Second, we know that height is determined by lots of factors (in fact, over 180 genes and numerous environmental conditions). Quantities which depend on many independent factors usually follow a normal distribution because of the Central Limit Theorem.
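As a quick illustration of this effect (a sketch with made-up numbers, not real height data), we can simulate quantities that are sums of many independent factors and watch the bell shape emerge:

```r
# Hypothetical illustration: each simulated "height" is a sum of 180
# independent uniform factors; by the CLT, such sums are approximately normal.
set.seed(42)
heights <- replicate(10000, sum(runif(180)))
hist(heights, breaks = 50)  # a bell-shaped histogram
```

The mean of each sum is 90 and its standard deviation is \(\sqrt{180/12} \approx 3.87\), so the histogram is concentrated around 90 with the familiar Gaussian shape.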
Your tasks:
We will now cover the basics of data exploration in R. In most applications, we will keep our actual data in data frames. Data frames are simply tables with data. Load and inspect an example data frame, called iris, by typing
data(iris)
View(iris)
iris[1:4, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
The iris dataset contains measurements of several species of the Iris flower: Iris virginica, Iris versicolor and Iris setosa.
Columns of data frames are named. We can refer to those columns either by square brackets, like in matrices, or by their names. For example, typing iris$Species is the same as iris[, 5].
iris[1:10,5]
## [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
iris$Species[1:10]
## [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Data frames are created as follows:
example.frame <- data.frame("Name1" = rep(1:10, 2), "Name2" = rep(c("A", "B"), each=10))
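To check the result, we can inspect the frame defined above (a quick sketch):

```r
# the same frame as in the text: 20 rows, two named columns
example.frame <- data.frame("Name1" = rep(1:10, 2), "Name2" = rep(c("A", "B"), each = 10))
dim(example.frame)      # number of rows and columns
head(example.frame, 3)  # first three rows
```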
Your tasks:
two.distributions
##    Name       Value
##  1    A -2.68133300
##  2    A  1.80512134
##  3    A -0.15617911
##  4    A -0.03401183
##  5    A  6.38742729
##  6    A  2.25297318
##  7    A  1.53619089
##  8    A -3.68275074
##  9    A  0.29859534
## 10    A  8.93306519
## 11    B -2.98121368
## 12    B  3.00001908
## 13    B -1.55908853
## 14    B -1.24934815
## 15    B -2.99439260
## 16    B  0.81269739
## 17    B -1.12969737
## 18    B -4.15613732
## 19    B  3.75391770
## 20    B -4.17590456
Compare the empirical means of the observed values. Does the difference in empirical means agree with the real means of the distributions?
tapply(X, I, F) applies the function F to the values of the vector X, grouped by the values of the index vector I. For example, tapply(c(1, 2, 3, 4), c("A", "A", "B", "B"), sum) will apply the function sum to the vectors c(1, 2) and c(3, 4), and return the vector c(3, 7). This function is extremely useful in tasks like this one. Use it to apply the function mean to the observed values, grouped by the distribution names.
empirical.means
##         A         B 
##  1.465910 -1.067915
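For reference, the whole computation fits in one line. The data frame below is only a hypothetical reconstruction of two.distributions (the true means -0.1 and 0.1 with standard deviation 3 are taken from the examples later in the text), so your numbers will differ:

```r
# hypothetical reconstruction: A ~ N(-0.1, 3^2), B ~ N(0.1, 3^2), 10 draws each
two.distributions <- data.frame(Name  = rep(c("A", "B"), each = 10),
                                Value = c(rnorm(10, -0.1, 3), rnorm(10, 0.1, 3)))
# group the values by the distribution name and average each group
empirical.means <- tapply(two.distributions$Value, two.distributions$Name, mean)
empirical.means
```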
We will now cover the ggplot2 library more extensively. We have already used the function qplot(), which is a much simplified function for fast plotting (qplot stands for quick plot). A more versatile function is ggplot().
To create a plot using ggplot(), we need to store our data in a data frame. By typing ggplot(some_data_frame), we create a basic plot object, which can be assigned to a variable. To get an actual plot, we need to add a function specifying the plot’s geometry (histogram/line plot/bar plot etc.). For example, to create a histogram of petal width from the iris dataframe, type:
iris.plot <- ggplot(data = iris)
iris.plot + geom_histogram(aes(x=Petal.Width))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The aes parameter specifies the aesthetics, i.e. the variables mapped to the plot. By default, geom_histogram accepts the name of the variable plotted on the x axis and plots the counts on the y axis. Additionally, we can map a variable to the colour by specifying the fill aesthetic, and specify the width of the histogram’s bins:
iris.plot + geom_histogram(aes(x=Petal.Width, fill=Species), binwidth=0.1)
To create a scatter plot representing the dependence of petal width on petal length, we would use the geom_point() function like this:
iris.plot + geom_point(aes(x=Petal.Length, y=Petal.Width, colour=Species))
It is important to remember to pass the names of the variables to the aes() function without quotation marks. Compare the previous plot with the following one:
ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, colour="Species"))
Your tasks:
Plot the values from the two.distributions data frame. Highlight the distributions with different colors. Store the plot in the variable two.plot.
two.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
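One way to build such a plot (a sketch; the data frame below is a hypothetical reconstruction of two.distributions, using the true means -0.1 and 0.1 from the later examples):

```r
library(ggplot2)

# hypothetical reconstruction of the data frame used in the text
two.distributions <- data.frame(Name  = rep(c("A", "B"), each = 10),
                                Value = c(rnorm(10, -0.1, 3), rnorm(10, 0.1, 3)))
# histogram of the values, with the two distributions in different colours
two.plot <- ggplot(data = two.distributions) +
  geom_histogram(aes(x = Value, fill = Name))
two.plot
```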
In one of the previous examples, the empirical means of the two distributions could seem a bit odd. In fact, sometimes the empirical mean of observations from distribution B is smaller than the empirical mean of observations from distribution A, even though the true mean of B is greater than the true mean of A.
Suppose that we don’t know the true means of those distributions, and we want to compare them to check if the true means are different. For example, suppose that those distributions measure the effectiveness of two medical treatments. How can we check if treatment B is really better than treatment A?
We already know that the empirical means are different. This difference is caused either by an actual difference in the true means, or by random noise. To check if the difference is significant, i.e. non-random, we need to run a statistical test.
We will use the two-sample t-test to check whether we can assume that the difference in empirical means is non-random. The t-test compares the estimated, normalized means. The test statistic for two samples \(X_1, \dots, X_n\) and \(Y_1, \dots, Y_m\) is
\[T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}},\]
where \(S_X^2\) is the estimator of the variance of the first sample (and \(S_Y^2\) of the second). If the variables are normally distributed, then the \(T\) statistic approximately follows the Student’s t distribution; for the statistic above, R’s t.test computes the degrees of freedom with the Welch–Satterthwaite approximation (for the pooled, equal-variance version they equal \(n+m-2\)). If the variables have a different distribution, then by the Central Limit Theorem \(T\) is still approximately normally distributed for large samples, so the test remains approximately valid.
If the absolute value of the computed \(T\) statistic is very large, then we’re fairly certain that the means are different. To check whether the value is large enough, we compute the probability - assuming that the true means are in fact equal - that the absolute value of the \(T\) statistic is larger than the observed value. This probability is called the p-value.
In our case, since the density of the Student’s t distribution is symmetric, the p-value is equal to the probability
\[ \mathbb{P}(T > |T(\omega)|) + \mathbb{P}(T < -|T(\omega)|) = 2\mathbb{P}(T < -|T(\omega)|),\]
where \(T(\omega)\) is our observed value. The values of the cumulative distribution function of the Student’s t distribution can be computed numerically.
Your tasks:
Write a function which runs the Student’s t test for two samples and returns the p-value.
Use pt(x, df) for the cumulative distribution function at point x of the distribution with df degrees of freedom.
Avoid naming your test statistic T, which is a boolean variable (a shorthand for TRUE) in R.
Use your function to run a t-test on the values from the two.distributions data frame. Compare your results with the results of the function t.test.
Check what happens if you use 1000 and 100 000 observations in two.distributions.
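One possible implementation of such a function (a sketch; it uses the Welch form of the statistic and degrees of freedom, which is also the default of t.test, so the exact p-value depends on which degrees-of-freedom formula you choose):

```r
my.t.test <- function(x, y) {
  n <- length(x)
  m <- length(y)
  vx <- var(x) / n
  vy <- var(y) / m
  # the two-sample T statistic from the formula above
  t.stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))
  # two-sided p-value from the symmetric t distribution
  2 * pt(-abs(t.stat), df)
}
```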
distr.A <- two.distributions$Value[two.distributions$Name == "A"]
distr.B <- two.distributions$Value[two.distributions$Name == "B"]
paste("My p-value:", my.t.test(distr.A, distr.B))
## [1] "My p-value: 0.106185245293114"
t.test(distr.A, distr.B)
##
## Welch Two Sample t-test
##
## data: distr.A and distr.B
## t = 1.6961, df = 16.503, p-value = 0.1086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6252438 5.6928931
## sample estimates:
## mean of x mean of y
## 1.465910 -1.067915
The p-value is large. This means that, even if the true means were equal, there would be a large probability of observing a difference in empirical means at least this big. In other words, our difference is very typical under the null hypothesis, so we can’t conclude that the treatments A and B give different effects. The difference in empirical means may well be due to random noise.
t.test(rnorm(1000, -0.1, 3), rnorm(1000, 0.1, 3))
##
## Welch Two Sample t-test
##
## data: rnorm(1000, -0.1, 3) and rnorm(1000, 0.1, 3)
## t = -0.81239, df = 1997.6, p-value = 0.4167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3728548 0.1544311
## sample estimates:
## mean of x mean of y
## -0.108880844 0.000330997
t.test(rnorm(100000, -0.1, 3), rnorm(100000, 0.1, 3))
##
## Welch Two Sample t-test
##
## data: rnorm(1e+05, -0.1, 3) and rnorm(1e+05, 0.1, 3)
## t = -16.029, df = 2e+05, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2412697 -0.1886963
## sample estimates:
## mean of x mean of y
## -0.1114743 0.1035087
Due to the large variance, we need 100 000 observations to get a p-value small enough to conclude that there is a difference between the treatments.
ggplot(data=data.frame("Name" = rep(c("A", "B"), each=10000), "Value" = c(rnorm(10000, -0.1, 3), rnorm(10000, 0.1, 3)))) + geom_histogram(aes(x=Value, fill=Name), binwidth=0.1)
Homework:
Verify the hypothesis that the mean of distribution B is greater than the mean of distribution A.
We will now use several statistical tests implemented in the R language to analyze the iris dataset.
full.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Your tasks:
Compare the mean sepal width of the Iris species. Use the t-test to check which differences are not random.
Use tapply to compute the means.
Use the t.test function.
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]
ggplot(data=iris, aes(x=Sepal.Width, fill=Species)) + geom_histogram(binwidth=0.1)
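Combining the two hints, the comparison could look like this (a sketch showing one of the pairwise tests):

```r
data(iris)
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]

# mean sepal width of each species
means <- tapply(iris$Sepal.Width, iris$Species, mean)
means
# is the difference between setosa and versicolor non-random?
t.test(setosa$Sepal.Width, versicolor$Sepal.Width)
```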
Use the t-test to verify the hypothesis that the difference in sepal width of Iris versicolor and Iris virginica is greater than 0.5.
Use the mu and alternative parameters of the function t.test.
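For example (a sketch; here the alternative states that virginica's mean sepal width exceeds versicolor's by more than 0.5 - adjust the order of the samples and the direction to your hypothesis):

```r
data(iris)
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]

# H0: the difference in means equals 0.5; H1: it is greater than 0.5
t.test(virginica$Sepal.Width, versicolor$Sepal.Width,
       mu = 0.5, alternative = "greater")
```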
Verify the normality assumption of the t-test.
Use the shapiro.test function.
The null hypothesis in the Shapiro-Wilk test is that the sample follows a normal distribution. If the p-value is large enough, we do not reject the null hypothesis. As in every statistical test, this does not prove that the distribution is normal - it only says that we do not have enough data to conclude that the deviation from normality is non-random. However, standard practice in this kind of test is to be optimistic - and assume that the distribution is in fact normal if the p-value is large enough (say, p > 0.05).
In short, if the Shapiro-Wilk test shows a low p-value, then we cannot use methods which rely on the normality assumption (e.g. the t-test or ANOVA). If the p-value is large enough, then we might - and we do, even though we cannot formally prove that such a procedure is correct.
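For example, for one of the groups (a sketch; run it for every group you compare):

```r
data(iris)
setosa <- iris[iris$Species == "setosa", ]
# Shapiro-Wilk test: H0 says the sepal widths are normally distributed
shapiro.test(setosa$Sepal.Width)
```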
Check if the sepal width is correlated with the sepal length.
Create a scatter plot of the two variables, with colours corresponding to species. Use geom_point.
Use the Pearson’s correlation test implemented in cor.test to check if there is a linear dependence between sepal width and length in the case of Iris virginica and in the case of Iris setosa. Compare the results with the plot.
sp.len.sp.width
Both correlations are visible. However, there are some slightly atypical observations in virginica which could influence the results (numbers 18 and 32):
virginica$id <- 1:50
ggplot(data=virginica, aes(x=Sepal.Width, y=Sepal.Length)) + geom_text(aes(label=id))
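The correlation tests themselves could look like this (a sketch):

```r
data(iris)
setosa <- iris[iris$Species == "setosa", ]
virginica <- iris[iris$Species == "virginica", ]

# Pearson's correlation test for each species separately
cor.test(setosa$Sepal.Width, setosa$Sepal.Length)
cor.test(virginica$Sepal.Width, virginica$Sepal.Length)
```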
Run the correlation test for virginica without observations 18 and 32.
If x is a vector and y is an integer vector, then x[-y] will return the vector x without the coordinates listed in y. For example, c(5, 8, 1, 6)[-c(1, 3)] will return the vector c(8, 6).
Maybe those observations do not come from the true distribution, i.e. we have outliers in our data? Run the Grubbs test for outliers.
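Using the subsetting trick above, the reduced correlation test could be written as (a sketch):

```r
data(iris)
virginica <- iris[iris$Species == "virginica", ]
# drop observations 18 and 32 of the virginica subset from both variables
cor.test(virginica$Sepal.Width[-c(18, 32)],
         virginica$Sepal.Length[-c(18, 32)])
```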
Use the package outliers and its function grubbs.test for the sepal widths of Iris virginica.
We do not see the correlation without observations 18 and 32. However, by the results of the Grubbs test, we cannot be certain whether those observations come from a different distribution (i.e. whether they are outliers). So we should either stick with the conclusion that there is some non-zero correlation, or - better yet - collect more data.