Understanding data

Suppose we’re measuring some quantity, like the height of a person. Before we take the measurement, we can only make a reasonable guess about its value - that it will probably be somewhere between 140 and 200 cm. We model this uncertainty by assuming that the height is a random variable \(X: \Omega \rightarrow \mathbb{R}_+\) with some probability distribution, e.g. normal.

When we take this measurement, we perform a realization of \(X\) - that is, we “draw” \(\omega \in \Omega\) and substitute it to get \(X(\omega)\), which is now a number, e.g. 178.2 cm. Performing a statistical experiment - a measurement, or a simulation - is nothing else but fixing \(\omega\) in our random variables.

We usually assume that each measurement is a realization of an independent random variable. Therefore, our data consists of a set of realizations of random variables, \(X_1 (\omega), X_2 (\omega), \dots, X_n(\omega)\). We usually assume that we know something about the probability distributions of \(X_1, \dots, X_n\), and that the distributions depend on some unknown parameters like the mean or variance. We say that the distributions of \(X_i\) belong to some family of distributions (for example, the family of all normal distributions with variance equal to 0.1). These assumptions are the base of all statistical reasoning, like parameter estimation or statistical tests.

The distributions of our random variables can be either assumed by looking at the data, or derived from some assumptions about the random process which generates them. In case of height, we can use both approaches. First, if we measure a lot of people, we will notice that the histogram resembles the normal distribution. Second, we know that height is determined by lots of factors (in fact, over 180 genes and numerous environmental conditions). Quantities which depend on many independent factors usually follow a normal distribution because of the Central Limit Theorem.
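
As a quick illustration of the last point, the following simulation (with arbitrary numbers) treats height as a baseline plus a sum of many small independent contributions; the histogram of the simulated values is roughly bell-shaped.

# Illustrative simulation: each "height" is 160 cm plus a sum of 200 small
# independent contributions; the histogram looks approximately normal.
set.seed(1)                                  # for reproducibility
heights <- 160 + replicate(1000, sum(runif(200, -1, 1)))
hist(heights, breaks = 30)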

Your tasks:

Data frames

We will now cover basics of data exploration in R. In most applications, we will keep our actual data in data frames. Data frames are simply tables with data. Load and inspect an example data frame, called iris, by typing

data(iris)
View(iris)
iris[1:4, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa

The iris dataset contains information about certain measurements of several species of the Iris flower: Iris virginica, Iris versicolor and Iris setosa.

Columns of data frames are named. We can refer to a column either with square brackets, as in matrices, or by its name. For example, typing iris$Species is the same as iris[, 5].

iris[1:10,5]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Data frames are created as follows:

example.frame <- data.frame("Name1" = rep(1:10, 2), "Name2" = rep(c("A", "B"), each=10))
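
To quickly check what we created, we can inspect the structure and the first rows (depending on the R version, Name2 will be stored as a character vector or as a factor):

str(example.frame)    # 20 rows, two columns
head(example.frame)   # first six rows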

Your tasks:

two.distributions
##    Name       Value
## 1     A -2.68133300
## 2     A  1.80512134
## 3     A -0.15617911
## 4     A -0.03401183
## 5     A  6.38742729
## 6     A  2.25297318
## 7     A  1.53619089
## 8     A -3.68275074
## 9     A  0.29859534
## 10    A  8.93306519
## 11    B -2.98121368
## 12    B  3.00001908
## 13    B -1.55908853
## 14    B -1.24934815
## 15    B -2.99439260
## 16    B  0.81269739
## 17    B -1.12969737
## 18    B -4.15613732
## 19    B  3.75391770
## 20    B -4.17590456
empirical.means
##         A         B 
##  1.465910 -1.067915
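
The two.distributions data frame and the empirical.means vector shown above are produced in the tasks. One plausible construction (an assumption - the distributions used here, normal with standard deviation 3 and means -0.1 and 0.1, are borrowed from the examples further below, so your numbers will differ) is:

# Assumed construction: 10 observations from N(-0.1, 3) labelled "A"
# and 10 observations from N(0.1, 3) labelled "B".
two.distributions <- data.frame("Name"  = rep(c("A", "B"), each = 10),
                                "Value" = c(rnorm(10, -0.1, 3), rnorm(10, 0.1, 3)))
# Empirical mean of each group
empirical.means <- tapply(two.distributions$Value, two.distributions$Name, mean)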

The ggplot2 library

We will now cover the ggplot2 library more extensively. We have already used the function qplot(), a simplified interface for quick plotting (qplot stands for quick plot). A more versatile function is ggplot().

To create a plot using ggplot(), we need to store our data in a data frame. By typing ggplot(some_data_frame), we create a basic plot object, which can be assigned to a variable. To get an actual plot, we need to add a function specifying the plot’s geometry (histogram/line plot/bar plot etc.). For example, to create a histogram of petal width from the iris dataframe, type:

library(ggplot2)   # if not already loaded
iris.plot <- ggplot(data = iris)
iris.plot + geom_histogram(aes(x=Petal.Width))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The aes parameter specifies the aesthetics, i.e. the variables mapped to the plot’s axes. By default, geom_histogram accepts the name of the variable plotted on the x axis and plots the counts on the y axis. Additionally, we can map a variable to the fill colour by specifying the fill aesthetic, and set the bin width of the histogram:

iris.plot + geom_histogram(aes(x=Petal.Width, fill=Species), binwidth=0.1)

To create a scatter plot showing the dependence of petal width on petal length, we use the geom_point() function like this:

iris.plot + geom_point(aes(x=Petal.Length, y=Petal.Width, colour=Species))

It is important to remember to pass the names of the variables to the aes() function without quotation marks. Compare the previous plot with the following one:

ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, colour="Species"))

Your tasks:

two.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
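
The two.plot object printed above comes from the tasks; one plausible definition (an assumption, following the iris example) is a histogram of Value with the fill colour mapped to Name:

# Assumed definition of two.plot: a histogram of both samples, filled by group
two.plot <- ggplot(data = two.distributions) +
  geom_histogram(aes(x = Value, fill = Name))
two.plot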

Statistical tests

In one of the previous examples, the empirical means of the two distributions may have seemed a bit odd. In fact, sometimes the empirical mean of observations from distribution B is smaller than the empirical mean of observations from distribution A, even though the true mean of B is greater than the true mean of A.

Suppose that we don’t know the true means of those distributions, and we want to compare them to check if the true means are different. For example, suppose that those distributions measure the effectiveness of two medical treatments. How can we check if treatment B is really better than treatment A?

We already know that the empirical means are different. This difference is caused either by actual difference in means, or by random noise. To check if a difference is significant, i.e. non-random, we need to run a statistical test.

We will use the two-sample t-test to check whether we can assume that the difference in empirical means is non-random. The t-test compares the difference of the sample means, normalized by its estimated standard error. The test statistic for two samples \(X_1, \dots, X_n\) and \(Y_1, \dots, Y_m\) is

\[T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}},\]

where \(S_X^2\) and \(S_Y^2\) are the estimators of the variances of the two samples. If the variables are normally distributed, the \(T\) statistic follows (approximately) the Student’s t distribution; for the equal-variance version of the test the number of degrees of freedom is \(n+m-2\), while for the unpooled statistic above R’s t.test() uses the Welch-Satterthwaite approximation. If the variables have a different distribution, then for large samples the Central Limit Theorem guarantees that the distribution of \(T\) is approximately standard normal, and hence close to the Student’s t distribution.

If the computed \(T\) statistic is large in absolute value, we can be fairly confident that the means are different. To check whether the observed value is large enough, we compute the probability that, under the null hypothesis of equal means, the absolute value of the \(T\) statistic exceeds the observed value. This probability is called the p-value.

In our case, since the density of the Student’s t distribution is symmetric, the p-value is equal to the probability

\[ \mathbb{P}(T > |T(\omega)|) + \mathbb{P}(T < -|T(\omega)|) = 2\,\mathbb{P}(T < -|T(\omega)|),\]

where \(T(\omega)\) is our observed value. The values of the cumulative distribution function of the Student’s t distribution can be computed numerically (in R, with the pt() function).
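
The function my.t.test used below is written as part of the tasks. A minimal sketch along the lines of the formula above might look as follows; the choice of \(n+m-2\) degrees of freedom here is an assumption, so the resulting p-value may differ slightly from the output below and from R’s Welch-corrected t.test().

# Sketch of a hand-made two-sample t-test, following the formula above.
# The degrees of freedom (n + m - 2) are an assumption; t.test() uses the
# Welch-Satterthwaite approximation instead.
my.t.test <- function(x, y) {
  n <- length(x); m <- length(y)
  t.stat <- (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m)
  2 * pt(-abs(t.stat), df = n + m - 2)   # two-sided p-value
}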

Your tasks:

distr.A <- two.distributions$Value[two.distributions$Name == "A"]
distr.B <- two.distributions$Value[two.distributions$Name == "B"]
paste("My p-value:", my.t.test(distr.A, distr.B))
## [1] "My p-value: 0.106185245293114"
t.test(distr.A, distr.B)
## 
##  Welch Two Sample t-test
## 
## data:  distr.A and distr.B
## t = 1.6961, df = 16.503, p-value = 0.1086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6252438  5.6928931
## sample estimates:
## mean of x mean of y 
##  1.465910 -1.067915

The p-value is large. This means that, under the assumption of equal true means, there is a large probability of observing a difference in empirical means at least as big as the one we got. In other words, our difference is quite typical, so we cannot conclude that treatments A and B give different effects - the difference in empirical means can be explained by random noise alone.

t.test(rnorm(1000, -0.1, 3), rnorm(1000, 0.1, 3))
## 
##  Welch Two Sample t-test
## 
## data:  rnorm(1000, -0.1, 3) and rnorm(1000, 0.1, 3)
## t = -0.81239, df = 1997.6, p-value = 0.4167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3728548  0.1544311
## sample estimates:
##    mean of x    mean of y 
## -0.108880844  0.000330997
t.test(rnorm(100000, -0.1, 3), rnorm(100000, 0.1, 3))
## 
##  Welch Two Sample t-test
## 
## data:  rnorm(1e+05, -0.1, 3) and rnorm(1e+05, 0.1, 3)
## t = -16.029, df = 2e+05, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2412697 -0.1886963
## sample estimates:
##  mean of x  mean of y 
## -0.1114743  0.1035087

Due to the large variance relative to the difference in means, we need around 100 000 observations per group to get a p-value small enough to conclude that there is a difference between the treatments.
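
One way to quantify this is a power calculation with power.t.test(): with a true mean difference of 0.2 and a standard deviation of 3, samples of 1000 per group detect the difference only about a third of the time, while samples of 100 000 per group detect it essentially always.

# Power of the two-sample t-test for a mean difference of 0.2 with sd = 3
power.t.test(n = 1000,   delta = 0.2, sd = 3)$power   # roughly 0.3
power.t.test(n = 100000, delta = 0.2, sd = 3)$power   # essentially 1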

ggplot(data=data.frame("Name" = rep(c("A", "B"), each=10000), "Value" = c(rnorm(10000, -0.1, 3), rnorm(10000, 0.1, 3)))) + geom_histogram(aes(x=Value, fill=Name), binwidth=0.1)

Homework:

Analysis of the iris dataset

We will now use several statistical tests implemented in the R language to analyze the iris dataset.

full.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Your tasks:

setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]
ggplot(data=iris, aes(x=Sepal.Width, fill=Species)) + geom_histogram(binwidth=0.1)

The null hypothesis in the Shapiro-Wilk test is that the sample follows a normal distribution. If the p-value is large enough, we do not reject the null hypothesis. As in every statistical test, this does not prove that the distribution is normal - it only says that we do not have enough evidence to conclude that the deviation from normality is non-random. However, a standard practice with this kind of test is to be optimistic - and assume that the distribution is in fact normal if the p-value is large enough (say, p > 0.05).

In short, if the Shapiro-Wilk test shows a low p-value, then we cannot use methods which rely on the normality assumption (e.g. the t-test or ANOVA). If the p-value is large enough, then we may use them - and in practice we do, even though we cannot formally prove that this procedure is correct.
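
For example, the test can be applied to the sepal widths of each species defined above; shapiro.test() is available in base R, and a large p-value lets us proceed with t-tests or ANOVA in the sense described above.

# Shapiro-Wilk normality test for the sepal width of each species
shapiro.test(setosa$Sepal.Width)
shapiro.test(versicolor$Sepal.Width)
shapiro.test(virginica$Sepal.Width)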

sp.len.sp.width

Both correlations are visible. However, there are some slightly atypical observations in virginica which could influence the results (numbers 18 and 32):

virginica$id <- 1:50
ggplot(data=virginica, aes(x=Sepal.Width, y=Sepal.Length)) + geom_text(aes(label=id))

Without observations 18 and 32 we no longer see the correlation. However, according to the results of the Grubbs test, we cannot be certain that these observations come from a different distribution (i.e. that they are outliers). So we should either stick with the conclusion that there is some non-zero correlation, or - better yet - collect more data.
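
A sketch of how such a check might look is given below; using grubbs.test() from the outliers package on Sepal.Length, and cor.test() on the data without observations 18 and 32, is an assumption about how the original task was carried out.

# Assumed workflow: test the most extreme Sepal.Length value for being an
# outlier, then recompute the correlation without observations 18 and 32.
library(outliers)                              # provides grubbs.test()
grubbs.test(virginica$Sepal.Length)
cor.test(virginica$Sepal.Width[-c(18, 32)],
         virginica$Sepal.Length[-c(18, 32)])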