Understanding data

Suppose we’re measuring some quantity, like the height of a person. Before we take the measurement, we can only make a reasonable guess about its value - that it will probably be somewhere between 140 and 200 cm. We model this uncertainty by assuming that the height is a random variable \(X: \Omega \rightarrow \mathbb{R}_+\) with some probability distribution, e.g. normal.

When we take this measurement, we perform a realization of \(X\) - that is, we “draw” \(\omega \in \Omega\) and substitute it to get \(X(\omega)\), which is now a number, e.g. 178.2 cm. Performing a statistical experiment - a measurement, or a simulation - is nothing else but fixing \(\omega\) in our random variables.

We usually assume that each measurement is a realization of an independent random variable. Therefore, our data consists of a set of realizations of random variables, \(X_1 (\omega), X_2 (\omega), \dots, X_n(\omega)\). We usually assume that we know something about the probability distributions of \(X_1, \dots, X_n\), and that the distributions depend on some unknown parameters like the mean or variance. We say that the distributions of \(X_i\) belong to some family of distributions (for example, the family of all normal distributions with variance equal to 0.1). These assumptions are the base of all statistical reasoning, like parameter estimation or statistical tests.

The distributions of our random variables can be either assumed by looking at the data, or derived from some assumptions about the random process which generates them. In case of height, we can use both approaches. First, if we measure a lot of people, we will notice that the histogram resembles the normal distribution. Second, we know that height is determined by lots of factors (in fact, over 180 genes and numerous environmental conditions). Quantities which depend on many independent factors usually follow a normal distribution because of the Central Limit Theorem.
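
As a quick illustration of the last point, the following simulation (with arbitrary numbers) treats height as a baseline plus a sum of many small independent contributions; the histogram of the simulated values is roughly bell-shaped.

# Illustrative simulation: each "height" is 160 cm plus a sum of 200 small
# independent contributions; the histogram looks approximately normal.
set.seed(1)                                  # for reproducibility
heights <- 160 + replicate(1000, sum(runif(200, -1, 1)))
hist(heights, breaks = 30)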

Your tasks:

Data frames

We will now cover basics of data exploration in R. In most applications, we will keep our actual data in data frames. Data frames are simply tables with data. Load and inspect an example data frame, called iris, by typing

data(iris)
View(iris)
iris[1:4, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa

The iris dataset contains information about certain measurements of several species of the Iris flower: Iris virginica, Iris versicolor and Iris setosa.

Columns of data frames are named. We can refer to a column either with square brackets, as in matrices, or by its name. For example, typing iris$Species is the same as iris[, 5].

iris[1:10,5]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Data frames are created as follows:

example.frame <- data.frame("Name1" = rep(1:10, 2), "Name2" = rep(c("A", "B"), each=10))
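
To quickly check what we created, we can inspect the structure and the first rows (depending on the R version, Name2 will be stored as a character vector or as a factor):

str(example.frame)    # 20 rows, two columns
head(example.frame)   # first six rows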

Your tasks:

two.distributions
##    Name       Value
## 1     A -2.68133300
## 2     A  1.80512134
## 3     A -0.15617911
## 4     A -0.03401183
## 5     A  6.38742729
## 6     A  2.25297318
## 7     A  1.53619089
## 8     A -3.68275074
## 9     A  0.29859534
## 10    A  8.93306519
## 11    B -2.98121368
## 12    B  3.00001908
## 13    B -1.55908853
## 14    B -1.24934815
## 15    B -2.99439260
## 16    B  0.81269739
## 17    B -1.12969737
## 18    B -4.15613732
## 19    B  3.75391770
## 20    B -4.17590456
empirical.means
##         A         B 
##  1.465910 -1.067915
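
The two.distributions data frame and the empirical.means vector shown above are produced in the tasks. One plausible construction (an assumption - the distributions used here, normal with standard deviation 3 and means -0.1 and 0.1, are borrowed from the examples further below, so your numbers will differ) is:

# Assumed construction: 10 observations from N(-0.1, 3) labelled "A"
# and 10 observations from N(0.1, 3) labelled "B".
two.distributions <- data.frame("Name"  = rep(c("A", "B"), each = 10),
                                "Value" = c(rnorm(10, -0.1, 3), rnorm(10, 0.1, 3)))
# Empirical mean of each group
empirical.means <- tapply(two.distributions$Value, two.distributions$Name, mean)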

The ggplot2 library

We will now cover the ggplot2 library more extensively. We have already used the function qplot(), a simplified interface for quick plotting (qplot stands for quick plot). A more versatile function is ggplot().

To create a plot using ggplot(), we need to store our data in a data frame. By typing ggplot(some_data_frame), we create a basic plot object, which can be assigned to a variable. To get an actual plot, we need to add a function specifying the plot’s geometry (histogram/line plot/bar plot etc.). For example, to create a histogram of petal width from the iris dataframe, type:

library(ggplot2)   # if not already loaded
iris.plot <- ggplot(data = iris)
iris.plot + geom_histogram(aes(x=Petal.Width))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The aes parameter specifies the aesthetics, i.e. the variables mapped to the plot’s axes. By default, geom_histogram accepts the name of the variable plotted on the x axis and plots the counts on the y axis. Additionally, we can map a variable to the fill colour by specifying the fill aesthetic, and set the bin width of the histogram:

iris.plot + geom_histogram(aes(x=Petal.Width, fill=Species), binwidth=0.1)

To create a scatter plot showing the dependence of petal width on petal length, we use the geom_point() function like this:

iris.plot + geom_point(aes(x=Petal.Length, y=Petal.Width, colour=Species))

It is important to remember to pass the names of the variables to the aes() function without quotation marks. Compare the previous plot with the following one:

ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, colour="Species"))

Your tasks:

two.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
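
The two.plot object printed above comes from the tasks; one plausible definition (an assumption, following the iris example) is a histogram of Value with the fill colour mapped to Name:

# Assumed definition of two.plot: a histogram of both samples, filled by group
two.plot <- ggplot(data = two.distributions) +
  geom_histogram(aes(x = Value, fill = Name))
two.plot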

Statistical tests

In one of the previous examples, the empirical means of the two distributions may have seemed a bit odd. In fact, sometimes the empirical mean of observations from distribution B is smaller than the empirical mean of observations from distribution A, even though the true mean of B is greater than the true mean of A.

Suppose that we don’t know the true means of those distributions, and we want to compare them to check if the true means are different. For example, suppose that those distributions measure the effectiveness of two medical treatments. How can we check if treatment B is really better than treatment A?

We already know that the empirical means are different. This difference is caused either by actual difference in means, or by random noise. To check if a difference is significant, i.e. non-random, we need to run a statistical test.

We will use the two-sample t-test to check whether we can assume that the difference in empirical means is non-random. The t-test compares the difference of the sample means, normalized by its estimated standard error. The test statistic for two samples \(X_1, \dots, X_n\) and \(Y_1, \dots, Y_m\) is

\[T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}},\]

where \(S_X^2\) and \(S_Y^2\) are the estimators of the variances of the two samples. If the variables are normally distributed, the \(T\) statistic follows (approximately) the Student’s t distribution; for the equal-variance version of the test the number of degrees of freedom is \(n+m-2\), while for the unpooled statistic above R’s t.test() uses the Welch-Satterthwaite approximation. If the variables have a different distribution, then for large samples the Central Limit Theorem guarantees that the distribution of \(T\) is approximately standard normal, and hence close to the Student’s t distribution.

If the computed \(T\) statistic is large in absolute value, we can be fairly confident that the means are different. To check whether the observed value is large enough, we compute the probability that, under the null hypothesis of equal means, the absolute value of the \(T\) statistic exceeds the observed value. This probability is called the p-value.

In our case, since the density of the Student’s t distribution is symmetric, the p-value is equal to the probability

\[ \mathbb{P}(T > |T(\omega)|) + \mathbb{P}(T < -|T(\omega)|) = 2\,\mathbb{P}(T < -|T(\omega)|),\]

where \(T(\omega)\) is our observed value. The values of the cumulative distribution function of the Student’s t distribution can be computed numerically (in R, with the pt() function).
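
The function my.t.test used below is written as part of the tasks. A minimal sketch along the lines of the formula above might look as follows; the choice of \(n+m-2\) degrees of freedom here is an assumption, so the resulting p-value may differ slightly from the output below and from R’s Welch-corrected t.test().

# Sketch of a hand-made two-sample t-test, following the formula above.
# The degrees of freedom (n + m - 2) are an assumption; t.test() uses the
# Welch-Satterthwaite approximation instead.
my.t.test <- function(x, y) {
  n <- length(x); m <- length(y)
  t.stat <- (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m)
  2 * pt(-abs(t.stat), df = n + m - 2)   # two-sided p-value
}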

Your tasks:

distr.A <- two.distributions$Value[two.distributions$Name == "A"]
distr.B <- two.distributions$Value[two.distributions$Name == "B"]
paste("My p-value:", my.t.test(distr.A, distr.B))
## [1] "My p-value: 0.106185245293114"
t.test(distr.A, distr.B)
## 
##  Welch Two Sample t-test
## 
## data:  distr.A and distr.B
## t = 1.6961, df = 16.503, p-value = 0.1086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6252438  5.6928931
## sample estimates:
## mean of x mean of y 
##  1.465910 -1.067915

The p-value is large. This means that, under the assumption of equal true means, there is a large probability of observing a difference in empirical means at least as big as the one we got. In other words, our difference is quite typical, so we cannot conclude that treatments A and B give different effects - the difference in empirical means can be explained by random noise alone.

t.test(rnorm(1000, -0.1, 3), rnorm(1000, 0.1, 3))
## 
##  Welch Two Sample t-test
## 
## data:  rnorm(1000, -0.1, 3) and rnorm(1000, 0.1, 3)
## t = -0.81239, df = 1997.6, p-value = 0.4167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3728548  0.1544311
## sample estimates:
##    mean of x    mean of y 
## -0.108880844  0.000330997
t.test(rnorm(100000, -0.1, 3), rnorm(100000, 0.1, 3))
## 
##  Welch Two Sample t-test
## 
## data:  rnorm(1e+05, -0.1, 3) and rnorm(1e+05, 0.1, 3)
## t = -16.029, df = 2e+05, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2412697 -0.1886963
## sample estimates:
##  mean of x  mean of y 
## -0.1114743  0.1035087

Due to the large variance relative to the difference in means, we need around 100 000 observations per group to get a p-value small enough to conclude that there is a difference between the treatments.
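
One way to quantify this is a power calculation with power.t.test(): with a true mean difference of 0.2 and a standard deviation of 3, samples of 1000 per group detect the difference only about a third of the time, while samples of 100 000 per group detect it essentially always.

# Power of the two-sample t-test for a mean difference of 0.2 with sd = 3
power.t.test(n = 1000,   delta = 0.2, sd = 3)$power   # roughly 0.3
power.t.test(n = 100000, delta = 0.2, sd = 3)$power   # essentially 1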

ggplot(data=data.frame("Name" = rep(c("A", "B"), each=10000), "Value" = c(rnorm(10000, -0.1, 3), rnorm(10000, 0.1, 3)))) + geom_histogram(aes(x=Value, fill=Name), binwidth=0.1)

Homework:

Analysis of the iris dataset

We will now use several statistical tests implemented in the R language to analyze the iris dataset.

full.plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Your tasks:

setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]
ggplot(data=iris, aes(x=Sepal.Width, fill=Species)) + geom_histogram(binwidth=0.1)

The null hypothesis in the Shapiro-Wilk test is that the sample follows a normal distribution. If the p-value is large enough, we do not reject the null hypothesis. As in every statistical test, this does not prove that the distribution is normal - it only says that we do not have enough evidence to conclude that the deviation from normality is non-random. However, a standard practice with this kind of test is to be optimistic - and assume that the distribution is in fact normal if the p-value is large enough (say, p > 0.05).

In short, if the Shapiro-Wilk test shows a low p-value, then we cannot use methods which rely on the normality assumption (e.g. the t-test or ANOVA). If the p-value is large enough, then we may use them - and in practice we do, even though we cannot formally prove that this procedure is correct.
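
For example, the test can be applied to the sepal widths of each species defined above; shapiro.test() is available in base R, and a large p-value lets us proceed with t-tests or ANOVA in the sense described above.

# Shapiro-Wilk normality test for the sepal width of each species
shapiro.test(setosa$Sepal.Width)
shapiro.test(versicolor$Sepal.Width)
shapiro.test(virginica$Sepal.Width)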

sp.len.sp.width

Both correlations are visible. However, there are some slightly atypical observations in virginica which could influence the results (numbers 18 and 32):

virginica$id <- 1:50
ggplot(data=virginica, aes(x=Sepal.Width, y=Sepal.Length)) + geom_text(aes(label=id))

Without observations 18 and 32 we no longer see the correlation. However, according to the results of the Grubbs test, we cannot be certain that these observations come from a different distribution (i.e. that they are outliers). So we should either stick with the conclusion that there is some non-zero correlation, or - better yet - collect more data.
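
A sketch of how such a check might look is given below; using grubbs.test() from the outliers package on Sepal.Length, and cor.test() on the data without observations 18 and 32, is an assumption about how the original task was carried out.

# Assumed workflow: test the most extreme Sepal.Length value for being an
# outlier, then recompute the correlation without observations 18 and 32.
library(outliers)                              # provides grubbs.test()
grubbs.test(virginica$Sepal.Length)
cor.test(virginica$Sepal.Width[-c(18, 32)],
         virginica$Sepal.Length[-c(18, 32)])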