We will now cover the basics of the clustering technique. The goal of clustering is to group observations. Assume we're investigating the Iris dataset, but we don't know the species, only their measurements. We want to group the observations into clusters corresponding to the different species.
The most basic clustering algorithm is k-means. Its basics were covered at the beginning of the tutorial. We'll see how to apply this algorithm to the Iris dataset.
Your tasks:
Load the iris dataset by typing data(iris). Visualize the petal widths and lengths of the different species using the ggplot() and geom_point() functions.
Use the k-means method to cluster the observations into three groups, based on their petal lengths and widths (columns 3 and 4 in the iris table). Append the inferred cluster numbers to the iris data frame. Visualize the results.
ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, col=Species, shape=cluster), size=2)
ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, col=Species, shape=cluster2), size=2)
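One possible solution for the k-means task is sketched below; the column names cluster and cluster2 match the plotting calls above and assume two separate runs of kmeans() (since k-means starts from random centers, two runs can disagree):

```r
library(ggplot2)
data(iris)

set.seed(1)  # k-means starts from random centers, so fix the seed for reproducibility
km <- kmeans(iris[, 3:4], centers = 3)  # cluster on petal length and width

# Append the cluster assignments as a factor so ggplot treats them as discrete
iris$cluster <- as.factor(km$cluster)

# A second run may converge to a different labeling
km2 <- kmeans(iris[, 3:4], centers = 3)
iris$cluster2 <- as.factor(km2$cluster)
```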
The Mouse is a distribution designed to look cute, sound cute, and completely destroy the k-means algorithm. It's a mixture of three Normal distributions on a 2-dimensional space. The first distribution is centered at the origin. This is the head of the mouse. The other two are centered at points (-1, 1) and (1, 1). These are the ears. The ears have a common standard deviation, which is smaller than the standard deviation of the head. A simulated point belongs to the head with a given probability. Formally, let \(M\) be an observation from the mouse, and \(f_M (x, y)\) its 2-dimensional density. Let \(H\) be a random point from the head, and \(E_1\) and \(E_2\) from the ears. Then, we have
\[H \sim \mathcal{N}((0, 0), \sigma^2_H I),\] \[E_1 \sim \mathcal{N}((-1, 1), \sigma^2_E I),\] \[E_2 \sim \mathcal{N}((1, 1), \sigma^2_E I).\]
The density of a random point from the Mouse is given by
\[f_M (x, y) = p f_H (x, y) + \frac{1-p}{2} f_{E_1} (x, y) + \frac{1-p}{2} f_{E_2} (x, y),\] where \(p\) is the probability that a given point belongs to the head.
Your tasks:
Write a function mouse(N, sds, prop) to simulate observations from the Mouse distribution. In the function call, N is the number of observations, sds is a vector of standard deviations (length 2), and prop is the probability of the head. The function should return a data frame with coordinates and distribution (head/ear1/ear2).
Simulate 10 000 observations with chosen parameters and plot them on a 2-dimensional scatter plot.
mouse.data <- mouse(10000, c(0.4, 0.2), 0.8)
ggplot(data=mouse.data) + geom_point(aes(x=x, y=y, col=d))
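One way to implement the simulator is sketched below. The signature and the returned columns (x, y, d) follow the task description and the plotting call above; the internal names cx, cy, and s are my own:

```r
# Simulate N points from the Mouse: a mixture of three bivariate Normals.
# sds = c(sd_head, sd_ears), prop = probability that a point belongs to the head.
mouse <- function(N, sds, prop) {
  # Draw the component (head/ear1/ear2) for each observation
  d <- sample(c("head", "ear1", "ear2"), size = N, replace = TRUE,
              prob = c(prop, (1 - prop) / 2, (1 - prop) / 2))
  # Component centers: head at (0, 0), ears at (-1, 1) and (1, 1)
  cx <- c(head = 0, ear1 = -1, ear2 = 1)
  cy <- c(head = 0, ear1 = 1, ear2 = 1)
  # The head uses sds[1]; both ears share sds[2]
  s <- ifelse(d == "head", sds[1], sds[2])
  data.frame(x = rnorm(N, mean = cx[d], sd = s),
             y = rnorm(N, mean = cy[d], sd = s),
             d = factor(d))
}
```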
The Mouse is an artificial dataset, but it illustrates a very common problem. The k-means algorithm always tends to return equally sized clusters, but such clusters are very rare in reality. This sometimes leads to very wrong conclusions. I have once read a paper published in a good journal, in which the authors were looking for horizontally transferred genes in the bacterium E. coli. Such genes are usually very rare, but they are often responsible for the bacteria's ability to cause disease, so they're very important. The authors reported stunning results: as much as 1/3 of the genes of this bacterium were horizontally transferred! The only problem is, to detect them, they used k-means with three clusters…
The EM algorithm is the basis for a clustering method which can handle difficult distributions like The Mouse. The theory behind EM and its application to clustering were shown in the lecture and will be briefly reviewed at the whiteboard.
We will apply EM to cluster The Mouse and the Iris dataset. First, install the EMCluster package and load it.
install.packages("EMCluster")
library(EMCluster)
The main function is init.EM(). It accepts the data in matrix format and the number of desired clusters. However, for educational purposes, we will use a more basic function: emcluster(). Instead of the number of clusters, it accepts the initial parameters: mixing proportions, centers of clusters, and covariance matrices. The definition of a covariance matrix will be briefly reviewed at the whiteboard.
We will estimate the initial parameters from the results of the k-means clustering.
Your tasks:
Estimate the proportions of observations belonging to the different clusters of The Mouse data.
Use the function mean() to estimate the centers of the clusters. Return your results in a 3x2 matrix.
Use the function cov(m) to estimate the covariance matrix from data in matrix m. Estimate the covariance matrices within the clusters. Next, take the lower triangular part of each matrix using the function lower.tri(m, diag=TRUE) and write it as a vector. Stack those vectors row by row into a 3x3 matrix. You can use a for loop in this task. Ask me if you have no idea how to do this point.
lower.tri returns a boolean matrix. Writing m[lower.tri(m, diag=TRUE)] will yield a vector.
After we get our estimates, we can run the EM clustering by this command:
#install.packages("EMCluster")
library(EMCluster)
## Loading required package: MASS
## Loading required package: Matrix
mouse.EM <- emcluster(mouse.data[,1:2], pi=cluster.props, Mu=cluster.means, LTSigma=cluster.covs)
A simpler way to do just the above would be typing init.EM(mouse.data[,1:2], 3). However, it would take considerably longer.
We can now inspect the results. The proportions:
mouse.EM$pi
## [1] 0.6747145 0.1688440 0.1564415
Close enough to \((0.8, 0.1, 0.1)\), the true values. The cluster centers:
mouse.EM$Mu
## [,1] [,2]
## [1,] -0.01036052 -0.0974915
## [2,] 0.76865921 0.7757084
## [3,] -0.78565987 0.8048968
Also close to \((0, 0), (1, 1), (-1, 1)\), the true values. The covariances:
mouse.EM$LTSigma
## [,1] [,2] [,3]
## [1,] 0.1354137 -0.001686466 0.1254527
## [2,] 0.1369871 0.066176761 0.1221919
## [3,] 0.1276636 -0.060954399 0.1187768
That's also close to the true values: variance \(0.16\) for the head and \(0.04\) for the ears at the \(x\) and \(y\) coordinates, and zero covariances.
We can now assign the observations to clusters by picking, for each observation, the cluster with the highest likelihood. Ask me if you're not sure what that means. The function assign.class() will do the assignment for us. It accepts the data matrix and the result of the EM clustering.
mouse.data$EMclstr <- assign.class(mouse.data[,1:2], emobj=mouse.EM)$class
Finally, we can plot the results of the EM clustering and compare them with k-means. Before plotting, convert the EMclstr column to a factor.
mouse.data$EMclstr <- as.factor(mouse.data$EMclstr)
ggplot(data=mouse.data) + geom_point(aes(x=x, y=y, col=EMclstr))
Much better!