Clustering

We will now cover the basics of clustering. The goal of clustering is to group observations. Suppose we are investigating the Iris dataset, but we don’t know the species - only the measurements. We want to partition the observations into groups that correspond to the different species.

The basic clustering algorithm is k-means. Its basics were covered at the beginning of the tutorial. We’ll see how to use this algorithm on the Iris dataset.
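The plotting commands below assume that the iris data frame has been augmented with k-means assignments. A minimal sketch of how such columns could be created is shown here; the choice of features and the column names cluster and cluster2 are assumptions, not part of the original exercise:

library(ggplot2)
set.seed(1)
# k-means with 3 clusters on the petal measurements only; stored as a factor for the shape aesthetic
iris$cluster <- as.factor(kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)$cluster)
# k-means on all four measurements, for comparison (assumed meaning of cluster2)
iris$cluster2 <- as.factor(kmeans(iris[, 1:4], centers = 3)$cluster)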

Your tasks:

ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, col=Species, shape=cluster), size=2)

ggplot(data=iris) + geom_point(aes(x=Petal.Length, y=Petal.Width, col=Species, shape=cluster2), size=2)

The Dreadful Mouse

The Mouse is a type of distribution that is designed to look cute, sound cute, and completely destroy the k-means algorithm. It’s a mixture of three Normal distributions on a 2-dimensional space. The first distribution is centered at the origin; this is the head of the mouse. The other two are centered at the points (-1, 1) and (1, 1); these are the ears. The ears have a common standard deviation, which is smaller than the standard deviation of the head. A simulated point belongs to the head with a given probability. Formally, let \(M\) be an observation from the Mouse, and \(f_M (x, y)\) its 2-dimensional density. Let \(H\) be a random point from the head, and \(E_1\) and \(E_2\) random points from the ears. Then we have

\[H \sim \mathcal{N}((0, 0), \sigma^2_H I),\] \[E_1 \sim \mathcal{N}((-1, 1), \sigma^2_E I),\] \[E_2 \sim \mathcal{N}((1, 1), \sigma^2_E I).\]

The density of a random point from the Mouse is given by

\[f_M (x, y) = p\, f_H (x, y) + \frac{1-p}{2} f_{E_1} (x, y) + \frac{1-p}{2} f_{E_2} (x, y),\] where \(p\) is the probability that a given point belongs to the head.

Your tasks:

mouse.data <- mouse(10000, c(0.4, 0.2), 0.8)
ggplot(data=mouse.data) + geom_point(aes(x=x, y=y, col=d))
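The mouse() function used above is not a built-in; it is presumably written as part of the tasks. One possible simulator, consistent with the call mouse(10000, c(0.4, 0.2), 0.8) and with the d column used for colouring (the argument order and column names are assumptions), is:

# Sketch of a Mouse simulator: n points, sds = c(sd of head, sd of ears), p = P(head)
mouse <- function(n, sds, p) {
  # Assign each point to the head or one of the ears
  d <- sample(c("head", "left ear", "right ear"), size = n, replace = TRUE,
              prob = c(p, (1 - p) / 2, (1 - p) / 2))
  # Component centres and standard deviations
  centers <- rbind("head" = c(0, 0), "left ear" = c(-1, 1), "right ear" = c(1, 1))
  sds.all <- c("head" = sds[1], "left ear" = sds[2], "right ear" = sds[2])
  # Draw the coordinates from the corresponding Normal distributions
  data.frame(x = rnorm(n, mean = centers[d, 1], sd = sds.all[d]),
             y = rnorm(n, mean = centers[d, 2], sd = sds.all[d]),
             d = as.factor(d))
}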

The Mouse is an artificial dataset, but it illustrates a very common problem. The k-means algorithm tends to return clusters of roughly equal size, but such clusters are very rare in reality. This sometimes leads to very wrong conclusions. I once read a paper, published in a good journal, in which the authors were looking for horizontally transferred genes in the bacterium E. coli. Such genes are usually very rare, but they are often responsible for the bacteria’s ability to cause disease, so they’re very important. The authors reported stunning results: as much as 1/3 of the genes of this bacterium were horizontally transferred! The only problem is, to detect them, they used k-means with three clusters.


The EM algorithm for Gaussian Mixtures

The EM algorithm is the basis for a clustering method that can handle difficult distributions like the Mouse. The theory behind EM and its application to clustering were presented in the lecture and will be briefly reviewed at the whiteboard.
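As a brief reminder, for a Gaussian mixture the EM algorithm alternates two steps (these are the standard formulas, not specific to the EMCluster implementation). The E-step computes, for each observation \(x_i\) and cluster \(k\), the posterior probability (responsibility)

\[\gamma_{ik} = \frac{\pi_k \, \phi(x_i; \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, \phi(x_i; \mu_j, \Sigma_j)},\]

and the M-step updates the parameters using these weights:

\[\pi_k = \frac{1}{n}\sum_{i} \gamma_{ik}, \quad \mu_k = \frac{\sum_{i} \gamma_{ik} x_i}{\sum_{i} \gamma_{ik}}, \quad \Sigma_k = \frac{\sum_{i} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i} \gamma_{ik}}.\]

Here \(\phi(\cdot\,; \mu, \Sigma)\) denotes the multivariate Normal density; the two steps are repeated until the likelihood stops improving.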

We will apply EM to cluster The Mouse and the Iris dataset. First, install the EMCluster package and load it.

install.packages("EMCluster")
library(EMCluster)

The main function is init.EM(). It accepts the data in matrix format and the number of desired clusters. However, for educational purposes, we will use a more basic function: emcluster(). Instead of the number of clusters, it accepts a set of initial parameters: mixing proportions, cluster centers, and the covariance matrices of the clusters. The definition of a covariance matrix will be briefly reviewed at the whiteboard.

We will estimate the initial parameters from the results of the k-means clustering.

Your tasks:
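For reference, one possible way to compute these initial estimates from a k-means fit of the Mouse data is sketched below; the variable names cluster.props, cluster.means and cluster.covs are chosen to match the emcluster() call that follows, and the three-column lower-triangular layout (var(x), cov(x, y), var(y)) matches the LTSigma output printed later:

# k-means on the Mouse data as a starting point
mouse.km <- kmeans(mouse.data[, 1:2], centers = 3)

# Mixing proportions: fraction of points in each k-means cluster
cluster.props <- mouse.km$size / nrow(mouse.data)

# Cluster centres: one row per cluster
cluster.means <- mouse.km$centers

# Covariances in lower-triangular form: (var(x), cov(x, y), var(y)) for each cluster
cluster.covs <- t(sapply(1:3, function(k) {
  S <- cov(mouse.data[mouse.km$cluster == k, 1:2])
  c(S[1, 1], S[2, 1], S[2, 2])
}))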

After we get our estimates, we can run the EM clustering with this command:

#install.packages("EMCluster")
library(EMCluster)
## Loading required package: MASS
## Loading required package: Matrix
mouse.EM <- emcluster(mouse.data[,1:2], pi=cluster.props, Mu=cluster.means, LTSigma=cluster.covs)

A simpler way to do all of the above would be to type init.EM(mouse.data[,1:2], 3). However, it would take considerably longer to run.

We can now inspect the results. The proportions:

mouse.EM$pi
## [1] 0.6747145 0.1688440 0.1564415

Close enough to the true values: \(0.8\) for the head and \(0.1\) for each ear. The cluster centers:

mouse.EM$Mu
##             [,1]       [,2]
## [1,] -0.01036052 -0.0974915
## [2,]  0.76865921  0.7757084
## [3,] -0.78565987  0.8048968

Also close to the true centers \((0, 0)\), \((1, 1)\) and \((-1, 1)\). The covariances:

mouse.EM$LTSigma
##           [,1]         [,2]      [,3]
## [1,] 0.1354137 -0.001686466 0.1254527
## [2,] 0.1369871  0.066176761 0.1221919
## [3,] 0.1276636 -0.060954399 0.1187768

These are also roughly in line with the true values: variance \(0.16\) for the head and \(0.04\) for the ears in both the \(x\) and \(y\) coordinates, and zero covariances.

We can now assign the observations to clusters by picking, for each observation, the cluster with the highest likelihood. Ask me if you’re not sure what that means. The function assign.class() will do the assignment for us. It accepts the data matrix and the result of the EM clustering.

mouse.data$EMclstr <- assign.class(mouse.data[,1:2], emobj=mouse.EM)$class

Finally, we can plot the results of the EM clustering and compare them with k-means. Before plotting, convert the EMclstr column to a factor.

mouse.data$EMclstr <- as.factor(mouse.data$EMclstr)
ggplot(data=mouse.data) + geom_point(aes(x=x, y=y, col=EMclstr))

Much better!