Monte Carlo approach to Chocolate

Today we will use more advanced techniques to analyze the chocolate dataset. Recall that in the previous exercises (lab6), we assumed a \(Beta(a, b)\) prior on the probability that a chocolate bar is tasty. We parameterized this distribution with \(n=a+b\) and \(m = a/(a+b)\), i.e. the number of chocolate bars in our prior knowledge and the assumed average tastiness. In this parametrization, the prior was \(Beta(mn, (1-m)n)\). The parameter \(n\) represented the strength of our prior knowledge - in other words, it was equal to the number of chocolate bars that we would need to inspect in order to believe the data and our prior knowledge equally.

We assumed that \(90\%\) of bars are tasty, and that one would need 20 bars to change one’s beliefs. This led to \(m=0.9\) and \(n=20\). In reality, we are not exactly sure if this should be \(n=20\) - maybe it’s \(21\) or \(18\)? In fact, it’s reasonable to assume that \(m\) and \(n\) are random as well. This leads us to the notion of hierarchical models, which have multiple layers of randomness.
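For concreteness, the reparametrization is easy to check in a couple of lines of base R; with \(m=0.9\) and \(n=20\) the prior is simply \(Beta(18, 2)\):

m <- 0.9; n <- 20
a <- m * n         # shape1 = 18
b <- (1 - m) * n   # shape2 = 2
curve(dbeta(x, a, b), from = 0, to = 1, xlab = "theta", ylab = "prior density")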

As before, let \(X\) be the number of tasty chocolate bars among the \(N\) tested ones, and assume that \(X \sim Bin(N, \theta)\) and \(\theta \sim Beta(nm, n(1-m))\). This time, however, we’ll assume that \(n\) and \(m\) are random. It’s reasonable to assume that \(n\) comes from a Gamma distribution (which is a bit similar to the Poisson distribution, but continuous), and \(m\) from a Beta distribution. The full specification of our model is as follows:

\[\begin{align} X & \sim Bin(N, \theta), \\ \theta & \sim Beta(nm, n(1-m)), \\ n & \sim Gamma(s, r), \\ m & \sim Beta(\mu, \nu). \end{align}\]

The deterministic variables \(s, r, \mu\) and \(\nu\) are called hyperparameters, because they influence the parameter \(\theta\) only indirectly, through its prior. The values of the hyperparameters represent our prior knowledge. Our goal is to infer the posterior distributions of \(\theta, n\) and \(m\). Deriving these posteriors analytically is, however, usually impossible. Instead, we’ll use the Monte Carlo approach to simulate the random parameters from the posterior distribution.

There are many methods to sample from posterior distributions, the most important being the Metropolis-Hastings algorithm and the Gibbs sampler. The latter often performs better, but it is also often much more difficult to apply. Therefore, we’ll start with Metropolis-Hastings (MH).

The theory behind the MH algorithm has been covered during the lecture, and I assume you more or less know the idea behind it. This algorithm, in its basic form, is very simple, and it’s a very good exercise to implement it on your own. However, we won’t do that here; instead, we’ll focus on learning STAN, a programming language for hierarchical Bayesian modelling. STAN has interfaces for many popular languages, including R and Python. Here, we’ll focus on the R interface; you can use Python if you prefer, but you’ll have to install it by yourself. Apart from installation, using STAN is essentially the same in all languages.
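If you do want to try implementing it, the sketch below shows what a bare-bones random-walk MH sampler for the simple Beta-Binomial model might look like (the function names, starting value and step size are arbitrary choices of this sketch, not part of the lab):

# Log of the unnormalized posterior: Binomial likelihood times Beta(n*m, n*(1-m)) prior
log_post <- function(theta, X, N, n, m) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(X, N, theta, log = TRUE) + dbeta(theta, n * m, n * (1 - m), log = TRUE)
}

# Random-walk Metropolis-Hastings: propose theta + Normal(0, step), accept with probability min(1, ratio)
mh_sample <- function(iter, X, N, n, m, step = 0.02) {
  theta <- numeric(iter)
  theta[1] <- 0.5
  for (i in 2:iter) {
    proposal <- theta[i - 1] + rnorm(1, 0, step)
    log_ratio <- log_post(proposal, X, N, n, m) - log_post(theta[i - 1], X, N, n, m)
    theta[i] <- if (log(runif(1)) < log_ratio) proposal else theta[i - 1]
  }
  theta
}

After discarding an initial warm-up portion, the remaining draws play the same role as the samples we will obtain from STAN below.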

STAN

Adapted from this tutorial. Full documentation of the STAN language is here.

First, install the rstan package. STAN is written in C++, so installation may take a while because the code needs to be compiled.

install.packages('rstan')

Load the installed package. We’ll also load some additional packages and enable multithreading.

library(ggplot2); library(rstan); library(reshape2)
## Loading required package: StanHeaders
## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

A Stan model is composed of code blocks, each responsible for a specific task. The data, parameters and model blocks appear in essentially every Stan program (even if they are empty):

  1. functions, where we can define custom functions (e.g. random number generators) to be used in the blocks below; we won’t need this block in this tutorial.
  2. data, where we declare the data to be used for the model.
  3. transformed data, where we make transformations of the data passed in above.
  4. parameters, where we define the unknowns to be estimated, including any restrictions on their values.
  5. transformed parameters, where we define transformations of the parameters and data declared above, when it is more convenient to work with such transformations.
  6. model, where the full probability model is defined.
  7. generated quantities, which generates a range of outputs from the model (posterior predictions, forecasts, values of loss functions, etc.).

First, the model, composed of the above blocks, needs to be written as a string. Below is an example of a model written in the STAN language; this is the model from the previous tutorial.

chocolate.model <- "
data {
  // In this section, we define the data that must be passed to Stan (from whichever environment you are using)

  int N; // number of tested chocolate bars
  int X; // number of tasty bars
  real n;
  real m; // hyperparameters n and m
}
parameters {
  // Define the parameters that we will estimate, as well as any restrictions on the parameter values (e.g. standard deviations can't be negative...)
  real<lower=0, upper=1> theta;
}
transformed parameters {
  vector[2] beta_pars;  // vector of parameters
  beta_pars[1] = n*m;
  beta_pars[2] = n*(1-m);
}
model {
  // This is where we write out the probability model, in very similar form to how we would using paper and pen

  // Define the priors
  theta ~ beta(beta_pars[1], beta_pars[2]); // a= n*m, b=n*(1-m)
  
  // The likelihood
  X ~ binomial(N, theta);
}
generated quantities {
  // All the parameters are simulated by default, so no need to write them here. We will also generate posterior predictive draws (yhat) for each data point. These will be elaborated on below. 
  
  int y_sim;
  y_sim = binomial_rng(N, theta);
}"

Now, load the data and set the prior hyperparameters.

library(readr)
chocolate <- read_csv("chocolate.csv")
## Parsed with column specification:
## cols(
##   Company = col_character(),
##   `Specific-Bean-Origin-or-Bar-Name` = col_character(),
##   REF = col_integer(),
##   `Review-Date` = col_integer(),
##   `Cocoa-Percent` = col_character(),
##   `Company-Location` = col_character(),
##   Rating = col_double(),
##   `Bean-Type` = col_character(),
##   `Broad-Bean-Origin` = col_character()
## )
chocolate$Good <- ifelse(chocolate$Rating > 2, "Yes", "No")
prior.n <- 20
prior.m <- 0.9

Specify the data list that we will pass to Stan. This gives Stan everything declared in the data{} block.

data_list <- list(N = nrow(chocolate), X = sum(chocolate$Good=="Yes"), n = prior.n, m = prior.m)

Call Stan. You’ll need to specify either model_code (like the one we defined above), a file (a .stan file containing the code), or a fitted Stan object (fit). You should also pass Stan a data list, the number of cores to estimate on, the number of Markov chains to run (4 by default) and the number of iterations (2000 by default).

We’ll run just one chain with 12,000 iterations. Here, an iteration is a step of the Markov chain, and each step produces one (correlated) draw from the posterior distribution. We will discard the first 2000 iterations. This is called burn-in or warm-up, and is used to let the chain converge to its stationary distribution.

The first time you run the model, it needs to compile the code; this may take some time.

chocolate.fit <- stan(model_code = chocolate.model, data = data_list, warmup=2000, chains = 1, iter = 12000)
## 
## SAMPLING FOR MODEL 'e5ec657c5fff606808dee71fdbc15482' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.1e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:     1 / 12000 [  0%]  (Warmup)
## Chain 1: Iteration:  1200 / 12000 [ 10%]  (Warmup)
## Chain 1: Iteration:  2001 / 12000 [ 16%]  (Sampling)
## Chain 1: Iteration:  3200 / 12000 [ 26%]  (Sampling)
## Chain 1: Iteration:  4400 / 12000 [ 36%]  (Sampling)
## Chain 1: Iteration:  5600 / 12000 [ 46%]  (Sampling)
## Chain 1: Iteration:  6800 / 12000 [ 56%]  (Sampling)
## Chain 1: Iteration:  8000 / 12000 [ 66%]  (Sampling)
## Chain 1: Iteration:  9200 / 12000 [ 76%]  (Sampling)
## Chain 1: Iteration: 10400 / 12000 [ 86%]  (Sampling)
## Chain 1: Iteration: 11600 / 12000 [ 96%]  (Sampling)
## Chain 1: Iteration: 12000 / 12000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.019209 seconds (Warm-up)
## Chain 1:                0.094483 seconds (Sampling)
## Chain 1:                0.113692 seconds (Total)
## Chain 1:

After STAN has run the simulation, we can extract the results.

thetas <- extract(chocolate.fit, "theta", permuted=F)
posteriors <- extract(chocolate.fit, "y_sim", permuted=F)

Finally we can visualize the posterior distribution of \(\theta\), and perform the posterior predictive check - i.e. visualize the distribution of the data under the posterior distribution of the parameter.

First, compare the simulation results with theoretical predictions.

qplot(c(thetas), binwidth=0.001, xlim=c(0.0,1))
## Warning: Removed 2 rows containing missing values (geom_bar).

qplot(seq(0, 1, by=0.001), dbeta(seq(0, 1, by=0.001), 
                                 shape1=prior.n*prior.m + sum(chocolate$Good=="Yes"), 
                                 shape2=prior.n*(1-prior.m) + nrow(chocolate) - sum(chocolate$Good=="Yes")),
      geom="line")

Next, compare the posterior data distributions.

qplot(c(posteriors), xlim=c(1600, 1800), binwidth=1, main="Posterior distribution from Stan Monte Carlo simulation")
## Warning: Removed 2 rows containing missing values (geom_bar).

pred.post.X <- rbinom(10000, nrow(chocolate), rbeta(10000, 
                                                    shape1=prior.n*prior.m + sum(chocolate$Good=="Yes"), 
                                                    shape2=prior.n*(1-prior.m) + nrow(chocolate) - sum(chocolate$Good=="Yes")))
qplot(pred.post.X, xlim=c(1600, 1800), binwidth=1, main="Posterior distribution from our own simulations")
## Warning: Removed 2 rows containing missing values (geom_bar).

Both plots look essentially the same! This suggests that our method works and that our STAN model was properly specified. Let’s now use STAN to simulate posterior distributions in our simple hierarchical model.

Simple hierarchical model

Recall that, at the beginning of this tutorial, we have discussed the following hierarchical model:

\[\begin{align} X & \sim Bin(N, \theta), \\ \theta & \sim Beta(nm, n(1-m)), \\ n & \sim Gamma(s, r), \\ m & \sim Beta(\mu, \nu). \end{align}\]

The only difference from the previous model is that now we want to estimate the posterior distributions of all three parameters: \(\theta, n\) and \(m\). As before, \(n\) measures the influence of \(m\) on \(\theta\), but this time these variables are random. Deriving analytical equations for the distributions of parameters was simple in the previous model, but would be very difficult now. This is where MCMC methods really come in handy.

Our model is parameterized by hyperparameters \(s, r, \mu\) and \(\nu\). Those hyperparameters, together with the total number of tasty chocolate bars and number of bars tested, will be the arguments of the STAN model.

To obtain easily interpretable parameters for \(Gamma(s, r)\), note that if \(Y \sim Gamma(s, r)\), then

\[\begin{align} \mathbb{E}Y &= s/r \\ Var Y & = s/r^2 \end{align}\]

It follows that

\[\begin{align} s & = (\mathbb{E} Y)^2 / Var Y \\ r & = \mathbb{E} Y / Var Y. \end{align}\]

Now, recall that \(n\) represents the strength of our prior knowledge (the number of chocolate bars we’d need to check to alter our beliefs). So, if we believe that our beliefs are worth about \(20 \pm 10\) chocolate bars, it’s reasonable to parametrize the prior by specifying

\[\begin{align} s & = 20^2 / 10^2 = 4 \\ r & = 20 / 10^2 = 0.2. \end{align}\]
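This conversion can be wrapped in a small helper function (the name gamma_pars is just a convenience introduced here, not something required by STAN):

# Moment matching: convert a prior mean and standard deviation of n into Gamma(s, r) parameters
gamma_pars <- function(mean, sd) c(s = mean^2 / sd^2, r = mean / sd^2)
gamma_pars(20, 10)   # s = 4, r = 0.2 (used in the fit below)
gamma_pars(10, 10)   # s = 1, r = 0.1 (used for the company model later on)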

Your task: write the hierarchical model described above in the STAN language (as chocolate.hierarchical.model), treating \(n\) and \(m\) as parameters with the Gamma and Beta priors specified above, and fit it to the chocolate data. One possible sketch is shown below; your own solution may differ.
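A minimal sketch, assuming the hyperparameters are passed in under the names used in the data_list below:

chocolate.hierarchical.model <- "
data {
  int N;    // number of tested chocolate bars
  int X;    // number of tasty bars
  real mu;
  real nu;  // hyperparameters of the prior on m
  real s;
  real r;   // hyperparameters of the prior on n
}
parameters {
  real<lower=0> n;
  real<lower=0, upper=1> m;
  real<lower=0, upper=1> theta;
}
model {
  // Hyperpriors
  n ~ gamma(s, r);
  m ~ beta(mu, nu);
  // Prior
  theta ~ beta(n*m, n*(1-m));
  // Likelihood
  X ~ binomial(N, theta);
}"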

data_list <- list(N = nrow(chocolate), X = sum(chocolate$Good=="Yes"), nu = 2, mu = 4, s=4, r=0.2)
chocolate.hierarchical.fit <- stan(model_code = chocolate.hierarchical.model, data = data_list, warmup=2000, chains = 1, iter = 12000)
## 
## SAMPLING FOR MODEL 'ffd1c75c1654e9d2121c3b4e93970b73' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:     1 / 12000 [  0%]  (Warmup)
## Chain 1: Iteration:  1200 / 12000 [ 10%]  (Warmup)
## Chain 1: Iteration:  2001 / 12000 [ 16%]  (Sampling)
## Chain 1: Iteration:  3200 / 12000 [ 26%]  (Sampling)
## Chain 1: Iteration:  4400 / 12000 [ 36%]  (Sampling)
## Chain 1: Iteration:  5600 / 12000 [ 46%]  (Sampling)
## Chain 1: Iteration:  6800 / 12000 [ 56%]  (Sampling)
## Chain 1: Iteration:  8000 / 12000 [ 66%]  (Sampling)
## Chain 1: Iteration:  9200 / 12000 [ 76%]  (Sampling)
## Chain 1: Iteration: 10400 / 12000 [ 86%]  (Sampling)
## Chain 1: Iteration: 11600 / 12000 [ 96%]  (Sampling)
## Chain 1: Iteration: 12000 / 12000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.049034 seconds (Warm-up)
## Chain 1:                0.263107 seconds (Sampling)
## Chain 1:                0.312141 seconds (Total)
## Chain 1:
hierarchical.thetas <- extract(chocolate.hierarchical.fit, "theta", permuted=F)
theta.plot.data <- data.frame("theta"=c(hierarchical.thetas, thetas), "model" = rep(c("hierarchical", "simple"), each=10000))
ggplot(theta.plot.data, aes(x=theta)) + facet_grid(~model) + xlim(c(0.9, 1)) + geom_histogram(binwidth=0.002)
## Warning: Removed 4 rows containing missing values (geom_bar).

Note that the posterior from the hierarchical model is slightly wider: its standard deviation is 0.00391, compared with 0.003895 for the simple model. This is reasonable, since we’ve added more uncertainty by introducing another layer of randomness.

Note also that the posterior distribution for \(m\) is very different from the prior one. Our model has learned that chocolate is actually rather good. On the other hand, the posterior distribution for \(n\) did not change that much. This is because our data contain little information about how strongly \(m\) should influence \(\theta\). We’ll get to that in the next section.
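To check this yourself, extract the posterior draws of \(n\) and \(m\) (this assumes that your hierarchical model declares parameters named n and m, as in the sketch above):

hierarchical.n <- extract(chocolate.hierarchical.fit, "n", permuted=F)
hierarchical.m <- extract(chocolate.hierarchical.fit, "m", permuted=F)
qplot(c(hierarchical.m), binwidth=0.01, xlab="m")
qplot(c(hierarchical.n), binwidth=0.5, xlab="n")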

Advanced Chocolate Modeling

So far, we’ve only inspected the overall proportion of tasty chocolate bars. The purpose of this task was to illustrate how to quantify the uncertainty in the data. This is extremely important in medical applications, because we need to be certain that a drug has no adverse effects. But in the case of chocolate, it’s rather useless.

What is not useless is using more advanced hierarchical models to compare the quality of chocolate produced by different companies. Such analyses are useful, for example, in market research.

Let’s assume that company \(i\) has quality \(\theta_i\). We define quality as the proportion of tasty chocolate bars produced by this company. Each company has its own quality, but it’s reasonable to assume that these qualities come from a single distribution. This distribution represents the overall quality of chocolate (or our prior beliefs about a chocolate bar’s tastiness regardless of the company it comes from).

This approach allows us to “exchange” information between companies. Suppose that we test two bars from a single company, and it turns out that both are tasty. Does this mean that the company’s quality is 100%? No! Just two bars is way too few to conclude that. However, we can conclude that this company’s quality might be slightly higher than the average quality of the other companies. This would be reflected in the posterior distribution of this company’s quality, \(\theta_i\): it would be shifted slightly above the average posterior distribution of quality.

Let \(X_i\) be the number of tasty chocolate bars produced by company \(i\). We consider the following hierarchical model:

\[\begin{align} X_i & \sim Bin(N_i, \theta_i), \\ \theta_i & \sim Beta(nm, n(1-m)), \\ n & \sim Gamma(s, r), \\ m & \sim Beta(\mu, \nu). \end{align}\]

The only thing that has changed is that now we have multiple parameters \(\theta_i\) (as many as there are companies). Our data now consist of two vectors: the first encoding the number of tested chocolate bars for the \(i\)-th company, and the second the number of tasty bars. The parameter \(m\) now corresponds to the overall quality of the chocolate bars, so it plays the role of the parameter \(\theta\) from the previous analyses.

The parameter \(n\) measures the influence of the prior quality on the individual qualities. If \(n\) is large (say, over 20), then the qualities \(\theta_i\) are all similar to the overall quality \(m\). This is because the parameters \(nm\) and \(n(1-m)\) are large, so the distributions of \(\theta_i\) are very narrow (recall that \(m\) is the mean value of \(\theta_i\)). On the other hand, if \(n\) is small (say, around 5), then the companies differ greatly in their quality, and the individual qualities \(\theta_i\) can be very different from the overall quality \(m\). This is because \(nm\) and \(n(1-m)\) are small, so the distributions of \(\theta_i\) are very wide. Therefore, individual companies add a lot of variability to the overall distribution of tastiness given by \(m\).
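To see this effect concretely, we can compare the prior density of \(\theta_i\) implied by a strong and a weak \(n\) (the value of \(m\) below is arbitrary):

m <- 0.8
curve(dbeta(x, 20*m, 20*(1-m)), from=0, to=1, ylab="density")   # n = 20: narrow, all companies close to m
curve(dbeta(x, 5*m, 5*(1-m)), from=0, to=1, add=TRUE, lty=2)    # n = 5: wide, companies may differ a lot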

We will assume a fairly weak prior (i.e. one that does not give much information). For this, we will let \(n\) be equal to \(10 \pm 10\), which corresponds to \(s = 1\) and \(r = 0.1\). This means that our prior information is worth on average \(10 \pm 10\) chocolate bars, and that it is most probably not worth more than \(50\) bars (because the density of the Gamma distribution is small for values larger than 50). Next, we will assume that the mean quality \(m\) is most probably between \(0.6\) and \(1\), and most probably not lower than \(0.5\). These conditions are satisfied by a prior on \(m\) with \(\mu = 10\) and \(\nu=2\). Feel free to adjust these parameters so that they reflect your own beliefs about chocolate ratings.
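These claims about the priors are easy to verify numerically (the lines below are only sanity checks, not part of the model):

pgamma(50, shape=1, rate=0.1)   # probability that n < 50 under Gamma(1, 0.1); close to 1
pbeta(0.5, 10, 2)               # probability that m < 0.5 under Beta(10, 2); very small
qbeta(c(0.05, 0.95), 10, 2)     # central 90% range of the prior on m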

Your tasks: aggregate the data into a data frame company.data with, for each company, the number of tested bars and the number of tasty bars; write the multi-company model above in the STAN language (as company.model); then fit it and inspect the posteriors as shown below. One possible sketch of both steps follows; treat it as a starting point rather than the only correct solution.
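A possible sketch, assuming the column names shown in the head(company.data) output below and the names used in the data_list (the aggregation code and the exact block layout are just one way to do it):

# Aggregate the chocolate data by company: total bars tested and tasty bars per company
tab <- table(chocolate$Company, chocolate$Good)
company.data <- data.frame(Brand = rownames(tab),
                           Number = as.integer(rowSums(tab)),
                           Tasty = as.integer(tab[, "Yes"]))
rownames(company.data) <- company.data$Brand

company.model <- "
data {
  int L;        // number of companies
  int N[L];     // number of tested bars per company
  int X[L];     // number of tasty bars per company
  real mu;
  real nu;      // hyperparameters of the prior on m
  real s;
  real r;       // hyperparameters of the prior on n
}
parameters {
  real<lower=0> n;
  real<lower=0, upper=1> m;
  vector<lower=0, upper=1>[L] theta;   // one quality per company
}
model {
  // Hyperpriors
  n ~ gamma(s, r);
  m ~ beta(mu, nu);
  // Company-level prior (vectorized over companies)
  theta ~ beta(n*m, n*(1-m));
  // Likelihood (vectorized over companies)
  X ~ binomial(N, theta);
}"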

head(company.data)
##                                 Brand Number Tasty
## A. Morin                     A. Morin     23    23
## Acalli                         Acalli      2     2
## Adi                               Adi      4     4
## Aequare (Gianduja) Aequare (Gianduja)      2     2
## Ah Cacao                     Ah Cacao      1     1
## Akesson's (Pralus) Akesson's (Pralus)      3     3
data_list <- list(L=nrow(company.data), N = company.data$Number, X = company.data$Tasty, nu = 2, mu = 10, s=1, r=0.1)
company.fit <- stan(model_code = company.model, data = data_list, warmup=10000, chains = 1, iter = 15000, control = list(max_treedepth = 25))
## 
## SAMPLING FOR MODEL 'e7908c76ebc23e77c4e072d913934bf7' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000423 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.23 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:     1 / 15000 [  0%]  (Warmup)
## Chain 1: Iteration:  1500 / 15000 [ 10%]  (Warmup)
## Chain 1: Iteration:  3000 / 15000 [ 20%]  (Warmup)
## Chain 1: Iteration:  4500 / 15000 [ 30%]  (Warmup)
## Chain 1: Iteration:  6000 / 15000 [ 40%]  (Warmup)
## Chain 1: Iteration:  7500 / 15000 [ 50%]  (Warmup)
## Chain 1: Iteration:  9000 / 15000 [ 60%]  (Warmup)
## Chain 1: Iteration: 10001 / 15000 [ 66%]  (Sampling)
## Chain 1: Iteration: 11500 / 15000 [ 76%]  (Sampling)
## Chain 1: Iteration: 13000 / 15000 [ 86%]  (Sampling)
## Chain 1: Iteration: 14500 / 15000 [ 96%]  (Sampling)
## Chain 1: Iteration: 15000 / 15000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 209.375 seconds (Warm-up)
## Chain 1:                99.1177 seconds (Sampling)
## Chain 1:                308.493 seconds (Total)
## Chain 1:
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
posterior.n <- extract(company.fit, "n", permuted=F)
posterior.m <- extract(company.fit, "m", permuted=F)
posterior.thetas <- extract(company.fit, "theta", permuted=F)
qplot(c(posterior.n), c(posterior.m))

my.companies <- data.frame(posterior.thetas[,,c(8, 120, 232, 398)])
my.companies <- melt(my.companies)
## No id variables; using all as measure variables
ggplot(data=my.companies, aes(x=value)) + facet_wrap(~variable) + geom_histogram(binwidth=0.002)

Let’s choose company number 232, “Machu Picchu Trading Co.”, to illustrate an important phenomenon. You can see that the plot of this company’s quality looks different - it clearly seems that this one is not doing as well as the others. If you inspect the data, you’ll see that two bars of this manufacturer were tested, and only one was tasty. However, note that the value 0.5 is not very probable in the posterior distribution. The so-called “90% Highest Density Interval” – the shortest interval encompassing 90% of the probability – is equal to \([0.82, 0.99]\). This means that in 90% of iterations, the quality belonged to this interval. Of course, it is possible that the quality is below 0.5 – it’s very unlikely, but possible.

This illustrates the phenomenon of information transfer between groups. We know that most of the bars are tasty. Therefore, even though only half of the tested bars of Machu Picchu Trading Co. are tasty, it’s unlikely that only half of all the bars of this company are tasty. After all, we’ve only tested two of them, and there may be hundreds of other tasty brands from this company. All we know is that this company may have lower quality than the other companies. How much lower? This is expressed by the 90% HDI - we can say that we’re 90% certain that the quality is between 0.82 and 0.99.

Compare this with company number 8 (“Altus aka Cao Artisan”). In this case, based on the 90% HDI, we’re 90% certain that the quality is between 0.95 and 1. For this company, we’ve tested 10 bars and all were tasty. The same principle of information transfer applies: even though all the tested bars were tasty, we’re not 100% sure that the quality is equal to 1 and that all the brands of this company are equally good. In fact, it’s very unlikely that this company has never ever produced an untasty bar. But we’re 90% sure that the quality is very high, above 0.95.

We won’t cover the HDI in more detail now, but here’s a function to compute it if you’re interested:

HDIofMCMC = function( sampleVec , credMass=0.9 ) {
  # Computes highest density interval from a sample of representative values,
  #   estimated as shortest credible interval.
  # Arguments:
  #   sampleVec
  #     is a vector of representative values from a probability distribution.
  #   credMass
  #     is a scalar between 0 and 1, indicating the mass within the credible
  #     interval that is to be estimated.
  # Value:
  #   HDIlim is a vector containing the limits of the HDI
  sortedPts = sort( sampleVec )
  ciIdxInc = ceiling( credMass * length( sortedPts ) )
  nCIs = length( sortedPts ) - ciIdxInc
  ciWidth = rep( 0 , nCIs )
  for ( i in 1:nCIs ) {
    ciWidth[ i ] = sortedPts[ i + ciIdxInc ] - sortedPts[ i ]
  }
  HDImin = sortedPts[ which.min( ciWidth ) ]
  HDImax = sortedPts[ which.min( ciWidth ) + ciIdxInc ]
  HDIlim = c( HDImin , HDImax )
  return( HDIlim )
}
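For example, the intervals quoted above can be recomputed (up to Monte Carlo error) directly from the posterior draws:

HDIofMCMC(c(posterior.thetas[, , 232]))   # Machu Picchu Trading Co.
HDIofMCMC(c(posterior.thetas[, , 8]))     # Altus aka Cao Artisan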