Chocolate dataset

In the file Datasets/Chocolate/chocolate.csv you have a dataset with ratings of 1700 chocolate bars. The rating system is as follows:

5 = Elite (Transcending beyond the ordinary limits)
4 = Premium (Superior flavor development, character and style)
3 = Satisfactory (3.0) to Praiseworthy (3.75) (well made with special qualities)
2 = Disappointing (Passable but contains at least one significant flaw)
1 = Unpleasant (mostly unpalatable)

We’re interested in how many chocolate brands are tasty. To check this, we will estimate the proportion of tasty chocolate bars in our dataset. We’ll assume that a chocolate bar is tasty if its rating is at least 2. We’ll investigate the dataset using the classical approach of maximum likelihood estimation (MLE) and the Bayesian approach.
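
Before we start, here is a minimal sketch of loading the data and flagging the tasty bars; the name of the rating column (Rating) is an assumption and may need to be adjusted to the actual file.

# A minimal sketch, assuming the rating column is called Rating
chocolate <- read.csv("Datasets/Chocolate/chocolate.csv")
tasty <- chocolate$Rating >= 2  # TRUE if the bar is tasty (rating at least 2)
table(tasty)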

Your tasks:

Maximum Likelihood Estimation

We’ll start with the classical approach, which we’ve already used many times. This time, however, we’ll do it more formally. Assume that the rating of each chocolate bar is independent of the others. A random chocolate bar is tasty with probability \(\theta\). It follows that the number of tasty chocolate bars follows a Binomial distribution \(Bin(n, \theta)\), where \(n\) is the number of inspected bars.
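
For intuition, here is a tiny simulation sketch of this model; the values \(\theta = 0.9\) and \(n = 20\) are arbitrary and chosen purely for illustration.

# Simulate the number of tasty bars among n = 20 inspected ones,
# for an arbitrary, assumed value of theta = 0.9
set.seed(1)
rbinom(5, size = 20, prob = 0.9)  # five simulated counts of tasty bars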

Your tasks:

MLE derivation: Let \(X \sim Bin(n, \theta)\); then, \(p(x|\theta) = {n \choose x}\theta^x(1-\theta)^{n-x}\). Treating \(x\) as a constant, we look for the \(\theta\) which maximizes \(p(x | \theta)\). To this end, we compute the derivative and look for its root:

\[\frac{\partial}{\partial \theta} p(x | \theta) = {n \choose x} x\theta^{x-1}(1-\theta)^{n-x} - {n \choose x} \theta^x (n-x) (1-\theta)^{n-x-1} = 0.\]

Divide both sides by \({n \choose x} \theta^{x-1} (1-\theta)^{n-x-1}\) to get

\[x(1-\theta) - \theta (n- x) = 0 \leftrightarrow x - x\theta - n\theta + x\theta = 0 \leftrightarrow x - n\theta = 0 \leftrightarrow \theta = \frac{x}{n}.\]

The result is very intuitive: to estimate the proportion of good chocolate bars, we simply compute the proportion from the data.
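
A minimal sketch of this computation, reusing the tasty indicator defined above (so the column name Rating is still an assumption); the call to optimize merely double-checks numerically that the sample proportion maximizes the binomial likelihood.

# MLE of theta: the sample proportion of tasty bars
n <- length(tasty)
x <- sum(tasty)
theta.mle <- x / n
theta.mle

# Numerical sanity check: maximize the binomial likelihood over theta
optimize(function(theta) dbinom(x, size = n, prob = theta),
         interval = c(0, 1), maximum = TRUE)$maximum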

Overall, the vast majority of chocolate bars are tasty. However, when we restricted our analysis to only four observations, we got very different results! They are also not very credible, because they are inferred from just 4 chocolate bars. Bayesian analysis allows us to deal with this situation by using so-called prior information.
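
For illustration, the same estimator applied to just the first four bars in the file (a sketch; any other subset of four bars would make the same point):

# MLE computed from only four observations - a much less reliable estimate
mean(tasty[1:4])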

Bayesian Parameter Estimation

Even without any dataset, we know that more often than not chocolate is tasty. This is our prior knowledge about chocolate. Buying 4 chocolate bars and finding out that 3 are disgusting wouldn’t change our belief about chocolate in general - rather, we would only conclude that some brands are not that good. Intuitively, we know that a sample of 4 chocolate bars is not enough to say anything about all chocolate - there is a considerable amount of uncertainty in our results.

In the Bayesian approach, we treat the parameter \(\theta\) as a random variable with some distribution of our choice. The randomness represents our uncertainty about the true value of \(\theta\). As we will see, this approach has several advantages.

As before, we will assume that, if we know \(\theta\), the number of tasty chocolate bars follows a binomial distribution \(Bin(n, \theta)\). Now, however, we will assume that \(\theta\) is itself random, and follows a Beta distribution \(Beta(a, b)\). The density of \(\theta\), \(p(\theta)\), is equal to

\[p(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}.\]

Here, \(B(a, b)\) is a special function, and you don’t need to worry about it - its only purpose is to make \(p(\theta)\) integrate to 1. The distribution \(p(\theta)\) is called the prior distribution. The advantage of the Binomial-Beta model is that the parameters \(a, b\) have a straightforward interpretation: they are the numbers of tasty and untasty chocolate bars in our prior knowledge. Their sum, \(n=a+b\), represents the strength of our prior belief. The fraction \(m=a/(a+b)\) represents our prior belief about the proportion of tasty chocolate bars. Later on we will use \(n\) and \(m\) to parametrize the prior.
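
A small sketch of this reparametrization; the values \(m = 0.9\) and \(n = 10\) are arbitrary and serve only to illustrate the conversion from \((m, n)\) back to \((a, b)\).

# Convert the (m, n) parametrization of the prior back to (a, b)
m.prior <- 0.9                 # assumed prior proportion of tasty bars
n.prior <- 10                  # assumed strength of the prior belief
a <- m.prior * n.prior         # "prior" tasty bars
b <- (1 - m.prior) * n.prior   # "prior" untasty bars
curve(dbeta(x, a, b), from = 0, to = 1)  # prior density of theta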

The distribution of the number of tasty chocolate bars conditioned on \(\theta\) is denoted \(p(x | \theta)\). This is our binomial distribution. The marginal distribution of the number of tasty chocolate bars, \(p(x)\), is obtained by integrating over \(\theta\):

\[p(x) = \int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta.\]

This density represents the overall probability of observing \(x\) tasty chocolate bars, regardless of the parameter \(\theta\). It depends on the parameters \(a\) and \(b\).
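
One way to see what this means in practice is to evaluate \(p(x)\) by numerical integration; the sketch below does this for a single value of \(x\), with marginal.x being a hypothetical helper name and the values \(x = 3\), \(n = 4\), \(a = 9\), \(b = 1\) chosen arbitrarily.

# Marginal probability p(x): integrate p(x | theta) p(theta) over theta in [0, 1]
marginal.x <- function(x, n, a, b) {
  integrate(function(theta) dbinom(x, size = n, prob = theta) * dbeta(theta, a, b),
            lower = 0, upper = 1)$value
}
marginal.x(x = 3, n = 4, a = 9, b = 1)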

So far, we’ve covered the prior knowledge and its influence on our beliefs about the number of tasty chocolate bars. What about our goal - the estimation of \(\theta\)? In the Bayesian approach, we’re interested in the posterior distribution of our parameters. The posterior distribution is the distribution of \(\theta\) conditioned on our observations, \(p(\theta | x)\). It measures how our beliefs about the values of \(\theta\) are influenced by the data. To get the posterior, we use Bayes’ rule (hence the name of the approach):

\[p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} = \frac{p(x|\theta) p(\theta)}{\int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta}.\]

Note that the \(\theta\) in the numerator is not the same as the \(\theta\) in the denominator - the former is the fixed value at which we evaluate the posterior, while the latter is the integration variable.
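
The formula above can be evaluated directly on a grid of \(\theta\) values, without any closed-form results; the sketch below does exactly that, reusing the hypothetical marginal.x helper and the same arbitrary values of \(x\), \(n\), \(a\) and \(b\) as before.

# Posterior density p(theta | x), computed straight from Bayes' rule
posterior <- function(theta, x, n, a, b) {
  dbinom(x, size = n, prob = theta) * dbeta(theta, a, b) / marginal.x(x, n, a, b)
}
theta.grid <- seq(0.01, 0.99, by = 0.01)
plot(theta.grid, posterior(theta.grid, x = 3, n = 4, a = 9, b = 1), type = "l")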

Your tasks:

library(ggplot2)

# Beta(a, b) densities for a grid of parameter values
a.values <- c(0.1, 0.5, 1, 2, 20)
b.values <- a.values
x.values <- 1:19/20  # to exclude end-points
plot.data <- data.frame('a'=numeric(), 'b'=numeric(), 'x'=numeric(), 'y'=numeric())
for(a in a.values){
  for(b in b.values){
    to.append <- data.frame('a'=a, 'b'=b, 'x'=x.values, 'y'=dbeta(x.values, a, b))
    plot.data <- rbind(plot.data, to.append)
  }
}

# One panel per (a, b) combination: rows correspond to a, columns to b
ggplot(data=plot.data, aes(x=x, y=y)) + geom_line() + facet_grid(a~b)

# Binomial distribution of the number of tasty bars, with theta set to the observed
# proportion; assumes `chocolate` is already loaded and has a Good column ("Yes"/"No")
# flagging the tasty bars
qplot(1760:1790, dbinom(1760:1790, 1795, mean(chocolate$Good == "Yes")), geom='line')