Chocolate dataset

In the file Datasets/Chocolate/chocolate.csv you have a dataset with ratings of 1700 chocolate bars. The rating system is as follows:

5 = Elite (Transcending beyond the ordinary limits)
4 = Premium (Superior flavor development, character and style)
3 = Satisfactory (3.0) to Praiseworthy (3.75) (well made with special qualities)
2 = Disappointing (Passable but contains at least one significant flaw)
1 = Unpleasant (mostly unpalatable)

We’re interested in how many chocolate brands are tasty. To check this, we will estimate the proportion of tasty chocolate bars in our dataset. We’ll assume that a chocolate bar is tasty if its rating is at least 2. We’ll investigate the dataset using the classical approach of maximum likelihood estimation (MLE) and the Bayesian approach.
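
Before we start, here is a minimal sketch of loading the data and flagging the tasty bars; the name of the rating column (Rating) is an assumption and may need to be adjusted to the actual file.

# A minimal sketch, assuming the rating column is called Rating
chocolate <- read.csv("Datasets/Chocolate/chocolate.csv")
tasty <- chocolate$Rating >= 2  # TRUE if the bar is tasty (rating at least 2)
table(tasty)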

Your tasks:

Maximum Likelihood Estimation

We’ll start with the classical approach, which we’ve already used many times. This time, however, we’ll do it more formally. Assume that the rating of each chocolate bar is independent of the others. A random chocolate bar is tasty with probability \(\theta\). It follows that the number of tasty chocolate bars follows a Binomial distribution \(Bin(n, \theta)\), where \(n\) is the number of inspected bars.
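
For intuition, here is a tiny simulation sketch of this model; the values \(\theta = 0.9\) and \(n = 20\) are arbitrary and chosen purely for illustration.

# Simulate the number of tasty bars among n = 20 inspected ones,
# for an arbitrary, assumed value of theta = 0.9
set.seed(1)
rbinom(5, size = 20, prob = 0.9)  # five simulated counts of tasty bars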

Your tasks:

MLE derivation: Let \(X \sim Bin(n, \theta)\); then, \(p(x|\theta) = {n \choose x}\theta^x(1-\theta)^{n-x}\). Treating \(x\) as a constant, we look for the \(\theta\) which maximizes \(p(x | \theta)\). To this end, we compute the derivative and look for its root:

\[\frac{\partial}{\partial \theta} p(x | \theta) = {n \choose x} x\theta^{x-1}(1-\theta)^{n-x} - {n \choose x} \theta^x (n-x) (1-\theta)^{n-x-1} = 0.\]

Divide both sides by \({n \choose x} \theta^{x-1} (1-\theta)^{n-x-1}\) to get

\[x(1-\theta) - \theta (n- x) = 0 \leftrightarrow x - x\theta - n\theta + x\theta = 0 \leftrightarrow x - n\theta = 0 \leftrightarrow \theta = \frac{x}{n}.\]

The result is very intuitive: to estimate the proportion of good chocolate bars, we simply compute the proportion from the data.
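
A minimal sketch of this computation, reusing the tasty indicator defined above (so the column name Rating is still an assumption); the call to optimize merely double-checks numerically that the sample proportion maximizes the binomial likelihood.

# MLE of theta: the sample proportion of tasty bars
n <- length(tasty)
x <- sum(tasty)
theta.mle <- x / n
theta.mle

# Numerical sanity check: maximize the binomial likelihood over theta
optimize(function(theta) dbinom(x, size = n, prob = theta),
         interval = c(0, 1), maximum = TRUE)$maximum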

Overall, the vast majority of chocolate bars are tasty. However, when we restricted our analysis to only four observations, we got very different results! They are also not very credible, because they are inferred from just 4 chocolate bars. Bayesian analysis allows us to deal with this situation by using so-called prior information.
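
For illustration, the same estimator applied to just the first four bars in the file (a sketch; any other subset of four bars would make the same point):

# MLE computed from only four observations - a much less reliable estimate
mean(tasty[1:4])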

Bayesian Parameter Estimation

Even without any dataset, we know that more often than not chocolate is tasty. This is our prior knowledge about chocolate. Buying 4 chocolate bars and finding out that 3 are disgusting wouldn’t change our belief about chocolate in general - rather, we would only conclude that some brands are not that good. Intuitively, we know that a sample of 4 chocolate bars is not enough to say anything about all chocolate - there is a considerable amount of uncertainty in our results.

In the Bayesian approach, we treat the parameter \(\theta\) as a random variable with some distribution of our choice. The randomness represents our uncertainty about the true value of \(\theta\). As we will see, this approach has several advantages.

As before, we will assume that, if we know \(\theta\), the number of tasty chocolate bars follows a binomial distribution \(Bin(n, \theta)\). Now, however, we will assume that \(\theta\) is itself random, and follows a Beta distribution \(Beta(a, b)\). The density of \(\theta\), \(p(\theta)\), is equal to

\[p(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}.\]

Here, \(B(a, b)\) is a special function, and you don’t need to worry about it - its only purpose is to make \(p(\theta)\) integrate to 1. The distribution \(p(\theta)\) is called the prior distribution. The advantage of the Binomial-Beta model is that the parameters \(a, b\) have a straightforward interpretation: they are the numbers of tasty and untasty chocolate bars in our prior knowledge. Their sum, \(n=a+b\), represents the strength of our prior belief. The fraction \(m=a/(a+b)\) represents our prior belief about the proportion of tasty chocolate bars. Later on we will use \(n\) and \(m\) to parametrize the prior.
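
A small sketch of this reparametrization; the values \(m = 0.9\) and \(n = 10\) are arbitrary and serve only to illustrate the conversion from \((m, n)\) back to \((a, b)\).

# Convert the (m, n) parametrization of the prior back to (a, b)
m.prior <- 0.9                 # assumed prior proportion of tasty bars
n.prior <- 10                  # assumed strength of the prior belief
a <- m.prior * n.prior         # "prior" tasty bars
b <- (1 - m.prior) * n.prior   # "prior" untasty bars
curve(dbeta(x, a, b), from = 0, to = 1)  # prior density of theta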

The distribution of the number of tasty chocolate bars conditioned on \(\theta\) is denoted \(p(x | \theta)\). This is our binomial distribution. The marginal distribution of the number of tasty chocolate bars, \(p(x)\), is obtained by integrating over \(\theta\):

\[p(x) = \int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta.\]

This density represents the overall probability of observing \(x\) tasty chocolate bars, regardless of the parameter \(\theta\). It depends on the parameters \(a\) and \(b\).
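
One way to see what this means in practice is to evaluate \(p(x)\) by numerical integration; the sketch below does this for a single value of \(x\), with marginal.x being a hypothetical helper name and the values \(x = 3\), \(n = 4\), \(a = 9\), \(b = 1\) chosen arbitrarily.

# Marginal probability p(x): integrate p(x | theta) p(theta) over theta in [0, 1]
marginal.x <- function(x, n, a, b) {
  integrate(function(theta) dbinom(x, size = n, prob = theta) * dbeta(theta, a, b),
            lower = 0, upper = 1)$value
}
marginal.x(x = 3, n = 4, a = 9, b = 1)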

So far, we’ve covered the prior knowledge and its influence on our beliefs about the number of tasty chocolate bars. What about our goal - the estimation of \(\theta\)? In the Bayesian approach, we’re interested in the posterior distribution of our parameters. The posterior distribution is the distribution of \(\theta\) conditioned on our observations, \(p(\theta | x)\). It measures how our beliefs about the values of \(\theta\) are influenced by the data. To get the posterior, we use Bayes’ rule (hence the name of the approach):

\[p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} = \frac{p(x|\theta) p(\theta)}{\int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta}.\]

Note that the \(\theta\) in the numerator is not the same as the \(\theta\) in the denominator - the former is the fixed value at which we evaluate the posterior, while the latter is the integration variable.
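
The formula above can be evaluated directly on a grid of \(\theta\) values, without any closed-form results; the sketch below does exactly that, reusing the hypothetical marginal.x helper and the same arbitrary values of \(x\), \(n\), \(a\) and \(b\) as before.

# Posterior density p(theta | x), computed straight from Bayes' rule
posterior <- function(theta, x, n, a, b) {
  dbinom(x, size = n, prob = theta) * dbeta(theta, a, b) / marginal.x(x, n, a, b)
}
theta.grid <- seq(0.01, 0.99, by = 0.01)
plot(theta.grid, posterior(theta.grid, x = 3, n = 4, a = 9, b = 1), type = "l")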

Your tasks:

library(ggplot2)

# Beta(a, b) densities for a grid of parameter values
a.values <- c(0.1, 0.5, 1, 2, 20)
b.values <- a.values
x.values <- 1:19/20  # to exclude end-points
plot.data <- data.frame('a'=numeric(), 'b'=numeric(), 'x'=numeric(), 'y'=numeric())
for(a in a.values){
  for(b in b.values){
    to.append <- data.frame('a'=a, 'b'=b, 'x'=x.values, 'y'=dbeta(x.values, a, b))
    plot.data <- rbind(plot.data, to.append)
  }
}

# One panel per (a, b) combination: rows correspond to a, columns to b
ggplot(data=plot.data, aes(x=x, y=y)) + geom_line() + facet_grid(a~b)

# Binomial distribution of the number of tasty bars, with theta set to the observed
# proportion; assumes `chocolate` is already loaded and has a Good column ("Yes"/"No")
# flagging the tasty bars
qplot(1760:1790, dbinom(1760:1790, 1795, mean(chocolate$Good == "Yes")), geom='line')