In this exercise (and the next one) we will use the Chocolate Bar Ratings dataset, which contains expert ratings of over 1,700 chocolate bars (Flavours of Chocolate).

The rating system is as follows:

5 = Elite (Transcending beyond the ordinary limits)
4 = Premium (Superior flavor development, character and style)
3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
2 = Disappointing (Passable but contains at least one significant flaw)
1 = Unpleasant (mostly unpalatable)

The dataset file: chocolate.csv (>1.7k ratings)

Task 1: You are asked to analyse the dataset:

  1. Where are the worst and best chocolates produced?
  2. Where are the worst and best beans grown?
  3. Is there any correlation between cocoa percentage and rating?
  4. In which country was the chocolate tested the most?
  5. Make plots for:
    • the distribution of Cocoa % (histogram + distribution line)
    • the distribution of Cocoa % (domestic vs. non-domestic*, two histograms + two distribution lines)
    • the distribution of rating
  6. Are chocolate bars with 100% Cocoa the best?

*Domestic means that "Company-Location"=="Broad-Bean-Origin".

To load the dataset you can use the function read_csv from the readr package.
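A minimal loading sketch, assuming chocolate.csv sits in the working directory:

library(readr)
chocolate <- read_csv("chocolate.csv")  # the dataset file mentioned above
head(chocolate)                         # inspect the first rows and the column names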

Task 2:

  • Find the column with the rating value

  • Append a column called Good with values Yes and No, depending on whether the rating is at least two.

    • Use the ifelse function
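A minimal sketch for Task 2, assuming (hypothetically) that the rating column found in the previous step is called Rating; replace it with the actual column name:

# Append a Good column: "Yes" if the rating is at least two, "No" otherwise
# ("Rating" is an assumed column name; check names(chocolate) first)
chocolate$Good <- ifelse(chocolate$Rating >= 2, "Yes", "No")
table(chocolate$Good)  # quick check of the Yes/No counts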

Maximum Likelihood Estimation

We’ll start with the classical approach, which we’ve already used before. This time, however, we’ll do it more formally. Assume that the ratings of the chocolate bars are independent and that a random chocolate bar is tasty with probability \(\theta\). We assume that the number of tasty chocolate bars follows a Binomial distribution \(Bin(n, \theta)\), where \(n\) is the number of inspected bars.

Your tasks:

MLE derivation: Let \(X \sim Bin(n, \theta)\); then, \(p(x|\theta) = {n \choose x}\theta^x(1-\theta)^{n-x}\). Treating \(x\) as constant, we look for the \(\theta\) which maximizes \(p(x | \theta)\). To this end, we compute the derivative and look for its root:

\[\frac{\partial}{\partial \theta} p(x | \theta) = {n \choose x} x\theta^{x-1}(1-\theta)^{n-x} - {n \choose x} \theta^x (n-x) (1-\theta)^{n-x-1} = 0.\]

Divide both sides by \({n \choose x} \theta^{x-1} (1-\theta)^{n-x-1}\) to get

\[x(1-\theta) - \theta (n- x) = 0 \leftrightarrow x - n\theta = 0 \leftrightarrow \theta = \frac{x}{n}.\]

The result is very intuitive: to estimate the proportion of good chocolate bars, we simply compute the proportion from the data.
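In R, this boils down to taking the sample proportion; a small sketch, assuming the Good column from Task 2 has been added:

# MLE of theta: proportion of tasty ("Good") bars among all inspected bars
x <- sum(chocolate$Good == "Yes")   # number of tasty bars
n <- nrow(chocolate)                # number of inspected bars
theta.hat <- x / n                  # same as mean(chocolate$Good == "Yes")
theta.hat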

Overall, the vast majority of chocolate bars are tasty. However, when we restricted our analysis to only four observations, we got very different results! Moreover, those results are not very credible, because they are inferred from just 4 chocolate bars. Bayesian analysis allows us to deal with this situation by using so-called prior information.

Bayesian Parameter Estimation

Even without any dataset, we know that more often than not chocolate is tasty. This is our prior knowledge about chocolate. Buying 4 chocolate bars and finding out that 3 are disgusting wouldn’t change our belief about chocolate in general - rather, we would only conclude that some of the brands are not that good. Intuitively, we know that a sample of 4 chocolate bars is not enough to say anything about all chocolate - there is a considerable amount of uncertainty in our results.

In the Bayesian approach, we treat the parameter \(\theta\) as a random variable with some distribution of our choice. The randomness represents our uncertainty about the true value of \(\theta\). As we will see, this approach has several advantages.

As before, we will assume that, if we know \(\theta\), the number of tasty chocolate bars follows a binomial distribution \(Bin(n, \theta)\). Now, however, we will assume that \(\theta\) is itself random, and follows a Beta distribution \(Beta(a, b)\). The density of \(\theta\), \(p(\theta)\), is equal to

\[p(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}.\]

Here, \(B(a, b)\) is a special function, and you don’t need to worry about it - its only purpose is to make \(p(\theta)\) integrate to 1. The distribution \(p(\theta)\) is called the prior distribution. The advantage of the Binomial-Beta model is that the parameters \(a, b\) have a straightforward interpretation: they are the numbers of tasty and untasty chocolate bars in our prior knowledge. Their sum, \(n=a+b\), represents the strength of our prior belief. The fraction \(m=a/(a+b)\) represents our prior belief about the proportion of tasty chocolate bars. Later on we will use \(n\) and \(m\) to parametrize the prior.
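As a small illustration of this parametrization (the prior values below are made up for the example): given a prior strength \(n\) and a prior proportion \(m\), the corresponding Beta parameters are \(a = mn\) and \(b = (1-m)n\).

# Illustrative prior "worth" 20 bars, 90% of which we believe to be tasty
prior.n <- 20
prior.m <- 0.9
a <- prior.m * prior.n         # prior count of tasty bars
b <- (1 - prior.m) * prior.n   # prior count of untasty bars
curve(dbeta(x, a, b), from = 0, to = 1, ylab = "p(theta)")  # shape of this prior belief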

The distribution of the number of tasty chocolate bars conditioned on \(\theta\) is denoted \(p(x | \theta)\). This is our binomial distribution. The marginal distribution of tasty chocolate bars, \(p(x)\), is obtained by integrating over \(\theta\):

\[p(x) = \int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta.\]

This density represents the overall probability of observing \(x\) tasty chocolate bars, regardless of the parameter \(\theta\). It depends on the parameters \(a\) and \(b\).
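For the Beta prior this integral has a closed form, \(p(x) = {n \choose x} \frac{B(a + x, b + n - x)}{B(a, b)}\) (the Beta-Binomial distribution). A base-R sketch, using beta() for \(B(\cdot, \cdot)\) and illustrative parameter values:

# Marginal probability of x tasty bars out of n under a Beta(a, b) prior
marginal.prob <- function(x, n, a, b) {
  choose(n, x) * beta(x + a, n - x + b) / beta(a, b)
}
marginal.prob(3, 4, a = 2, b = 2)  # e.g. 3 tasty bars out of 4, illustrative prior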

To this point, we’ve covered the prior knowledge and its influence on our beliefs about the number of tasty chocolate bars. What about our goal - the estimation of \(\theta\)? In the Bayesian approach, we’re interested in the posterior distribution of our parameters. The posterior distribution is the distribution of \(\theta\) conditioned on our observations, \(p(\theta | x)\). It measures how our beliefs about the values of \(\theta\) are influenced by the data. To get the posterior, we use Bayes’ rule (hence the name of the approach):

\[p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} = \frac{p(x|\theta) p(\theta)}{\int_{\theta \in [0, 1]} p(x | \theta) p(\theta) d\theta}.\]

Note that \(\theta\) in the numerator is not the same as \(\theta\) in the denominator - the former is the fixed value at which we evaluate the posterior, while the latter is the integration variable.
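For the Beta prior used here this can be made explicit: the numerator is proportional to \(\theta^{x+a-1}(1-\theta)^{n-x+b-1}\), and the denominator does not depend on \(\theta\), so

\[p(\theta | x) \propto \theta^{x + a - 1}(1-\theta)^{n - x + b - 1},\]

i.e. the posterior is again a Beta distribution, \(Beta(a + x, b + n - x)\).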

Your tasks:

#install.packages('ggplot2')
library(ggplot2)

# Grid of Beta parameters to visualize different prior shapes
a.values <- c(0.1, 0.5, 1, 2, 20)
b.values <- a.values
x.values <- 1:19/20  # to exclude end-points

# Evaluate the Beta(a, b) density on the grid for every (a, b) combination
plot.data <- data.frame('a'=numeric(), 'b'=numeric(), 'x'=numeric(), 'y'=numeric())
for(a in a.values){
  for(b in b.values){
    to.append <- data.frame('a'=a, 'b'=b, 'x'=x.values, 'y'=dbeta(x.values, a, b))
    plot.data <- rbind(plot.data, to.append)
  }
}

# One panel per (a, b) pair: rows correspond to a, columns to b
ggplot(data=plot.data, aes(x=x, y=y)) + geom_line() + facet_grid(a~b)

# Binomial pmf of the number of tasty bars, with theta set to the observed
# proportion of Good == "Yes" (requires the Good column from Task 2)
qplot(1760:1790, dbinom(1760:1790, 1795, mean(chocolate$Good == "Yes")), geom='line')