# Point Estimation Basics: Biased Coin

Before we get to any complex learning algorithms, let's begin with the most basic learning problem. A biased coin.

Suppose you found a funny looking coin on your way to class and, being a scientist, started flipping it. What is the simplest model of this scientific experiment? Each flip is a Bernoulli (binary) variable, identically and independently distributed (i.i.d.):

P(Heads) = \theta \; {\rm and} \;P(Tails) = 1-\theta, \; {\rm where}\; \theta \in [0,1].

So if you flip the coin N times, with H heads and T tails, the likelihood of observing a sequence (data) D is

P(D\mid\theta) = \theta^{H}(1-\theta)^{T}.

## Maximum Likelihood Estimation (MLE) for the coin

You observe 3 heads and 2 tails. What's your guess for \theta? If you guessed 3/5, you might be doing MLE, which is simply finding a model that best explains your experiment:

\hat\theta_{MLE} = \arg \max_\theta P(D\mid\theta) = \arg \max_\theta \;\;\; \log P(D\mid\theta)

Why log? Because when we take the log, the log of a product becomes a sum of logs:

  \log P(D\mid\theta) = H\log(\theta) + T\log(1-\theta).


How do we find the \theta that maximizes the log-likelihood of the data? Let's try to set the derivative to zero:

\frac{d}{d\theta} \log P(D\mid\theta) = 0 \rightarrow \hat\theta_{MLE} = \frac{H}{H+T}
.

Why and when is this a good idea? In practice, MLE principle is a basis for a majority of practical parameter estimation algorithms, as we will see. If you have enough data and the correct model, MLE is all you need: it will find the true parameters.

The problem is that you rarely have either enough data or the correct model, so MLE can be terribly wrong. Example: suppose all five tosses come up heads, would you believe MLE? What about five million tosses, all heads?

## Learning guarantees

How sure can you be in your estimate, supposing you still believe your model? The following bound (based on Hoeffding inequality) will be very useful later in the course:

P(|\hat\theta_{MLE} - \theta^*| \ge \epsilon)\le 2e^{-2N\epsilon^2,}

where \theta^* is the true parameter and \epsilon > 0.

In words: The probability of the observed proportion of heads deviating by \epsilon from the true proportion in an i.i.d. sample of N tosses decreases exponentially with N and \epsilon. So if I want to be 95% sure that I guessed the proportion plus/minus 0.1. We have

0.05 = 2e^{-2N*0.1^2} \rightarrow N = \frac{\log \frac{2}{0.05}}{2*0.1^2}\approx 185

Here and below, log is the natural log. In general, if I want 1-\delta confidence with \epsilon (absolute) error, I need

N \ge \frac{\log \frac{2}{\delta}}{2\epsilon^2}.

Here are the required number of independent tosses for a several values of confidence \delta: A plot of the bound for several values of \delta.

Guarantees of this kind is what probably approximately correct (PAC) framework aims to derive for learning algorithms, where the key goal is polynomial or better dependence of N (sample complexity) on other factors. We'll talk about this much more later in the course.

## The Bayesian way

The fact is, that you have some experience with coins, so when you don't have enough data, you rely on your prior knowledge that most coins are pretty fair to estimate the parameter. If you're a Bayesian, you will embrace uncertainty but quantify it, by assuming that your prior belief about coin fairness can be described with a distribution P(\theta). In particular, the Beta distribution, with several examples shown below, is a very convenient class of priors: We'll talk about the exact form of the prior in a moment, but first, what do we do with it? The Bayesian framework uses data to update its distribution over the parameters. Using Bayes rule, we obtain a posterior over parameters:

P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} \propto P(D | \theta) P(\theta)

Now let's look at the Beta distribution:

Beta(\alpha,\beta):\;\; P(\theta \mid \alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1},

where \Gamma is the gamma function (for positive integers n, \Gamma(n)=(n-1)!). The hyperparameters \alpha\ge 1,\beta \ge 1 control the shape of the prior: their sum controls peakiness and their ratio controls the left-right bias. Why is Beta so convenient? Here's why:

P(\theta | D) \propto P(D | \theta) P(\theta) \propto \theta^{H+\alpha-1}(1-\theta)^{T+\beta-1}

Hence, P(\theta | D) = Beta(H+\alpha;T+\beta). This property of fit between a model and its prior is called conjugacy, which essentially means the posterior is of the same distributional family as the prior.

OK, so now we have a posterior, but what if we want a single number (an estimate). The most common answer, basically because it is often computationally the easiest, is the maximum a posteriori (MAP) estimate:

\hat\theta_{MAP} = \arg \max_\theta P(\theta\mid D) = \arg \max_\theta \left( (\log P(D\mid\theta) + \log P(\theta) \right)

For the Beta(\alpha;\beta) prior, the MAP estimate is:

\hat\theta_{MAP} = \frac{H+\alpha-1}{H+T+\alpha+\beta-2}

Note that as N = H+T \rightarrow \infty, the prior's effect vanishes and we recover MLE, which is what we want: in the evidence of enough data, prior shouldn't matter. We will see this vanishing prior effect in many scenarios.

MAP estimate is only the simplest Bayesian approach to parameter estimation. There is much more information in the posterior distribution P(\theta\mid D) , so one could ask for the posterior mean and variance of the parameter, for example. Why the mean? Suppose you want to predict what the probability of the next flip coming up heads is, given the seen data is exactly the posterior mean: