Maximum Likelihood Estimation

In the previous post, we introduced the Accelerated Failure Time (AFT) model.

(1)   \begin{equation*}\ln{(y_i)} = \sum_{j=1}^d w_j x_{ij} + \sigma\epsilon_i\end{equation*}

We’d now like to estimate the weights w_1, w_2, \ldots, w_d. We will use a well-known technique in statistics called Maximum Likelihood Estimation (MLE):

Principle of Maximum Likelihood Estimation. Given a set of observations, the choice of parameter value \theta^{\mathrm{MLE}} that maximizes the likelihood of the observations is chosen as the “best” estimate for the parameter \theta.

Whew! That’s a mouthful of words! In the remainder of the post, we’ll take some time to unpack this definition. Then in a subsequent post we’ll use MLE to estimate the weights w_{*} in the AFT model (1).

To explain what MLE is doing, let’s take a simple example: suppose I’m giving you a two-sided coin, and you’d like to know whether it’s a fair coin or a loaded coin. If it’s fair, heads and tails will come up with half/half probabilities. If it’s not, the probability for heads will be some value other than 1/2. Let p represent the (unknown) probability for heads. The behavior of the coin is modeled by the Bernoulli distribution: c \sim \mathrm{Bernoulli}(p), where c is the outcome of a coin throw.

Let’s also suppose we threw the coin 4 times in a row and obtained the following outcomes:

    \[c_1 = 1, c_2 = 1, c_3 = 0, c_4 = 1,\]


where 1 indicates a head and 0 a tail. What is a good estimate of p? Should p set to 0.5 or 0.7 or 0.9?

MLE lets us decide among all possible candidates for p. Given the set of outcomes c_1, c_2, c_3, c_4, we will compute the MLE estimate of p. We first write the likelihood function for p. The likelihood for p is the probability of obtaining the specific observations c_1, c_2, c_3, c_4 given a particular value of p:

(2)   \begin{equation*}L(p) = \mathbb{P}[c_1 = 1, c_2 = 1, c_3 = 0, c_4 = 1 | p]\end{equation*}

(\mathbb{P}[*] signifies the probability of an event. “|” (vertical bar) is read as “given that”)

The probability density function (PDF) for the Bernoulli distribution is

(3)   \begin{equation*}\mathbb{P}[c = k] = \begin{cases} p &\text{if } k =1 \\ 1-p &\text{if } k = 0 \end{cases}\end{equation*}

Plugging in (3) into the likelihood function L, we obtain

(4)   \begin{align*}L(p) &= \mathbb{P}[c_1 = 1, c_2 = 1, c_3 = 0, c_4 = 1 | p] \\&= \mathbb{P}[c_1 = 1 | p]\mathbb{P}[c_2 = 1 | p]\mathbb{P}[c_3 = 0 | p]\mathbb{P}[c_4 = 1 | p] \\&= p^3 (1-p)\end{align*}

In deriving (4), we had assumed that the trials c_1, c_2, c_3, c_4 were i.i.d., in that

  • The 4 trials are independent from one another. For example, the outcome of c_1 does not affect the outcome of c_4.
  • The 4 trials are all drawn from the same distribution, namely the Bernoulli distribution with parameter p.

In the coin flip case, independence is a reasonable assumption.

We have obtained the likelihood function L(p) = p^3 (1-p). Note that the likelihood function is particular to the set of observations c_1, c_2, c_3, c_4. If we had different set of observations, we’d have different likelihood function. Once the likelihood function is found, we can use it as a yardstick to measure the “fitness” or “goodness” of candidates for p:

Candidate value for pL(p) = p^3 (1-p)
0.50.0625
0.70.1020
0.90.0729

So out of candidates {0.5, 0.7, 0.9}, 0.7 is the best parameter according to the likelihood function L. But how do we get the best estimate? We get it by maximizing L:

(5)   \begin{equation*}p^{\mathrm{MLE}} = \argmax_p{L(p)} = \argmax_p{p^3 (1-p)} = \frac{3}{4}.\end{equation*}

This value, 3/4, is called Maximum Likelihood Estimate. This value makes sense intuitively, since three-fourths of the observations were heads. We’d have a different Maximum Likelihood Estimate if the observations were different.

Over the years, statisticians have studied MLE extensively. MLE has many nice properties, such as efficiency. MLE is also well used in machine learning, since with MLE we can reduce the problem of parameter estimation to the problem of optimization. When introduced in 1920s, MLE was criticized as computationally difficult, as most uses of MLE require numerical optimization1. MLE grew attractive over time, as computational power (multi-core CPUs! GPUs! Clusters!) and optimization algorithms2 improved. Now MLE powers nooks and crannies of the machine learning enterprise.

  1. Page 41, Computer Age Statistical Inference by Bradley Efron and Trevor Hastie (2016)
  2. For example, see Convex Optimization by Stephen Boyd and Lieven Vandenberghe (2004)

Leave a Reply

Your email address will not be published. Required fields are marked *