# Maximum Likelihood Estimation

In the previous post, we introduced the Accelerated Failure Time (AFT) model:

$$\ln{Y} = \langle \mathbf{w}, \mathbf{x} \rangle + \sigma Z \quad (1)$$

We'd now like to estimate the weights $\mathbf{w}$. We will use a well-known technique in statistics called Maximum Likelihood Estimation (MLE):

**Principle of Maximum Likelihood Estimation.** Given a set of observations, the parameter value that maximizes the likelihood of the observations is chosen as the "best" estimate for the parameter.

Whew! That's a mouthful of words! In the remainder of the post, we'll take some time to unpack this definition. Then in a subsequent post, we'll use MLE to estimate the weights $\mathbf{w}$ in the AFT model (1).

To explain what MLE is doing, let's take a simple example: suppose I give you a two-sided coin, and you'd like to know whether it's a fair coin or a loaded coin. If it's fair, heads and tails each come up with probability 1/2. If it's not, the probability of heads will be some value other than 1/2. Let $\theta$ represent the (unknown) probability of heads. The behavior of the coin is modeled by the Bernoulli distribution: $X \sim \mathrm{Bernoulli}(\theta)$, where $X$ is the outcome of a coin throw.

Let's also suppose we threw the coin 4 times in a row and obtained the following outcomes: $\mathcal{D} = (x_1, x_2, x_3, x_4) = (1, 0, 1, 1)$, where 1 indicates a head and 0 a tail. What is a good estimate of $\theta$? Should we set $\theta$ to 0.5 or 0.7 or 0.9?

MLE lets us decide among all possible candidates for $\theta$. Given the set of outcomes $\mathcal{D}$, we will compute the MLE estimate of $\theta$. We first write the likelihood function $\mathcal{L}(\theta)$. The likelihood for $\theta$ is the probability of obtaining the specific observations $\mathcal{D}$ given a particular value of $\theta$:

$$\mathcal{L}(\theta) = P(\mathcal{D} \mid \theta) = P(x_1, x_2, x_3, x_4 \mid \theta) \quad (2)$$

($P(\cdot)$ signifies the probability of an event. The vertical bar "|" is read as "given that.")

The probability mass function (PMF) for the Bernoulli distribution is

$$f(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}, \quad x \in \{0, 1\} \quad (3)$$

Plugging (3) into the likelihood function $\mathcal{L}(\theta)$, we obtain

$$\mathcal{L}(\theta) = \prod_{i=1}^{4} f(x_i \mid \theta) = \prod_{i=1}^{4} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{3} (1 - \theta) \quad (4)$$

In deriving (4), we assumed that the trials were i.i.d. (independent and identically distributed), in that

• The 4 trials are independent from one another. For example, the outcome of $x_1$ does not affect the outcome of $x_2$.
• The 4 trials are all drawn from the same distribution, namely the Bernoulli distribution with parameter $\theta$.

In the coin flip case, independence is a reasonable assumption.
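The likelihood in equation (4) can be sketched in a few lines of Python. This is a minimal illustration, assuming the observed outcomes were (1, 0, 1, 1), i.e. three heads and one tail:

```python
def bernoulli_pmf(x, theta):
    """Probability of a single outcome x (1 = head, 0 = tail) under parameter theta."""
    return theta if x == 1 else 1.0 - theta

def likelihood(theta, observations=(1, 0, 1, 1)):
    """L(theta): product of per-trial probabilities, valid under the i.i.d. assumption."""
    result = 1.0
    for x in observations:
        result *= bernoulli_pmf(x, theta)
    return result

# With three heads and one tail, L(theta) = theta^3 * (1 - theta).
print(likelihood(0.5))  # 0.0625
```

Multiplying the per-trial probabilities is exactly where independence enters: for dependent trials, the joint probability would not factor this way.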

We have obtained the likelihood function $\mathcal{L}(\theta) = \theta^{3}(1 - \theta)$. Note that the likelihood function is particular to the set of observations $\mathcal{D}$. If we had a different set of observations, we'd have a different likelihood function. Once the likelihood function is found, we can use it as a yardstick to measure the "fitness" or "goodness" of candidate values for $\theta$:

| Candidate value for $\theta$ | $\mathcal{L}(\theta)$ |
|---|---|
| 0.5 | 0.0625 |
| 0.7 | 0.1029 |
| 0.9 | 0.0729 |

So out of the candidates {0.5, 0.7, 0.9}, 0.7 is the best parameter according to the likelihood function $\mathcal{L}(\theta)$. But how do we get the best estimate overall? We get it by maximizing $\mathcal{L}(\theta)$:

$$\hat{\theta} = \operatorname*{argmax}_{\theta \in [0, 1]} \mathcal{L}(\theta) = \operatorname*{argmax}_{\theta \in [0, 1]} \theta^{3} (1 - \theta) = \frac{3}{4} \quad (5)$$

(Setting the derivative $\frac{d\mathcal{L}}{d\theta} = 3\theta^{2}(1 - \theta) - \theta^{3} = \theta^{2}(3 - 4\theta)$ to zero gives $\theta = 3/4$.)

This value, 3/4, is called the Maximum Likelihood Estimate. It makes sense intuitively, since three-fourths of the observations were heads. We'd have obtained a different Maximum Likelihood Estimate if the observations had been different.
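The maximization in (5) can also be done numerically, which is how MLE is carried out when no closed-form solution exists. Here is a small sketch using a simple grid search over $[0, 1]$:

```python
def likelihood(theta):
    """L(theta) = theta^3 * (1 - theta), from equation (4)."""
    return theta ** 3 * (1.0 - theta)

# Evaluate L(theta) on a fine grid of candidate values and keep the best one.
grid = [i / 10000 for i in range(10001)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # 0.75, matching the analytic answer 3/4
```

A grid search is fine for a one-dimensional toy problem; real applications typically use gradient-based optimizers (often on the log-likelihood, which turns the product into a sum).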

Over the years, statisticians have studied MLE extensively. MLE has many nice properties, such as efficiency. MLE is also widely used in machine learning, since with MLE we can reduce the problem of parameter estimation to the problem of optimization. When introduced in the 1920s, MLE was criticized as computationally difficult, since most uses of MLE require numerical optimization¹. MLE grew more attractive over time, as computational power (multi-core CPUs! GPUs! Clusters!) and optimization algorithms² improved. Today MLE powers nooks and crannies of the machine learning enterprise.

1. Page 41, Computer Age Statistical Inference by Bradley Efron and Trevor Hastie (2016)
2. For example, see Convex Optimization by Stephen Boyd and Lieven Vandenberghe (2004)