In the previous post, we introduced the Accelerated Failure Time (AFT) model.

$$\ln{Y_i} = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id} + \sigma\varepsilon_i \qquad (1)$$

We’d now like to estimate the weights $w_1, w_2, \ldots, w_d$. We will use a well-known technique from statistics called Maximum Likelihood Estimation (MLE):

Principle of Maximum Likelihood Estimation. Given a set of observations, the parameter value that maximizes the likelihood of those observations is chosen as the “best” estimate for the parameter.

Whew! That’s a mouthful! In the remainder of this post, we’ll take some time to unpack this definition. Then in a subsequent post we’ll use MLE to estimate the weights in the AFT model (1).

To explain what MLE is doing, let’s take a simple example: suppose I give you a two-sided coin, and you’d like to know whether it’s a fair coin or a loaded coin. If it’s fair, heads and tails each come up with probability 1/2. If it’s not, the probability of heads will be some value other than 1/2. Let $\theta$ represent the (unknown) probability of heads. The behavior of the coin is modeled by the Bernoulli distribution: $X \sim \mathrm{Bernoulli}(\theta)$, where $X$ is the outcome of a coin throw.

Let’s also suppose we threw the coin 4 times in a row and obtained the following outcomes:

$$x_1 = 1, \quad x_2 = 1, \quad x_3 = 1, \quad x_4 = 0$$

where 1 indicates a head and 0 a tail. What is a good estimate of $\theta$? Should we set $\theta$ to 0.5 or 0.7 or 0.9?

MLE lets us decide among all possible candidates for $\theta$. Given the set of outcomes $x_1, x_2, x_3, x_4$, we will compute the MLE estimate of $\theta$. We first write the likelihood function $L(\theta)$. The likelihood for $\theta$ is the probability of obtaining the specific observations $x_1, x_2, x_3, x_4$ given a particular value of $\theta$:

$$L(\theta) = P(x_1, x_2, x_3, x_4 \mid \theta) \qquad (2)$$

($P$ signifies the probability of an event. “|” (vertical bar) is read as “given that.”)

The probability mass function (PMF) for the Bernoulli distribution is

$$f(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}, \quad x \in \{0, 1\} \qquad (3)$$

Plugging (3) into the likelihood function $L(\theta)$, we obtain

$$L(\theta) = \prod_{i=1}^{4} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^3 (1 - \theta) \qquad (4)$$

In deriving (4), we assumed that the trials were i.i.d. (independent and identically distributed), in that

- The 4 trials are independent of one another. For example, the outcome of $x_1$ does not affect the outcome of $x_2$.
- The 4 trials are all drawn from the same distribution, namely the Bernoulli distribution with parameter $\theta$.

In the coin flip case, independence is a reasonable assumption.

We have obtained the likelihood function $L(\theta) = \theta^3 (1 - \theta)$. Note that the likelihood function is particular to the set of observations $x_1, x_2, x_3, x_4$. If we had a different set of observations, we’d have a different likelihood function. Once the likelihood function is found, we can use it as a yardstick to measure the “fitness” or “goodness” of candidates for $\theta$:

| Candidate value for $\theta$ | $L(\theta)$ |
| --- | --- |
| 0.5 | 0.0625 |
| 0.7 | 0.1029 |
| 0.9 | 0.0729 |
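We can reproduce these likelihood values with a few lines of Python. The sequence of outcomes below (three heads, then a tail) is illustrative; only the counts of heads and tails matter for the likelihood:

```python
# Likelihood of observing three heads and one tail under Bernoulli(theta).
def likelihood(theta, observations=(1, 1, 1, 0)):
    result = 1.0
    for x in observations:
        result *= theta ** x * (1 - theta) ** (1 - x)
    return result

for theta in (0.5, 0.7, 0.9):
    print(f"L({theta}) = {likelihood(theta):.4f}")
```

Running this prints $L(0.5) = 0.0625$, $L(0.7) = 0.1029$, and $L(0.9) = 0.0729$ (note that $0.7^3 \times 0.3 = 0.1029$).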

So out of the candidates {0.5, 0.7, 0.9}, 0.7 is the best parameter according to the likelihood function $L(\theta)$. But how do we get the **best** estimate overall? We get it by maximizing $L(\theta)$:

$$\hat{\theta} = \operatorname*{arg\,max}_{\theta} L(\theta) = \operatorname*{arg\,max}_{\theta}\, \theta^3 (1 - \theta) = \frac{3}{4} \qquad (5)$$

This value, 3/4, is called the Maximum Likelihood Estimate. It makes sense intuitively, since three-fourths of the observations were heads. We’d have a different Maximum Likelihood Estimate if the observations were different.
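We can sanity-check the maximization in (5) numerically. For a one-dimensional problem like this, a simple grid search over $\theta \in (0, 1)$ is enough (in practice you’d use calculus or a numerical optimizer):

```python
# Find the theta in (0, 1) that maximizes L(theta) = theta^3 * (1 - theta).
def likelihood(theta):
    return theta ** 3 * (1 - theta)

candidates = [i / 10000 for i in range(1, 10000)]
theta_hat = max(candidates, key=likelihood)
print(theta_hat)  # 0.75
```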

Over the years, statisticians have studied MLE extensively. MLE has many nice properties, such as efficiency. MLE is also widely used in machine learning, since it reduces the problem of parameter estimation to a problem of optimization. When introduced in the 1920s, MLE was criticized as computationally difficult, as most uses of MLE require numerical optimization^{1}. MLE grew more attractive over time, as computational power (multi-core CPUs! GPUs! Clusters!) and optimization algorithms^{2} improved. Now MLE powers nooks and crannies of the machine learning enterprise.

Survival analysis is a huge topic in statistics. For a comprehensive survey, see this article from ACM. In this post, we will cover one popular model known as **Accelerated Failure Time (AFT)**. The AFT model makes the following key assumptions:

Accelerated Failure Time assumptions:

1. A unit increase in an input feature multiplies the survival time by a constant factor.

2. The effects of features on the log survival time are additive.

3. Noise in the training data is random and does not depend on any particular data point.

In math, we express the AFT assumption as follows:

$$\ln{Y_i} = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id} + \sigma\varepsilon_i \qquad (1)$$

where

- $Y_i$ is the (true) label for the $i$-th data point.
- $x_{ij}$ is the value of the $j$-th feature for the $i$-th data point.
- $w_j$ is the weight (coefficient) associated with the $j$-th feature.
- $\varepsilon_i$ is random Gaussian noise drawn from the normal distribution with mean 0 and standard deviation 1, i.e. $\varepsilon_i \sim \mathcal{N}(0, 1)$. We assume that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d.
- $\sigma$ is a parameter that scales the size of the Gaussian noise $\varepsilon_i$.
- $d$ is the number of features available in the training data.
- $n$ is the size of the training data.
- $\ln$ is the natural logarithm.
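To build intuition for model (1), here is a sketch that simulates data from it with NumPy. The weights, $\sigma$, and the data sizes are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, d = 1000, 3                  # size of training data, number of features
w = np.array([0.4, -0.2, 0.1])  # hypothetical weights
sigma = 0.5                     # scale of the Gaussian noise

X = rng.normal(size=(n, d))     # feature values x_ij
eps = rng.normal(size=n)        # i.i.d. draws from N(0, 1)
log_y = X @ w + sigma * eps     # the AFT model: ln(Y_i) = sum_j w_j * x_ij + sigma * eps_i
y = np.exp(log_y)               # survival times are always positive
```

Note that exponentiating the model guarantees $Y_i > 0$, which is exactly what we want for a survival time.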

Let’s make a few observations. First, features have a multiplicative effect on the survival time. Suppose we modified the $j$-th feature in the $i$-th training data point from $x_{ij}$ to $x_{ij} + 1$ (added a unit) while keeping the other features the same. Then the new value $Y_i^{\text{new}}$ is $e^{w_j}$ times the old value:

$$\ln{Y_i^{\text{new}}} = w_1 x_{i1} + \cdots + w_j (x_{ij} + 1) + \cdots + w_d x_{id} + \sigma\varepsilon_i = \ln{Y_i} + w_j \qquad (2)$$

$$Y_i^{\text{new}} = e^{w_j} \cdot Y_i \qquad (3)$$

Second, the effects of features on the log survival time are additive. That is, if we increase the values of two features $x_{ij}$ and $x_{ik}$ simultaneously, their effects add on top of each other. Define $x_{ij}^{\text{new}} = x_{ij} + 1$ and $x_{ik}^{\text{new}} = x_{ik} + 1$, and $Y_i$ is multiplied by $e^{w_j + w_k}$:

$$\ln{Y_i^{\text{new}}} = \ln{Y_i} + w_j + w_k \qquad (4)$$

$$Y_i^{\text{new}} = e^{w_j + w_k} \cdot Y_i \qquad (5)$$
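Equations (3) and (5) are easy to verify numerically. The weights and feature values below are hypothetical, and the noise term is held fixed since it does not change when we modify a feature:

```python
import math

w = [0.4, -0.2, 0.1]   # hypothetical weights
x = [1.0, 2.0, 0.5]    # hypothetical feature values for one data point
noise = 0.3            # fixed value of sigma * eps_i for this data point

def survival_time(features):
    return math.exp(sum(wj * xj for wj, xj in zip(w, features)) + noise)

y_old = survival_time(x)
y_new = survival_time([x[0] + 1.0, x[1], x[2]])             # bump feature 0 by one unit
print(math.isclose(y_new / y_old, math.exp(w[0])))          # True: Y is multiplied by e^{w_0}

y_both = survival_time([x[0] + 1.0, x[1] + 1.0, x[2]])      # bump features 0 and 1 together
print(math.isclose(y_both / y_old, math.exp(w[0] + w[1])))  # True: effects add in the exponent
```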

Third, the error term $\varepsilon_i$ is independent of the choice of data point $i$. The terms $\varepsilon_1, \ldots, \varepsilon_n$ are independently drawn from the standard normal distribution $\mathcal{N}(0, 1)$. This means that we can learn absolutely nothing about $\varepsilon_i$ from either the data point $\mathbf{x}_i$ or the other error terms $\varepsilon_k$ ($k \neq i$). In a future post, we will use this assumption to simplify the task of computing the maximum likelihood estimate for the weights $w_1, \ldots, w_d$.

Lastly, it is straightforward to make predictions with the AFT model. Given a previously unseen data point $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, we compute the point estimate of the log survival time as follows (the noise term drops out, since it has mean zero):

$$\widehat{\ln{Y}} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d \qquad (6)$$
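As a quick sketch of (6) in code, using made-up weights and a made-up data point:

```python
import math

w = [0.4, -0.2, 0.1]     # hypothetical learned weights
x_new = [1.0, 0.0, 2.0]  # previously unseen data point

log_y_hat = sum(wj * xj for wj, xj in zip(w, x_new))  # point estimate of ln(Y)
y_hat = math.exp(log_y_hat)                           # predicted survival time
print(round(log_y_hat, 6))  # 0.6
```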

We’ve made quite a few assumptions here and there that make the AFT model simple to understand and use. Unfortunately, the real world is messy, and sometimes not all of these assumptions are justified. For example, what if two features interact with each other, negatively or positively? Then the effects of feature increases won’t be additive. Or what if the error term is not i.i.d.? For example, it is known that clinical trials have a sex bias, with more men enrolled than women. In that case we’d have good reason to suspect that the error term has larger variance for women. The problem of non-i.i.d. error terms is well studied in the field of econometrics^{1}. For now, I’ll gloss over this issue, since I don’t know much econometrics. The other issue, the non-additive interaction of features, will be addressed in a future post. (Hint: use decision trees!)

Survival analysis is a discipline within statistics where the statistician models the distribution of **time to an event of interest**. The rest of this post will unpack this definition.

Survival analysis is a special kind of regression and differs from the conventional regression task as follows:

- The label is always positive, since you cannot wait a negative amount of time until the event occurs.
- The label may not be fully known, or **censored**, because “it takes time to measure time.”

The second bullet point is crucial, and we should dwell on it a bit more. As you may have guessed from the name, one of the earliest applications of survival analysis was to model mortality of a given population. Let’s take the NCCTG Lung Cancer Dataset as an example. The first 8 columns represent features^{1} and the last column, **Time to death**, represents the label.

| Inst | Age | Sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss | Time to death (days) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 74 | 1 | 1 | 90 | 100 | 1175 | N/A | 306 |
| 3 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 | 455 |
| 3 | 56 | 1 | 0 | 90 | 90 | N/A | 15 | [1010, +∞) |
| 5 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 | 210 |
| 1 | 60 | 1 | 0 | 100 | 90 | N/A | 0 | 883 |
| 12 | 74 | 1 | 1 | 50 | 80 | 513 | 0 | [1022, +∞) |
| 7 | 68 | 2 | 2 | 70 | 60 | 384 | 10 | 310 |
| … | … | … | … | … | … | … | … | … |

Take a close look at the label for the third patient. **His label is a range, not a single number.** The third patient’s label is said to be **censored**, because for some reason the experimenters could not get a complete measurement for that label. One possible scenario: the patient survived the first 1010 days and walked out of the clinic on the 1011th day, so his death was not directly observed. Another possibility: the experiment was cut short (since you cannot run it forever) before his death could be observed. In any case, his label is [1010, +∞), meaning his time to death can be any number higher than 1010, e.g. 2000, 3000, or 10000.

There are four kinds of censoring for labels:

- **Uncensored**: the label is not censored and is given as a single number.
- **Right-censored**: the label is of the form $[a, +\infty)$, where $a$ is the lower bound.
- **Left-censored**: the label is of the form $[0, b]$, where $b$ is the upper bound.
- **Interval-censored**: the label is of the form $[a, b]$, where $a$ and $b$ are the lower and upper bounds, respectively.

Right censoring is the most common type of censoring.
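One common way to represent all four censoring types in code is to store a pair of bounds per label, using $+\infty$ for a missing upper bound; an uncensored label simply has equal bounds. Here is a generic sketch of this idea (the values are illustrative, and this is not any particular library’s API):

```python
import numpy as np

inf = np.inf
# One (lower, upper) pair per label:
#   uncensored, uncensored, right-censored, left-censored, interval-censored
lower = np.array([306.0, 455.0, 1010.0,   0.0, 512.0])
upper = np.array([306.0, 455.0,    inf, 200.0, 800.0])

is_uncensored = lower == upper
print(is_uncensored.tolist())  # [True, True, False, False, False]
```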

In the following posts, we will discuss how we can train a statistical model even in the presence of censored labels.

Also: see this excellent article from Uber, which uses survival analysis to predict the duration between a user’s first and second rides.

**WordPress.com vs self-hosting WordPress**. I initially tried wordpress.com because it was easy to set up and I wouldn’t have to worry about the cost of hosting. However, I soon ran into a significant limitation: you are not allowed to install WordPress plugins. True, I could remove this limitation by buying the Business plan ($25/month), but at that price I might as well pay for hosting the blog myself. So I installed WordPress on the web host I was already using for my personal website (NearlyFreeSpeech.net). The setup was a breeze, and I had blog.hyunsu-cho.io 20 minutes later.

**The QuickLaTeX plugin is awesome**. Why did I care so much about whether I could install plugins? Given the nature of this blog, I wanted top-quality presentation of mathematical formulas, so I needed a good way to embed LaTeX code in the posts. WordPress.com already has good support for inline formulas (see here), but unfortunately it lacks cross-references and does not allow embedding some “fancy” LaTeX packages I ended up needing (e.g. algorithmicx). But then I found the QuickLaTeX plugin and was blown away. I can use not only cross-references and equation numbers but also fancy LaTeX packages to typeset pseudocode. See for yourself:

**Writing is not easy, but it is fun**. I thought I was pretty decent at writing. But as I wrote the first posts, I found myself re-writing sentences over and over again. The first time around, sentences would come out awkward; with each revision they’d get better and read more naturally. So writing is work. But it’s fun too: I get to organize what I learned. Somehow writing down my thoughts makes them clearer.

**How did I get involved?** On March 8, Toby Hocking^{1} floated the idea of co-mentoring a student to work on XGBoost. Eager to get anyone to contribute to XGBoost, I took him up on the offer.

**What kind of work did we do?** From May 6 to September 3, I mentored Avinash Barnwal^{2} in adding a new objective function called Accelerated Failure Time (AFT). A well-known model in the field of statistics, AFT is a popular choice for survival analysis, i.e. modeling time to an event. See Avinash’s post for a summary.

**Show me the code!** See https://github.com/dmlc/xgboost/pull/4763

**Personal takeaway** Before GSoC, I had no idea what survival analysis was, let alone AFT. Thankfully, Avinash gave me a bunch of papers to get me up to speed. I also got to ask him many questions on the subject of survival analysis. I intend to write a series of posts to summarize what I learned this summer.

Hence this blog. This blog is meant to be an accessible and comprehensive collection of everything XGBoost. Also, it would serve as study notes for myself, as I review and study different concepts.
