What is survival analysis?

Survival analysis is a discipline within statistics where the statistician models the distribution of time to an event of interest. The rest of this post will unpack this definition.

Survival analysis is a special kind of regression and differs from the conventional regression task as follows:

  • The label is always positive, since you cannot wait a negative amount of time until the event occurs.
  • The label may not be fully known, or censored, because “it takes time to measure time.”

The second bullet point is crucial and we should dwell on it more. As you may have guessed from the name, one of the earliest applications of survival analysis is to model mortality of a given population. Let’s take NCCTG Lung Cancer Dataset as an example. The first 8 columns represent features1 and the last column, Time to death, represents the label.

InstAgeSexph.ecogph.karnopat.karnomeal.calwt.lossTime to death (days)
37411901001175N/A306
368109090122515455
356109090N/A15[1010, \mathbf{+\infty})
557119060115011210
1601010090N/A0883
12741150805130[1022, \mathbf{+\infty})
76822706038410310

Take a close look at the label for the third patient. His label is a range, not a single number. The third patient’s label is said to be censored, because for some reason the experimenters could not get a complete measurement for that label. One possible scenario: the patient survived the first 1010 days and walked out of the clinic on the 1011th day, so his death was not directly observed. Another possibility: The experiment was cut short (since you cannot run it forever) before his death could be observed. In any case, his label is [1010, \mathbf{+\infty}), meaning his time to death can be any number that’s higher than 1010, e.g. 2000, 3000, or 10000.

There are four kinds of censorship in labels:

  • Uncensored: the label is not censored and given as a single number.
  • Right-censored: the label is of form [a, +\infty), where a is the lower bound.
  • Left-censored: the label is of form (-\infty, b], where b is the upper bound.
  • Interval-censored: the label is of form [a, b], where a and b are the lower and upper bounds, respectively.

Right censoring is the most commonly used censorship type.

In the following posts, we will discuss how we can train a statistical model even with the presence of censored labels.

Also: see this excellent article from Uber, where Uber is using survival analysis to predict the duration between the first and second rides.

  1. Also known as covariates in statistics literature. As I am a computer scientist by training, I will consistent use the term “feature” to refer to an input variable.

Leave a Reply

Your email address will not be published. Required fields are marked *