Accelerated Failure Time model

Survival analysis is a huge topic in statistics. For a comprehensive survey, see this article from ACM. In this post, we will cover one popular model known as Accelerated Failure Time (AFT). The AFT model makes the following key assumptions:

Accelerated Failure Time assumption

1. A unit increase in an input feature multiplies the survival time by a constant factor (equivalently, it adds a constant amount to the log survival time).
2. The effects of features on the log survival time are additive.
3. Noise in the training data is random and does not depend on any particular data point.

In math, we express the AFT assumption as follows:

(1)   \begin{equation*}\ln{(y_i)} = \sum_{j=1}^d w_j x_{ij} + \sigma\epsilon_i \quad\text{for each }i\in \{1, 2, \ldots, n\}\end{equation*}

where

  • y_i is the (true) label, i.e. the survival time, for the i^{\mathrm{th}} data point.
  • x_{ij} is the value of j^{\mathrm{th}} feature for the i^{\mathrm{th}} data point.
  • w_j is the weight (coefficient) associated with the j^{\mathrm{th}} feature.
  • \epsilon_i is random Gaussian noise drawn from the normal distribution with mean 0 and standard deviation 1, i.e. \epsilon_i \sim \mathcal{N}(0, 1). We assume that \epsilon_1, \epsilon_2, \ldots, \epsilon_n are i.i.d.
  • \sigma is a parameter that scales the size of the Gaussian noise \epsilon_i.
  • d is the number of features available in the training data.
  • n is the size of training data.
  • \ln{(*)} is the natural logarithm.
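To make the data-generating process concrete, here is a minimal NumPy sketch that simulates survival times under equation (1). The weights, feature matrix, and noise scale are all made-up values for illustration, not anything fitted to real data:

import numpy as np

rng = np.random.default_rng(seed=0)

n, d = 1000, 3                     # n data points, d features
X = rng.normal(size=(n, d))        # feature matrix: X[i, j] holds x_ij
w = np.array([0.5, -1.0, 0.25])    # weights w_j (chosen arbitrarily)
sigma = 0.3                        # scale of the Gaussian noise

eps = rng.normal(size=n)           # epsilon_i ~ N(0, 1), i.i.d.
log_y = X @ w + sigma * eps        # ln(y_i) = sum_j w_j x_ij + sigma * epsilon_i
y = np.exp(log_y)                  # survival times; positive by construction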

Let’s make a few observations. First, consider the multiplicative effect of each feature on the survival time. Suppose we modified the k^{\mathrm{th}} feature in the i^{\mathrm{th}} training data point from x_{ik} to x_{ik}' = x_{ik} + 1 (added one unit) while keeping the other features the same. Then the new value of y_i is \exp{(w_k)} times the old value:

(2)   \begin{align*}\ln{(y_i')} &= \sum_{j\neq k} w_j x_{ij} +w_k x_{ik}'  + \sigma\epsilon_i\\&= \sum_{j\neq k} w_j x_{ij} +w_k (x_{ik} + 1)  + \sigma\epsilon_i\\&= \sum_{j\neq k} w_j x_{ij} +w_k x_{ik} + w_k + \sigma\epsilon_i\\&= \sum_{j = 1}^d w_j x_{ij} + \sigma\epsilon_i + w_k\\&= \ln{(y_i)} + w_k\end{align*}

(3)   \begin{equation*}\therefore y_i' = y_i \cdot \exp{(w_k)}\end{equation*}
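We can check equation (3) numerically. Continuing from the simulation sketch above (and keeping the same noise draw eps), adding one unit to a single feature multiplies every survival time by \exp{(w_k)}:

k = 1                              # index of the feature we modify
X_new = X.copy()
X_new[:, k] += 1.0                 # x_ik -> x_ik + 1; other features unchanged

y_new = np.exp(X_new @ w + sigma * eps)       # same noise as before
assert np.allclose(y_new / y, np.exp(w[k]))   # ratio is exp(w_k) for every i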

Second, the effects of features on the log survival time are additive. That is, if we increase the values of the k_1^{\mathrm{th}} and k_2^{\mathrm{th}} features simultaneously, their effects add on top of each other. Define x_{i k_1}' = x_{i k_1} + 1 and x_{i k_2}' = x_{i k_2} + 1; then y_i is multiplied by \exp{(w_{k_1})} \exp{(w_{k_2})}:

(4)   \begin{align*}\ln{(y_i')} &= \sum_{j\notin \{k_1, k_2\}} w_j x_{ij} +w_{k_1} x_{i k_1}'  + w_{k_2} x_{i k_2}'  + \sigma\epsilon_i\\&= \sum_{j\notin \{k_1, k_2\}} w_j x_{ij} +w_{k_1} x_{i k_1}  + w_{k_2} x_{i k_2} + w_{k_1} + w_{k_2} + \sigma\epsilon_i\\&= \sum_{j = 1}^d w_j x_{ij} + \sigma\epsilon_i + w_{k_1} + w_{k_2}\\&= \ln{(y_i)} + w_{k_1} + w_{k_2}\end{align*}

(5)   \begin{equation*}\therefore y_i' = y_i \cdot \exp{(w_{k_1})}  \exp{(w_{k_2})}\end{equation*}
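The same kind of check confirms equation (5): bumping two features at once multiplies y_i by the product of the two factors (again continuing from the sketch above):

k1, k2 = 0, 2
X_two = X.copy()
X_two[:, [k1, k2]] += 1.0          # increase both features by one unit

y_two = np.exp(X_two @ w + sigma * eps)
assert np.allclose(y_two / y, np.exp(w[k1]) * np.exp(w[k2]))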

Third, the error term \epsilon_i is independent of the choice of data point i. The terms \epsilon_1, \epsilon_2, \ldots, \epsilon_n are independently drawn from the standard normal distribution \mathcal{N}(0, 1). This means that we can learn absolutely nothing about \epsilon_i from either the data point \mathbf{x}_i = (x_{i1}, \ldots, x_{id}) or other error terms \epsilon_{*}. In a future post, we will use this assumption to simplify the task of computing the maximum likelihood estimate for the weights w_{*}.

Lastly, it is straightforward to predict with the AFT model. Given a previously unseen data point \mathbf{x}^{\mathrm{new}} = (x_1^{\mathrm{new}}, \ldots, x_{d}^{\mathrm{new}}), we compute the point estimate of the survival time \hat{y}^{\mathrm{new}} as follows:

(6)   \begin{equation*}\hat{y}^{\mathrm{new}} = \exp{\left(\sum_{j=1}^d w_j x_{j}^{\mathrm{new}}\right)}\end{equation*}
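In code, the point prediction is a one-liner. Here is a minimal sketch that assumes we already have fitted weights w (for instance, the array from the simulation above); the noise term is dropped because it has mean zero on the log scale:

def predict_survival_time(w, x_new):
    """Point estimate of the survival time under the AFT model, equation (6)."""
    return np.exp(np.dot(w, x_new))

x_new = np.array([0.2, -0.4, 1.0])   # a hypothetical unseen data point
y_hat = predict_survival_time(w, x_new)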

We’ve made quite a few assumptions here and there that make the AFT model simple to understand and use. Unfortunately, the real world is messy, and not all of these assumptions are always justified. For example, what if two features interact with each other, negatively or positively? Then the effects of feature increases won’t be additive. Or what if the error terms \epsilon_{*} are not i.i.d.? For example, it is known that clinical trials have a sex bias, with more men enrolled than women. In that case we’d have good reason to suspect that the error term \epsilon_{*} has larger variance for women. The problem of non-i.i.d. error terms is well studied in the field of econometrics [1]. For now, I’ll gloss over this issue, since I don’t know much about econometrics. The other issue, the non-additive interaction of features, will be addressed in a future post. (Hint: use decision trees!)

  1. For example, see Chapters 7 and 8 of A Guide to Econometrics (6th ed.) by Peter Kennedy.
