II. Probabilistic Programming: Survival Analysis

Introduction

This is a second article in the series of Probabilistic Programming for Actuarial Science. The first article introduced the Bayesian approach to inferring risk rates and predicting future claims in an actuarial context and built a practical model in Greta, a probabilistic programming language, to conduct the analysis.

In a beta binomial model, a claim can occur any time during a fixed period. In that model, we know only whether or not a claim occurred during a period, but not when the claim occurred. We now expand this risk model to include time as an additional dimension, and develop a “survival analysis” model. These survival analysis models are also called “time-to-event” models. The models are useful for modeling business where the risk of a claim changes over time. Such models are ubiquitous in actuarial science and can be used for both life and health as well as P&C risks.

The Survival Analysis Model

Practically, we will now implement a “piecewise constant hazard model” (PCH) using Greta to demonstrate the method. A PCH model assumes that the hazard rate (hx) stays constant over a defined interval of time and then will change in the next interval of time and so on. This is one of the most useful models in actuarial science as it can be used for determining all sorts of tabular risk rates such as mortality, lapse, termination, incidence and so on where these risk rates change over time. For example, mortality changes by age and duration; termination rates change by duration. The PCH model need not be unidimensional but can be constructed over multiple dimensions. A mortality table, for example, can be specified over five dimensions: age, duration, calendar year, sex, and smoking status. The PCH model is therefore extremely flexible as it provides a means of generating rates that can fit into a standard actuarial specification and can be easily incorporated into spreadsheets and pricing systems.

For each interval, the basic PCH model assumes:

\(Pr(event~between~time~0~and~t) = 1-e^{-h_x.t} \dots (i)\)

or alternatively:

\(Pr(no~event~between~time~0~and~t) = e^{-h_x.t} \dots (ii)\)

\(h_x\) is the hazard rate per unit time for a particular interval, and t is the time measured from the start of that interval. For example, the force of mortality is a hazard rate. The PCH model assumes this hazard rate remains constant over a defined period with the possibility of multiple defined periods.

For example, suppose there are \(C=10\) intervals. Therefore, there would be 10 \(h_{x,c}\) rates to estimate. We can specify a vector \(\theta\) of these rates such that \(\theta = \{{h_{x,c} | c=1, \dots, C}\}\).

The first inference step is to estimate the 10 \(h_{x,c}\) values. Using Bayes from formula:

\[ \begin{aligned} Pr(\theta | D) = \frac{Pr(D | \theta). Pr(\theta)}{Pr(D)} \dots (i) \end{aligned} \]

we need to specify the probability of event \(D\) given \(\theta\), namely \(Pr(D | \theta)\) (called the “likelihood”) and \(Pr(\theta)\) (called the “prior” distribution). It turns out that \(Pr(D)\) which is the component of the Bayes formula that is seldom analytically tractable need not be specified.

Likelihood Function

Under this model, the likelihood function is the probability density function (PDF) of \(h_{x,c}\). When a claim occurs this is: \[ \begin{aligned} \frac{d}{dt} Pr(\text{event prior to time }t) &= \frac{d}{dt}\{1-e^{-h_x.t}\}\\ &=h_x.e^{-h_x.t} \end{aligned} \]

When no claim occurs between time 0 and time t in an interval, the likelihood is \(Pr(no~event~prior~to~time~t)=e^{-h_x.t}\). Compactly, the likelihood can be written as:

\[ \begin{aligned} Likelihood(h_{x,c}) &= h_{x,c}^\delta. e^{-h_{x,c}.t} \dots (iii)\\ \text{where } delta &= \begin{cases} \ 1 \text{ for a claim}\\ \ 0 \text{ for no claim}\\ \end{cases} \end{aligned} \]

The historic data, \(D\), is now defined for each record, \(i\), as the pair \(\{\delta_i, t_i\}\) for each insured in each interval (this will become clear in the example below).

And now for a beautiful and fortuitous mathematical result: We compare the likelihood defined above in equation (iii) to a Poisson likelihood. If \(X \sim Poisson(\lambda)\), then the likelihood function for this Poisson parameter is:

\[ \begin{aligned} Likelihood(\lambda) = \frac{\lambda ^ x.e^{-\lambda}}{x!}\dots (iv) \end{aligned} \]

In our case, \(\delta\) can only take on a value of 0 or 1, so it is clearly cannot be Poisson distributed, since the Poisson distribution places no upper bound on \(x\). But restricting \(x\) to 0 or 1, the likelihood function for \(h_{x,c}\) has the same mathematical form as the likelihood function for \(\lambda\). This can be seen by comparing formulae \((iii)\) and \((iv)\). Therefore, for computational modeling purposes, we can treat the delta values as if they were Poisson distributed, with parameter \(\lambda = h_{x,c}.t\). Here the data, \(D\), are the \(\delta\) values, namely 1 for an event, 0 for no event, and the \(t\) values which reflect the time each insured spends in a period until either a claim or an exit from the period occurs. This mathematical trick allows us to describe the data as Poisson distributed and use the built-in commands for a Poisson distribution in Greta to solve for our risk rates.

Finally, there is one additional requirement for specifying the likelihood: since \(h_{x,c}\) must be positive, We can ensure this restriction by modeling \(h_{x,c} = e^{\alpha_c}\), where the components of the \(\alpha_c\) vector can have positive or negative values.

In summary, the likelihood can be defined as \(\delta \sim Poisson (h_{x,c}.t)\) even though \(\delta\) will only take on 0 or 1 values.

The Prior Distribution

The final part of the model simply requires specifying the prior distribution for the \(\alpha_c\) parameters. There are many ways to do this, but one approach would be to set the the distribution of \(\alpha_c\) as \(\alpha_c \sim Normal(\mu_{\alpha}, \sigma_{\alpha})\). We will have more to say about picking values for \(\mu_{\alpha}\) and \(\sigma_{\alpha}\) later.

Programming the Model in Greta

We now implement the PCH model on a database showing mortality rates following a mastectomy. See http://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/mastectomy.html

The data describes the risk of dying after a patient receives a mastectomy. The key questions to ask are how this risk changes over time and does the risk change if the mastectomy is coincident with a metastasized tumor.

Load Data

# Time Series Analysis of Survival from Breast Cancer

library(greta)
library(ggplot2)
library(dplyr)
library(purrr)

# Set Directories
working_dir = paste0(getwd(), "/")
data_dir = paste0(working_dir, "data/")

# Load Data
mastectomy = read.delim(file = paste0(data_dir, "mast.txt"))
head(mastectomy)
##   time event metastized
## 1   23  TRUE         no
## 2   47  TRUE         no
## 3   69  TRUE         no
## 4   70 FALSE         no
## 5  100 FALSE         no
## 6  101 FALSE         no

The data has the following fields:

  • time = time until an event following a mastectomy
  • event = did a death occur or not. This field provides the \(\delta\) value
  • metastasized = was the mastectomy coincident with a metastasized tumor

The next step is to “chop” the data up into intervals where each interval has its own hazard rate. In this case, we arbitrarily pick intervals of 20 months. Each “step” will consist of a single interval, for example, the first time interval is from time 1 to 20, the second from time 21 to 40 and so on.

# Time Series Analysis of Survival from Breast Cancer

# Set time intervals
max_time = 230
time_interval = 20
no_intervals = ceiling(max_time / time_interval)+1 -1 #last category has no values

# Create "long" table with data split into each possible time cell
breaks = seq(from=0, length.out = no_intervals, by = time_interval)
times = pmax(0, pmin(time_interval, 17 -breaks))

breaks_mat = matrix(rep(breaks, times = nrow(mastectomy)), nrow = nrow(mastectomy), byrow = T)
times_mat = mastectomy$time-breaks_mat
times_mat = apply(times_mat,1, function(x) pmin(time_interval, x))
times_mat = apply(times_mat,1, function(x) pmax(0, x))
delta_mat = (times_mat < time_interval) & (times_mat != 0) & mastectomy$event

mast_long = data.frame(patient = rep(1:nrow(mastectomy), each = no_intervals),
                       interval = rep(1:no_intervals, times = nrow(mastectomy)),
                       time = c(t(times_mat)),
                       delta = ifelse(c(t(delta_mat)),1,0),
                       metastized = rep(mastectomy$metastized, each = no_intervals)) %>%
  filter(time != 0)
head(mast_long)
##   patient interval time delta metastized
## 1       1        1   20     0         no
## 2       1        2    3     1         no
## 3       2        1   20     0         no
## 4       2        2   20     0         no
## 5       2        3    7     1         no
## 6       3        1   20     0         no

The resulting database now shows, for example, that the first patient survived 20 months in interval 1 and then died after 3 months in interval 2. Patient 2 survived 20 months in each of interval 1 and interval 2 and then died after 7 months in interval 3.

The Greta Model

A Greta model has three basic steps to set up the model: (i) define the training data, (ii) define the prior, and (iii) define he likelihood.

  • The training data is identified using the “as_data” function.
  • The prior for \(\alpha\) is set at \(\alpha \sim Normal(\mu=-6,\sigma=10)\)
  • The likelihood for each record is set as \(\delta \sim Poisson(\lambda = t.e^{\alpha})\) as per the discussion above.

We first need to define the prior distribution.

In this case, we have little “prior knowledge” of an appropriate value for \(\alpha_c\) other than that the monthly hazard rate is likely very low, so we set a “weak” prior distribution of \(\alpha_c \sim Normal(-6, 2)\), which means we think there is a 95% chance that alpha will fall within the approximate range \(-9.92=-6-1.96*2\) to \(-2.08=-6+1.96*2\), and the hazard rate falls in the range of \(e^{-9.92}\) to \(e^{-2.08}\) or 4.9e-05 to 0.12. This is essentially a guess but with the restriction that we would be very surprised to see a period hazard rate above 12%.

We are now ready to specify the model in Greta

  # Define the data inputs (D)
  delta=as_data(mast_long$delta)
  time = as_data(mast_long$time)
  interval_num = as_data(mast_long$interval)

  # Define the prior for the alpha parameter
  
  # Set the prior to have a low probability of death, namely exp(-mu_alpha) is small
  mu_alpha=-6 
  sd_alpha=2 
  # this establishes a vector or alpha prior values
  alpha = normal(mean = mu_alpha, sd = sd_alpha, dim = no_intervals) 
  

  # Define the likelihood for the model
  hazard = exp(alpha)
  lambda = time * hazard[interval_num]
  
  # delta ~ Poisson(lambda)
  distribution(delta) = poisson(lambda)  

That’s it! By carefully constructing the training data, the model can be structured as a Poisson distribution and solved with the appropriate Greta built-in functions.

The key advantage of a probabilistic programming language is that you only need to specify the prior and likelihood. There is no need to program any of the numerical techniques to estimate the posterior distribution of \(\theta\). Greta abstracts this part away, and the numerical techniques become mostly a black box.

We can now establish and compile the model:

  # Establish the model
  m= model(hazard)
  #plot(m)

The above chart summarizes each step in the Bayesian model.

We now run the model and extract the simulated hazard or \(h_{x,c}\) parameters using Markov chain Monte Carlo (MCMC). MCMC is the numerical technique that is used to solve the Bayes formula for the desired parameters. See the Greta website on how to choose the chains, n_samples and warmup parameters.

  draws = mcmc(m, chains = 10,  n_samples = 500, warmup = 500)
## 
## running 10 chains simultaneously on up to 12 cores
## 
    warmup                                            0/500 | eta:  ?s          
    warmup ====                                      50/500 | eta: 15s          
    warmup ========                                 100/500 | eta: 11s          
    warmup ============                             150/500 | eta: 10s          
    warmup ================                         200/500 | eta:  8s          
    warmup ====================                     250/500 | eta:  7s | 5% bad 
    warmup ========================                 300/500 | eta:  5s | 5% bad 
    warmup ============================             350/500 | eta:  4s | 4% bad 
    warmup ================================         400/500 | eta:  3s | 3% bad 
    warmup ====================================     450/500 | eta:  1s | 3% bad 
    warmup ======================================== 500/500 | eta:  0s | 3% bad 
## 
  sampling                                            0/500 | eta:  ?s          
  sampling ====                                      50/500 | eta: 14s          
  sampling ========                                 100/500 | eta: 12s          
  sampling ============                             150/500 | eta: 10s          
  sampling ================                         200/500 | eta:  9s          
  sampling ====================                     250/500 | eta:  7s          
  sampling ========================                 300/500 | eta:  6s          
  sampling ============================             350/500 | eta:  4s          
  sampling ================================         400/500 | eta:  3s          
  sampling ====================================     450/500 | eta:  1s          
  sampling ======================================== 500/500 | eta:  0s
  summary(draws)
## 
## Iterations = 1:500
## Thinning interval = 1 
## Number of chains = 10 
## Sample size per chain = 500 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                   Mean       SD  Naive SE Time-series SE
## hazard[1,1]  0.0058706 0.002586 3.657e-05      1.017e-04
## hazard[2,1]  0.0080694 0.003270 4.625e-05      8.436e-05
## hazard[3,1]  0.0080516 0.003538 5.003e-05      1.073e-04
## hazard[4,1]  0.0080420 0.004129 5.839e-05      1.470e-04
## hazard[5,1]  0.0009238 0.001198 1.695e-05      6.496e-05
## hazard[6,1]  0.0054595 0.003708 5.245e-05      1.211e-04
## hazard[7,1]  0.0014102 0.001697 2.400e-05      5.620e-05
## hazard[8,1]  0.0083521 0.005830 8.245e-05      2.434e-04
## hazard[9,1]  0.0017292 0.002450 3.465e-05      9.051e-05
## hazard[10,1] 0.0061141 0.005892 8.332e-05      1.861e-04
## hazard[11,1] 0.0029187 0.004451 6.294e-05      1.972e-04
## hazard[12,1] 0.0075331 0.016818 2.378e-04      3.971e-04
## 
## 2. Quantiles for each variable:
## 
##                   2.5%       25%       50%      75%    97.5%
## hazard[1,1]  2.065e-03 0.0040123 0.0054769 0.007224 0.011970
## hazard[2,1]  2.934e-03 0.0057565 0.0076480 0.009935 0.015717
## hazard[3,1]  2.711e-03 0.0054346 0.0075537 0.010092 0.016328
## hazard[4,1]  2.178e-03 0.0049295 0.0072840 0.010407 0.018220
## hazard[5,1]  2.451e-05 0.0001611 0.0005001 0.001208 0.004370
## hazard[6,1]  7.614e-04 0.0027489 0.0046676 0.007243 0.014878
## hazard[7,1]  2.700e-05 0.0002912 0.0008283 0.001888 0.005937
## hazard[8,1]  1.017e-03 0.0041183 0.0070117 0.011143 0.023081
## hazard[9,1]  3.519e-05 0.0002894 0.0008172 0.002165 0.008270
## hazard[10,1] 4.521e-04 0.0021597 0.0042804 0.008091 0.022231
## hazard[11,1] 3.894e-05 0.0004119 0.0012699 0.003546 0.014733
## hazard[12,1] 4.492e-05 0.0005320 0.0019234 0.006628 0.053007

The above found a set of 12 \(h_{x,c}\) values. These hazard rates are summarized below and the 95% credible intervals are determined.

We now plot a summary of the results

quantile_col = function(mat, probs = 0.025){
  apply(mat, 2, function (x) quantile(x, probs = probs))
}


# Summarize Data

draws_hazard = do.call(rbind, draws) %>% 
  as_tibble() %>%
  select(starts_with("hazard[")) %>%
  as.matrix()

mast_plot = data.frame(time = 1:no_intervals, 
                       hazard = colMeans(draws_hazard),
                       lower = quantile_col(draws_hazard, 0.1),
                       upper = quantile_col(draws_hazard, 0.9))

ggplot(mast_plot) + theme_bw()+ 
  geom_point(aes(time, hazard)) + 
  #coord_cartesian(ylim = c(0,.1), expand = TRUE,default = FALSE, clip = "on") + ylim(0, .1) + 
  geom_ribbon(aes(time, ymin=lower, ymax=upper), fill = "blue", alpha=0.2)+
  scale_x_continuous(breaks = 1:no_intervals) +
  labs(y="Base Hazard Rate")

Analysis of Results

Within a straightforward framework, we are able to estimate the risk rates and credible intervals for a survival analysis dataset.

The PCH model assumes that each of the estimated parameters are independent from one another. The credible intervals around each hazard rate are determined by only the data within each respective interval. However, this independence assumption in most cases, and for the mastectomy database as well, seems unreasonable and contrary to actuarial intuition and experience. The true hazard rates from one interval to the next are expected to follow a gradual progression exhibiting some form of dependence. For the same reason, the credible intervals are likely too wide. This is a consequence of the limited volume of data within each interval.

We can also see that when there are few or no events within an interval, then the credible intervals are likely underestimated. See the results for interval 5, for example.

Therefore, we need a method of removing the independence assumption between intervals and allow for a smoothing of the posterior results. One way to fix this issue is to make the intervals wider and ensure that there are a minimum number of events within each designated interval. This method might result in a smoother set of risk rates, but the intervals might be too wide for our desired purposes, and could hide changes of risk rates within an interval. We could alternatively smooth the risk rates using a smoothing technique like Loess, but the effect on credible intervals is typically difficult to ascertain.

In a follow up blog post, we will address these issues head on and develop a general technique for adding dependence among risk rates, and smooth best estimates and credible intervals

Conclusion

This post describes a Bayesian approach for developing risk rates in a typical actuarial rate analysis exercise. We show the connection between the PCH model and the Poisson model. The PCH Model is then implemented in Greta, a probabilistic programming language.

Future blog posts will include:

  • How to add the metastasized field into the analysis and and enhance the model into a regression analysis
  • How to add dependence into the model among intervals
  • How to smooth results