I. Probabilistic Programming for Actuarial Science


A core task of actuarial analysis is estimation of risk rates for insured events. This includes analyses such as estimating mortality rates, lapse rates, incidence rates, termination rates and so on. In addition to determining best estimates for risk rates, understanding the uncertainty around the best estimates is a critical component of the overall analysis.

Probabilistic programming provides a very useful and powerful mechanism for tackling this overall task. The basic concept underpinning probabilistic programming is to establish a model where the unknown elements of the model are treated as random variables. In actuarial applications, the probabilistic program will typically have two components: an inference step to estimate the risk parameter; and a predictive step to estimate future claims or back test the model on historic experience.

There is now a wide variety of probabilistic programming languages including BUGS, JAGS, and Stan. In this post and in the series that follows, I am going to focus on a new language that is extremely powerful but relatively easy to use. The language is called Greta. It uses Google’s deep learning platform, Tensorflow, as its calculation engine. The big advantage of Greta is that it can be written entirely in R. Its syntax is simple and intuitive and it can harness the power of GPUs or multicore CPUs in a fast compiled language.

The website https://greta-stats.org contains a gentle introduction to the language. Background in Bayesian analysis is also needed. There are now many books on Bayesian analysis. The Bugs Book (Lunn), or Doing Bayesian Analysis (Kruschke) are good starting points. For more advanced practical approaches, the discussion forums on the Stan website as well as the Stan manual are fantastic. Although Stan is an alternative probabilistic programming language, the approaches used are very similar. Finally, for Greta-specific help, try the Greta forum.

How Probabilistic Programming Works

The key idea behind probabilistic programming in an actuarial context is to start with data (“\(D\)”) such as claims experience, generate the estimated distribution of the parameters or risk rates (“\(\theta\)”) of a claims model, then use these same parameters to predict future claims.

The Modeling Process Workflow

\(Data \rightarrow \theta \rightarrow Posterior~Predicted~Claims\)

For example, the probability of a claim (claims incidence) could be defined by a Bernoulli distribution, with parameter \(\theta\). Thus \(Pr(claim) = \theta\), where \(0\leq\theta\leq1\). In a Bayesian context, \(\theta\) is deemed to be a random variable with its own distribution. Inference involves using available claims data to estimate the posterior distribution of \(\theta\). Once the posterior distribution is known, we can then estimate future claims, using the “posterior predictive distribution” of future claims.

Bayes theorem underpins this two-step approach:

Step 1: Bayes Theorem:
\[ \begin{aligned} Pr(\theta | D) = \frac{Pr(D | \theta). Pr(\theta)}{Pr(D)} \dots (i) \end{aligned} \]

In an actuarial context, the data, \(D\), is typically the claims history, namely \(D=Claim_{history}\). By using the above formula, we can estimate the parameters of the model and achieve the first step, namely:

\(Data \rightarrow \theta\)

Step 2: Posterior Predictive Distribution:

The distribution of any new claim, \(Claim_{new}\), given the claims history, \(Claim_{history}\) is the posterior predictive distribution (PPD) of a claim:

\[ \begin{aligned} \text{PPD of } Claim_{new}&= Pr (Claim_{new} | D)\\ &= \int_{\theta}{Pr (Claim_{new}, \theta| D)d\theta} \\ &= \int_{\theta}{Pr (Claim_{new} | \theta).Pr(\theta| D)d\theta} \dots (ii) \end{aligned} \]

The PPD enables the second step:

\(\theta \rightarrow Posterior~Predicted~Claims\)

That is, the PPD is the weighted average probability of \(Pr(Claim_{new} | \theta)\) for each possible value of \(\theta\) where the weights are \(Pr(\theta| D)\).

There is a subtle but important difference between a “posterior” distribution and a “posterior predictive” distribution. The posterior distribution of \(\theta\) represents our internal belief state about the riskiness of a claim. The parameter \(\theta\) is an unobservable quantity. The posterior predictive distribution, on the other hand, is a prediction of an objectively observable quantity, in our case a claim (see Murphy for more a more in-depth discussion). The PPD therefore integrates out or marginalizes the unobservable or latent parameter \(\theta\), and provides a probability statement about an observable quantity, the incidence of a claim.

Using these two formulae, we can therefore both infer the risk rates of the business and predict future claims.

Unfortunately, for all but the simplest risk models, equation (i) cannot be determined analytically. We have to use numerical techniques to estimate \(\theta\), namely the risk rates.

Probablistic programming provides the means of specifying the above two equations and solving the problem without having to worry about the complexity of the numerical techniques underlying the estimation procedures.

In fact, what languages like Stan and Greta do is that they simulate S posterior samples, of \(\widetilde{\theta}_s ={\theta_s | D}\). If \(S=1000\), say, then these 1000 draws from the posterior, \(\widetilde{\theta}_s\) for \(s=1,...,1000\), provide an empirical distribution of the posterior risk parameter. We can then use these posterior draws, in turn, to predict a distribution of 1000 claims probabilities, providing an empirical distribution of the predicted claims.

An Example

Let’s take a look at a simple example that can in fact be solved analytically, and compare it to the approach in Greta.

Suppose we want to analyze the risk of a claim occuring based on a database of N policies, a portion of which produced claims.

If a claim is represented by a “1” and no claim by a 0, then the data, \(D\), will consist of N values of either 0 or 1.

Suppose the probability of a claim has a Bernoulli distribution, namely \(Pr(Claim_i=1) = \theta\), for \(i=1, ..., N\). Here, the \(Pr(Claim)\) and the risk parameter \(\theta\) are the same, but this will generally not be the case.

The Prior: Assume that the prior distribution for \(\theta\) is uniformly distributed between 0 and 1. That is \(Pr(\theta) = 1\) for \(0\leq\theta\leq1\). This prior distribution implies that without any data, the risk of a claim can lie anywhere between 0 and 1, with equal probability.

The Likelihood: If there is a claim for a policy, the likelihood of such event is \(\theta\). If there is no claim, the likelihood is \(1-\theta\). If \(C\) is the number of claims, namely \(C=\sum_{i=1}^N{Claim_i}\), and \(N-C\) is the number of policies without a claim, then the likelihood for all the data can be written as:

\[ \begin{aligned} Pr(D|\theta) &=\prod_1^{C}theta. \prod_1^{N-C}(1-theta)\\ &=\theta^C.(1-\theta)^{N-C}\\ \end{aligned} \]

The posterior distribution for \(\theta\) can then be determined from Bayes formula as: \[ \begin{aligned} \text{Posterior } \{\theta\} &= Pr (\theta | D)\\ &\propto \{\text{Likelihood}\}.\{\text{Prior}\}\\ &= \{\theta^C.(1-\theta)^{N-C}\}.\{1\}\\ &\propto Beta(C+1, N-C+1) \end{aligned} \] Therefore, the posterior is determined as: \(\theta|D \sim Beta(C+1, N-C+1)\)

To replicate this example in Greta, we simulate some data and solve the analysis as follows:

# Set up model and simulate data


# Simulate data
theta_true = 0.1
D = rbernoulli(n = N, p = theta_true)*1

The true \(\theta\) is theta_true. The simulated average claims rate is

## [1] 0.11

The Greta model is defined as follows. It requires three general steps:

  • Identify the data input
  • Specify the prior distribution
  • Specify the likelihood
  # Identify the data input for Greta using the as_data function

  # Define the prior for the theta parameter
  theta = uniform(min = 0, max = 1)  # theta ~ U(0,1)
  # Define the likehood for the model
  distribution(D) = bernoulli(theta)  # D ~ Bern(theta)

That’s the whole model.

We can now establish and compile the model:

  # Establish the model
  m= model(theta)

We run the model and extract the simulated posterior \(\theta|D\) parameters using Markov chain Monte Carlo (MCMC). MCMC is the numerical technique that is used to solve the Bayes formula for the desired parameters. See the Greta website on how to choose the chains, n_samples and warmup parameters.

n_samples = 1000; chains=5
S = n_samples * chains # Total number of simulations
draws=mcmc(m, n_samples = n_samples, warmup = 200,chains = chains)

Greta does not give an analytical distribution for \(\theta\) but instead provides a sample of draws from posterior. We now compare the Greta sample to the sample drawn from the true posterior.

As can be seen the “true” posterior samples drawn from the analytical distribution and the Greta posterior samples are virtually indistinguishable.

The posterior predictive distribution, \(Pr(Claim_{new}|D)\), derived from equation (ii) using the true analytic expression is:

\[ \begin{aligned} \Pr (Claims_{new} | D) &= \int_{\theta}{Pr (Claims_{new} | \theta).Pr(\theta| D)d\theta}\\ &\propto \int_{\theta}{\{\theta\}.\{\theta^C.(1-\theta)^{N-C}\}d\theta}\\ \end{aligned} \]

The right hand side of the expression is the formula for the mean of a \(Beta(C+1, N-C+1)\) distribution, namely \(\frac{C+1}{(C+1)+(N-C+1)}\) (see Wikipedia for the mean of a Beta Distribution). Therefore, using the formula for the mean of a Beta distribution, we get:

\(Pr (Claims_{new} | D) = \frac{C+1}{(C+1) +(N-C+1)} =\) 0.11765.

A 95% credible interval for the posterior claims probability can similary be found as:

round(c(lower_0.025 = qbeta(p = 0.025, C+1, N-C+1), upper_0.975=qbeta(p = 0.975, C+1, N-C+1)),4)
## lower_0.025 upper_0.975 
##      0.0629      0.1865

Now, to get the posterior predictive distribution for a claim using Greta, we use an approximate form of formula (ii). Firstly, from Greta, we have \(S\) draws from the posterior distribution of \(\theta\), namely \(\widetilde{\theta}_s=\theta_s|D\) for \(S=1,...,S\). Since each of the \(S\) draws of is equally likely, and by the Bernoulli formula: \(Pr(Claim_{new}|\widetilde{\theta})=\widetilde{\theta}\), then the PPD for a new claim is:

\(Pr(Claim_{new}) = \sum_s^S\widetilde{\theta}_s\frac{1}{S}\).

The posterior probability of a claim is therefore:

(posterior_prob = mean(unlist(draws)))
## [1] 0.1171895

We can also calculate a 95% credible interval for this probability as:

round(quantile(x = unlist(draws), probs = c(0.025, 0.975)),4)
##   2.5%  97.5% 
## 0.0625 0.1861

In summary, we see that the empirical solution from Greta provides a very close approximation to the true analytic solution for the posterior predictive probability of a claim.

Theoretically, by increasing the number of samples of \(\widetilde\theta_s\), we can make the posterior predictive distribution arbitrarily as accurate as we want.

Finally, note that the actual claims ratio \(\frac{C}{N}\) or 0.11 is slightly different from the posterior predictive probability. This arises because the prior for \(\theta\) has a small impact on the ultimate result.


In this post, I introduced the Bayesian approach for inferring risk rates and predicting future claims. We used a simple example that can be solved analytically and compared it to the approach in Greta, a probabilistic programming language. We showed how an empirical distribution can be used to estimate the posterior predictive distribution for future claims as well as calculate a credible interval around such probability.

Although this appears to be somewhat of a “toy” example, the key elements of a Bayesian analysis in an actuarial context are touched upon and the example provides a good template for future more complex models.

In future posts, I will expand the analysis to include a more advanced survival analysis model that can be used for many actuarial risk analyses. I will look at adding predictors to the analysis and calculate a multi-dimensional “surface” of risk rates rather than a single point estimate. We will also look in more detail at diagnostic testing of how accurate or inaccurate the MCMC approach is and we will attempt to use a more systematic approach to defining a prior distribution.