Skip to contents

This tutorial describes how to analyze a simple coin-flipping model using treepplr.

We assume that the probability of obtaining heads with our coin is p, and that we have a set of flips informing us about the value of p.

In a Bayesian analysis of this problem, we need to specify a prior probability distribution for the value of p. Here, we will assume that p is drawn from a Beta(2,2) distribution.

Load the required R packages

First load the required R packages:

Load model and data files

Now we can load the coin model script (the probabilistic program describing the coin flipping model) provided with the treepplr package using the following command:

model <- tp_model("coin")

You can load any TreePPL model script into treepplr using the same command, replacing "coin" with the file name of your script.

To view the coin model script, simply type

cat(model)

which will print the script.

The main part of the model is defined in a function called coinModel:

model function coinModel(coinflips: Bool[]) => Real  {
  // Uncomment if you want to test the input
  //printLn("Input:");
  //let coinStr = apply(bool2string, coinflips);
  //printLn(join(coinStr));
  assume p ~ Beta(2.0, 2.0); // prior
  let n = length(coinflips);
  for i in 1 to n {
    flip(coinflips[i], p); // likelihood
  }
  return(p); // posterior
}

The definition of this function is preceded by the keywords model function. All model scripts must have exactly one model function.

The model function takes as input argument(s) the observed data that we wish to condition on. In our case, the data are represented by a sequence (vector or array) of Boolean (TRUE/FALSE) values.

Note that TreePPL uses type annotation of input variables in the form of <argument_name>: <data_type>. The square brackets [] are used to denote a sequence type, that is, coinflips: Bool[] tells us that the model function takes a single argument by the name of coinflips, and the data type is a sequence of Booleans.

The return type of a function is specified using the format => <data_type>. Our model function returns the value of p. In general, the model function should return the model parameters for which we are interested in inferring the posterior distribution.

The model function starts with a few statements that are commented out by having // put in front of them. The code in these lines can be used to print the value of the coinflips argument.

The statement assume p ~ Beta(2.0,2.0) specifies the prior probability distribution for the parameter p in our model. TreePPL provides a number of built-in probability distributions; they all have names starting with a capital letter, like the Beta distribution used here.

The assume statement is followed by a loop over the input data sequence, conditioning the simulation on each observed value (heads/TRUE or tails/FALSE). Specifically, this is achieved by calling the help function flip.

The flip function is defined as follows:

function flip(datapoint: Bool, probability: Real) {
  observe datapoint ~ Bernoulli(probability);
}

Bernoulli is a probability distribution on a binary outcome space, represented as a Boolean variable taking the values TRUE or FALSE. This is the appropriate probability distribution for coin flipping.

The single parameter of the Bernoulli distribution, called probability in the flip function, is assumed by convention to be the probability of obtaining the outcome TRUE. In our case, this would be the probability of obtaining heads.

The observe statement weights the simulation with the likelihood of observing a particlar data point from the Bernoulli distribution.

This completes the TreePPL description of the coin flipping model. All TreePPL model descriptions essentially follow the same format. For a more complete coverage of the TreePPL language, see the language overview in the TreePPL online documentation.

Now we can start analyzing the model by inferring the value of p given some observed sequence of coin flips. To do this, we need to provide the observations in a suitable format. To load the example data provided in the treepplr package, use:

data <- tp_data("coin")

We can look at the structure of the input data using

str(data)
#> List of 1
#>  $ coinflips: logi [1:20] FALSE TRUE TRUE TRUE FALSE TRUE ...

The data are provided as an R list. Each element in the list is named using the corresponding argument name expected by the TreePPL model function. The arguments can be given in any order in the list. In our case, the model function only takes one argument called coinflips, so the R list only contains one element named coinflips.

The TreePPL compiler requires data in JSON format. The conversion from R variables to appropriate JSON code understood by TreePPL is done automatically by treepplr for supported data types. For instance, logical vectors in R are converted to sequences of Booleans (Bool[]) in TreePPL.

Run TreePPL

Now we can compile and run the TreePPL program, inferring the posterior distribution of p conditioned on the input data. This is done using the tp_treeppl() function provided by treepplr. The function has many optional arguments that allow you to select among the inference methods supported by TreePPL, and setting relevant options for each one of them. For an up-to-date description of available inference strategies supported, see the documentation of the tp_treeppl function.

Here, we will use the default settings for inference methods. It uses sequential Monte Carlo (SMC), specifically the bootstrap particle filter version (the method=smc-bpf option). It runs 100 sweeps (100 SMC runs, if you wish), using 100 particles for each sweep (n_runs=100, samples=100).

output_list <- tp_treeppl(model = model, data = data, samples = 100, n_runs = 100)

The run should take a few seconds to complete depending on your machine.

In general, the TreePPL inference strategies can be described as nested approaches where the outermost shell defines the character of the inference output. If the outer shell is SMC, as is the case here, the returned object will be a nested R list. Each entry in the outermost list layer will correspond to a sweep. Each sweep will contain three values: the normalizing constant estimated in that sweep, each of the returned values from the model function (one returned value for each SMC particle), and the likelihood weight of each returned value (one weight for each particle).

Plot the posterior distribution

In SMC we can assess the quality of the inference by running several sweeps and comparing their normalizing constants to test if they generated similar estimates of the posterior distribution. A popular argument from the SMC literature suggests that the SMC estimate is accurate if the variance of the estimates of the normalizing constant across sweeps is lower than 1.

To check the variance of the normalizing constant, use the tp_smc_convergence() function.

tp_smc_convergence(output_list)
#> [1] 0.01126985

It seems that our run provides a quite accurate estimate of the posterior distribution, as the variance is much smaller than 1.0.

It is also easy to plot the sampled values. The tp_parse() function helps us convert the results returned by TreePPL into an R list of appropriately weighted values, taking both the particle weights and normalizing constants (the sweep weights, if you wish) into account. Note that both the particle weights and normalizing constants are given in log units.

output <-  tp_parse(output_list) %>% 
  dplyr::mutate(total_lweight = log_weight + norm_const) %>% 
  dplyr::mutate(norm_weight = exp(total_lweight - max(.$total_lweight)))

ggplot2::ggplot(output, ggplot2::aes(samples, weight = norm_weight)) +
  ggplot2::geom_histogram(ggplot2::aes(y = ggplot2::after_stat(density)),
                 col = "white", fill = "lightblue", binwidth=0.01) +
  ggplot2::geom_density() +
  ggplot2::theme_bw()