This tutorial describes how to analyze a simple coin-flipping model using treepplr.
We assume that the probability of obtaining heads with our coin is
p, and that we have a set of flips informing us about the
value of p.
In a Bayesian analysis of this problem, we need to specify a prior
probability distribution for the value of p. Here, we will
assume that p is drawn from a Beta(2,2)
distribution.
Load model and data files
Now we can load the coin model script (the probabilistic program describing the coin flipping model) provided with the treepplr package using the following command:
model <- tp_model("coin")You can load any TreePPL model script into treepplr using
the same command, replacing "coin" with the file name of
your script.
To view the coin model script, simply type
cat(model)which will print the script.
The main part of the model is defined in a function called
coinModel:
model function coinModel(coinflips: Bool[]) => Real {
// Uncomment if you want to test the input
//printLn("Input:");
//let coinStr = apply(bool2string, coinflips);
//printLn(join(coinStr));
assume p ~ Beta(2.0, 2.0); // prior
let n = length(coinflips);
for i in 1 to n {
flip(coinflips[i], p); // likelihood
}
return(p); // posterior
}
The definition of this function is preceded by the keywords
model function. All model scripts must have exactly one
model function.
The model function takes as input argument(s) the observed data that
we wish to condition on. In our case, the data are represented by a
sequence (vector or array) of Boolean
(TRUE/FALSE) values.
Note that TreePPL uses type annotation of input variables in the form
of <argument_name>: <data_type>. The square
brackets [] are used to denote a sequence type, that is,
coinflips: Bool[] tells us that the model function takes a
single argument by the name of coinflips, and the data type
is a sequence of Booleans.
The return type of a function is specified using the format
=> <data_type>. Our model function returns the
value of p. In general, the model function should return
the model parameters for which we are interested in inferring the
posterior distribution.
The model function starts with a few statements that are commented
out by having // put in front of them. The code in these
lines can be used to print the value of the coinflips
argument.
The statement assume p ~ Beta(2.0,2.0) specifies the
prior probability distribution for the parameter p in our
model. TreePPL provides a number of built-in probability distributions;
they all have names starting with a capital letter, like the
Beta distribution used here.
The assume statement is followed by a loop over the
input data sequence, conditioning the simulation on each observed value
(heads/TRUE or tails/FALSE). Specifically,
this is achieved by calling the help function flip.
The flip function is defined as follows:
function flip(datapoint: Bool, probability: Real) {
observe datapoint ~ Bernoulli(probability);
}
Bernoulli is a probability distribution on a binary
outcome space, represented as a Boolean variable taking the values
TRUE or FALSE. This is the appropriate
probability distribution for coin flipping.
The single parameter of the Bernoulli distribution,
called probability in the flip function, is
assumed by convention to be the probability of obtaining the outcome
TRUE. In our case, this would be the probability of
obtaining heads.
The observe statement weights the simulation with the
likelihood of observing a particlar data point from the Bernoulli
distribution.
This completes the TreePPL description of the coin flipping model. All TreePPL model descriptions essentially follow the same format. For a more complete coverage of the TreePPL language, see the language overview in the TreePPL online documentation.
Now we can start analyzing the model by inferring the value of
p given some observed sequence of coin flips. To do this,
we need to provide the observations in a suitable format. To load the
example data provided in the treepplr package, use:
data <- tp_data("coin")We can look at the structure of the input data using
str(data)
#> List of 1
#> $ coinflips: logi [1:20] FALSE TRUE TRUE TRUE FALSE TRUE ...The data are provided as an R list. Each element in the list is named
using the corresponding argument name expected by the TreePPL model
function. The arguments can be given in any order in the list. In our
case, the model function only takes one argument called
coinflips, so the R list only contains one element named
coinflips.
The TreePPL compiler requires data in JSON format. The conversion
from R variables to appropriate JSON code understood by TreePPL is done
automatically by treepplr for supported data types. For
instance, logical vectors in R are converted to sequences of Booleans
(Bool[]) in TreePPL.
Run TreePPL
Now we can compile and run the TreePPL program, inferring the
posterior distribution of p conditioned on the input data.
This is done using the tp_treeppl() function provided by
treepplr. The function has many optional arguments that allow
you to select among the inference methods supported by TreePPL, and
setting relevant options for each one of them. For an up-to-date
description of available inference strategies supported, see the
documentation of the tp_treeppl function.
Here, we will use the default settings for inference methods. It uses
sequential Monte Carlo (SMC), specifically the bootstrap particle filter
version (the method=smc-bpf option). It runs 100 sweeps
(100 SMC runs, if you wish), using 100 particles for each sweep
(n_runs=100, samples=100).
output_list <- tp_treeppl(model = model, data = data, samples = 100, n_runs = 100)The run should take a few seconds to complete depending on your machine.
In general, the TreePPL inference strategies can be described as nested approaches where the outermost shell defines the character of the inference output. If the outer shell is SMC, as is the case here, the returned object will be a nested R list. Each entry in the outermost list layer will correspond to a sweep. Each sweep will contain three values: the normalizing constant estimated in that sweep, each of the returned values from the model function (one returned value for each SMC particle), and the likelihood weight of each returned value (one weight for each particle).
Plot the posterior distribution
In SMC we can assess the quality of the inference by running several sweeps and comparing their normalizing constants to test if they generated similar estimates of the posterior distribution. A popular argument from the SMC literature suggests that the SMC estimate is accurate if the variance of the estimates of the normalizing constant across sweeps is lower than 1.
To check the variance of the normalizing constant, use the tp_smc_convergence() function.
tp_smc_convergence(output_list)
#> [1] 0.01126985It seems that our run provides a quite accurate estimate of the posterior distribution, as the variance is much smaller than 1.0.
It is also easy to plot the sampled values. The tp_parse() function helps us convert the results returned by TreePPL into an R list of appropriately weighted values, taking both the particle weights and normalizing constants (the sweep weights, if you wish) into account. Note that both the particle weights and normalizing constants are given in log units.
output <- tp_parse(output_list) %>%
dplyr::mutate(total_lweight = log_weight + norm_const) %>%
dplyr::mutate(norm_weight = exp(total_lweight - max(.$total_lweight)))
ggplot2::ggplot(output, ggplot2::aes(samples, weight = norm_weight)) +
ggplot2::geom_histogram(ggplot2::aes(y = ggplot2::after_stat(density)),
col = "white", fill = "lightblue", binwidth=0.01) +
ggplot2::geom_density() +
ggplot2::theme_bw()