Using PMwG for NOT psychology

This blog post uses the pmwg package for a more generic example than has previously been shown (here for example).

PMwG

The particle metropolis within Gibbs sampling package (pmwg) was created by a team at the University of Newcastle. The package provides functions to estimate Bayesian hierarchical models. What this means is;

  • Bayesian; probability based, sampling parameters to form a distribution of parameter values rather than a point estimate, which gives us a measure of uncertainty

  • hierarchical; multi levels - i.e., a group level and a subject level. Think of a school (the group mean) and classes within the school (each with their own mean)

  • model-based sampling; propose lots of values for model parameters and find those which are probabilistically most likely

So far, the package has mainly been used for psychological applications, where lots of data is collected per subject, and this data is then fit with complex models (which cannot be fit with standard methods). For an explanation of the full method see our other documentation or for the full maths, see here

PMwG_new

Recently, working together with Niek Stevenson, we’ve added functionality to the PMwG package. These additions are not detailed here; however, I do use a couple of the new functions as they are pretty neat and usable. The additional functions can be seen here.

Why would we use this?

Well, there are a few reasons. For this example, using PMwG is probably overkill, but it makes a nice example for other situations where we might have many observations and many individuals. These bigger data problems represent more realistic scenarios, but importantly, we would want to get a sense of the group level parameter distribution, as well as the individual subjects’ data. For example, if we had a school with many classes (say 50), where each class included 50 students, then we could estimate a model for each class, which would inform the group level model.

The reason to use PMwG is when we have difficult estimation problems. Many models do not have analytic solutions, and so model estimation is often used to overcome this. Even with an analytic solution, doing hierarchical modelling can be challenging, which is where PMwG shows more usefulness. We could use optim to find the most likely values for a problem, but this is not reliable and only does calculations at one level (group or individual). Further, we could use another sampler (like stan or jags), however, with PMwG we can achieve greater efficiency, there is no reliance on multiple chains or convergence (due to sampler properties) and we can reliably estimate a more complex group level model – including the multivariate normal we use as default which allows us to estimate parameter correlations. The latter point is particularly useful for models with collinearity between parameters (i.e., as x goes up, so does y). I’ll do a blog in future showing just how bad post-hoc correlations between parameters are, and how PMwG overcomes this.

So that’s a lot of reasons to use this method, but let’s summarise quickly.

  • We use PMwG to estimate statistical models (often with no analytic solution).
  • The models we would estimate are able to have their density calculated/approximated.
  • The data we model should have many observations (differs depending on model complexity) for a few different individuals (really want at least 20 to inform the group).
  • PMwG allows us to do hierarchical modelling. This means we have a group and individual level distribution.
  • PMwG is Bayesian. This means we can evaluate a level of certainty of our parameter estimates for neater inference.
  • PMwG is reliable and efficient.
  • PMwG allows us to estimate between parameter correlations.

The Data

In the following example, I use the fastfood dataset from the openintro package. The data includes 515 menu items from 8 different fast food restaurants. Each item has 15 metrics to describe its nutritional value. Here we only use the first 11 metrics due to missing data (something I’ll address using PMwG in another blog post).

library(openintro)
data <- fastfood[,1:13]

Setup

First we need to source the pmwg_new code and set up the environment;

rm(list=ls())

# library(pmwg)
source("pmwg/variants/standard.R")
library(mvtnorm)
library(openintro)

Next, we need to clean the data. Here I include a subject column. This is a requirement of the pmwg package - you must include a subject column. Here, the restaurant column is labelled as the subject. This is because I’m going to treat each restaurant as an individual entity, where each food item is an “observation” from their menu.

I also remove the food item label for coding simplicity purposes. Next I’m going to omit any items with missing data, before finally rescaling four metrics for modelling purposes.

NOTE: These four rescaled metrics don’t need to be normalised, but the values are far greater than the other measures. Here, by dividing them all by 100, I don’t have to set different priors (it’s mainly because I’m being lazy).


names(data)[1]<-"subject"
data <- data[c(1,3:13)]
data <- na.omit(data)
data[,c(2,3,7,8)]<- data[,c(2,3,7,8)]/100

Here, we don’t have many observations per restaurant, which makes estimation harder. We also probably should have more restaurants, but this serves as a nice intro example.
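As a quick check (just my own peek at the cleaned data, not something the package needs), we can count how many menu items each restaurant contributes;

# how many observations (menu items) each "subject" (restaurant) contributes
table(data$subject)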

Likelihood Function

The likelihood function is the engine in PMwG. The nicest part of this engine is that you can craft it yourself, giving lots of flexibility. The likelihood function takes the proposed parameters and calculates the likelihood of these given the subject’s (restaurant’s) data and the model you’re using.

Before we write this though, we’re first going to check the shapes of the distributions of nutrition metrics.


par(mfrow=c(3,4))
apply(data[,-1], 2, hist, main = NULL)

Two things are immediately apparent that need to be accounted for in the likelihood. First, all values are greater than 0. Second, we can see that all distributions are positively skewed. We’re going to make sure we do something about these in the likelihood.

Full Function

Shown below is the full likelihood function, but I’ll break it down slowly in this section;

ll <- function(x, data, sample = F ){
  x=exp(x)
  if(sample == FALSE){
    n_pars = length(x)
    tmpData = as.matrix(data[,-1])
    out = sum(TruncatedNormal::dtmvnorm(tmpData, x, diag(1, nrow = n_pars), lb = rep(0,n_pars), log = T, B = 1))
    return(out)
  }else{
    n_pars = length(x)
    Restaurant = data[,1]
    tmpData = as.matrix(data[,-1])
    out = TruncatedNormal::rtmvnorm(nrow(tmpData), x, diag(1, nrow = n_pars), lb = rep(0,n_pars))
    out <- as.data.frame(cbind(out,Restaurant))
    return(out)
  }
  
}

In the likelihood function, we need to include three main arguments;

  • x is the vector of proposed (named) parameter values
  • data is the data (note that each time using this will only use data for the one subject being estimated)
  • sample = FALSE specifies whether we are sampling from the model (TRUE) or calculating the density (FALSE, the default)
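To make this concrete, here is a small sanity check you could run by hand. The parameter values and the “Mcdonalds” subset below are purely illustrative assumptions (not fitted values); the point is just to show how the function is called. The proposal is on the log scale - the next section explains why;

# illustrative only: a made-up log-scale proposal, one mean per metric
x_test <- setNames(rep(log(5), ncol(data) - 1), names(data)[-1])
# log-likelihood of one restaurant's menu items under these proposed means
ll(x_test, data = data[data$subject == "Mcdonalds", ])

Larger (less negative) values mean the proposal describes that restaurant’s menu better; the sampler automates exactly this kind of comparison across many proposals.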

Transformations

The first thing I do in the likelihood is transform the input parameter values. All the values in the data are positive, so the model’s mean parameters should be positive too. However, because proposals come from a multivariate normal at the group level, they can take any real value, so I need to ensure that the parameters used in the model are positive.

Taking the exponential of the proposed parameters means these will be strictly positive. This has implications for the implied prior, but I’ll discuss this in the advanced section later on.
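As a tiny illustration of the transform (nothing model-specific here);

# any real-valued proposal maps to a strictly positive value after exp()
exp(c(-2, 0, 2))
## [1] 0.1353353 1.0000000 7.3890561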

The Model

For this example, I was considering using a multivariate normal model, but I found a truncated multivariate normal function that makes it slightly more interesting and accurate (we’ll do some model comparison below too). When thinking about the “model” we’re using, we’re talking about the model describing the data for each individual (restaurant).

Looking at the group level distribution above, we can see a skewed distribution, but it’s worthwhile thinking about this distribution per individual restaurant as well. This means we make the assumption that each individual follows this model, and that informs a population level model (and can allow us to generate or predict other restaurants).
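To see whether that per-restaurant assumption looks plausible, here is an optional check I like to do (calories is just an example choice, and remember it has been rescaled by 100);

# distribution of one metric (calories, /100) within each restaurant
par(mfrow = c(2, 4))
for (s in unique(data$subject)) {
  hist(data$calories[data$subject == s], main = s, xlab = "calories (/100)")
}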

For the truncated multivariate model I use the dtmvnorm and rtmvnorm functions from the TruncatedNormal package for density calculation and random generation respectively.

Calculating the Density

Here we calculate the likelihood of the data given some parameter values. All I’m doing is evaluating how probable the proposed parameters are for each restaurant’s data. Before I show the functions, I like to do a small thought exercise to show how this works in practice.

Thought exercise for likelihoods

For this example, let’s generate some data from a normal distribution. Here I make some “parameter” values which I’ll save. These parameters are just mean and standard deviation for the normal.

truePars = c("mean"=6,"sd"=.3)
exampleData <- rnorm(n=1000, mean = truePars["mean"], sd= truePars["sd"])

Now, using this data, I’ll try and input some other parameters to see how likely these are for the data;

# first try
sum(dnorm(exampleData,mean=0,sd=1, log = T))
## [1] -18966.2
# a lot off, so lets shift the mean
sum(dnorm(exampleData,mean=7,sd=1, log = T))
## [1] -1460.141
# what if we move the sd
sum(dnorm(exampleData,mean=6,sd=.5, log = T))
## [1] -394.0612
# now lets try them together
sum(dnorm(exampleData,mean=6.5,sd=.3, log = T))
## [1] -1566.464
# very close
sum(dnorm(exampleData,mean=8,sd=.4, log = T))
## [1] -12754.75
# too far
sum(dnorm(exampleData,mean = truePars["mean"], sd= truePars["sd"], log = T))
## [1] -182.3819
# nice

We can see how the probability increases as we get closer to the generating values. In PMwG, this process is taken care of under the hood. Hopefully this shows just what the algorithm is trying to do. This is also a good test for your model, where the likelihood should increase with “better” test values. It’s also a good idea to do simulation and recovery to make sure your model “works”. This means generating values and then fitting the model with the goal of obtaining the generated values. If it doesn’t recover, there could be something wrong with the code, there may not be enough data per condition/individual, or the model’s parameters may simply not be recoverable.
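Sticking with the simple normal example, a minimal recovery sketch might look like the following (using optim purely for illustration - PMwG is not involved here);

# recover the generating parameters from exampleData by maximum likelihood;
# the sd is optimised on the log scale so it stays positive
neg_ll <- function(p) -sum(dnorm(exampleData, mean = p[1], sd = exp(p[2]), log = TRUE))
fit <- optim(c(0, 0), neg_ll)
c(mean = fit$par[1], sd = exp(fit$par[2])) # should land close to 6 and 0.3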

Density function

The density function is shown below;

if(sample == FALSE){
    n_pars = length(x)
    tmpData = as.matrix(data[,-1])
    out = sum(TruncatedNormal::dtmvnorm(tmpData, x, diag(1, nrow = n_pars), lb = rep(0,n_pars), log = T, B = 1)) #B=1 because otherwise it does many iterations and takes too long
    return(out)
  }

There are four steps in getting the likelihood. First, I grab the number of parameters. Second, I drop the restaurant name column and convert the data to a matrix for coding simplicity. Then I do the important part (the density calculation). Finally, I return a single value (the log-likelihood).

For the important part here, I take the sum of all the densities as they are on the log scale (otherwise I would multiply). Using the dtmvnorm function, I include the data, where each row is an observation and each column is a different metric. The means are given by the parameters (x). For the covariance I use an identity matrix (a variance of 1 for each metric). I also include a lower bound of 0 for all metrics.

Importantly, all I’m doing here is checking how well the parameters fit the data, just as in the thought exercise above. This means that the density is calculated for each data point (i.e., each observation or food item), and then multiplied together (summed here because I use logs for simplicity).
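A two-line reminder of that log rule (not part of the model itself);

# densities multiply; log-densities add, which avoids numerical underflow
d <- dnorm(c(5.8, 6.1, 6.3), mean = 6, sd = 0.3)
log(prod(d))
sum(log(d)) # same value, but safe even with thousands of observations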

Random generation

In a very similar sense, to generate random data I use the same machinery in reverse: instead of evaluating the density of observed data, I draw data from the model.

else{
    n_pars = length(x)
    Restaurant = data[,1] # set aside the name column (removed from the data matrix below)
    tmpData = as.matrix(data[,-1])
    out = TruncatedNormal::rtmvnorm(nrow(tmpData), x, diag(1, nrow = n_pars), lb = rep(0,n_pars))
    out <- as.data.frame(cbind(out,Restaurant)) #add the name column back on
    return(out)
  }

What this allows us to do is generate data, given some input parameters for a single subject.
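For example, reusing the illustrative x_test vector from the earlier sketch (again, made-up values rather than estimates), we could simulate a synthetic “menu” for one restaurant;

# sample = TRUE triggers the generation branch: simulate as many rows as the
# restaurant has menu items, using the (illustrative) proposed means
fakeMenu <- ll(x_test, data = data[data$subject == "Mcdonalds", ], sample = TRUE)
head(fakeMenu)

This becomes handy later for posterior predictive checks, where you would plug in sampled parameter values rather than made-up ones.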

PMwG Functions

Now that I’ve done all the setup, I just need to prepare the sampler. To do this, first we need to specify the vector of parameter names. These will be used by the sampler.

pars=names(data)[-1]

Priors

Next I need to specify some priors. This can be daunting and stressful, but it really doesn’t need to be, especially with PMwG. The prior specification here is for the group level model. PMwG assumes that the subject-level parameters (i.e., the means for each restaurant - NOT the truncated model itself) are distributed according to a multivariate normal distribution at the group level. For this prior, we only need to set the mean and the variance.

The prior influences which parameter values are proposed. Here, I set the mean to 0 and the variance to 10. Mean of 0?? Yes, because proposals range from -Inf to Inf. In our likelihood we take care of these by taking the exponential, so a prior mean of 0 corresponds to exp(0) = 1 on the data scale. I set the variance to 10 so that a wide range of values is considered, but given our exponential transformation, I could probably keep this at 1 (a rough check of what this prior implies is sketched after the code below).

For the remaining PMwG priors, we don’t need to stress about the shapes of the group level prior distributions the way researchers often do with other fitting methods.

priors <- list(
  theta_mu_mean = rep(0, length(pars)),
  theta_mu_var = diag(rep(10, length(pars)))) 
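As a rough prior predictive check (my own addition - nothing in the package requires this), we can see what a normal prior with mean 0 and variance 10 on the sampled scale implies for a single metric on the data scale, after the exp() transform;

# what the group level prior implies on the natural (data) scale
set.seed(1)
prior_draws <- exp(rnorm(1e5, mean = 0, sd = sqrt(10)))
round(quantile(prior_draws, c(0.025, 0.5, 0.975)), 3)

The median sits at 1 with a very heavy right tail, so the prior comfortably covers all of the (rescaled) metrics without strongly favouring particular values.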

The sampler

Finally, we can set up the sampler and run the algorithm. To set up the sampler, we use the pmwgs function, where we need to include data, pars, priors and ll_func. Then we can initialise the sampler.

Here, the data must have a subject column so that it can be split into data for each individual. The pars argument should include any parameters used in the model to be estimated. The prior should include a theta_mu_mean vector (the same length as pars) and a theta_mu_var matrix (pars × pars) for the mean and variance respectively.

# Create the Particle Metropolis within Gibbs sampler object ------------------

sampler <- pmwgs(
  data = data,
  pars = pars,
  prior = priors,
  ll_func = ll
)

sampler <- init(sampler) # i don't use any start points here

Sampling

When this is done, we can finally do the sampling!! For sampling, there are three stages. The specifics of these can be seen here. The tl;dr version is;

  • burn used for initial burn-in samples (the throwaway samples) to get to the posterior distribution
  • adapt used to create the efficient proposal distribution used in the final stage. This stage will stop early once that distribution can be created.
  • sample efficiently sample from the posterior

Note that you don’t need to run each stage, you might just run burn-in and throw away the initial samples. However, we’ve seen significant improvements in the quality of samples from the final efficient stage.

For the run_stage function, we need to include the sampler object, the stage we’re running, the number of iterations, particles and computer cores. For iterations and particles, this could take some fine tuning by the user; however, we tend to use 1000 burn-in iterations, 5000 adaptation and then however many samples are desired for the final stage. We generally use either 100 or 300 particles depending on the model complexity. Here I use the pmwg_new argument pstar, which adapts the “acceptance rate” - basically making the sampler more efficient. For more info on all the arguments, see here.

In this example, the model is simple and easy to estimate, so I dial back some of these tuning parameters. To know the best settings to use, you might start with something like below and then tinker as you see fit.

sampled <- run_stage(sampler, stage = "burn",iter = 100, particles = 100, n_cores = 4, pstar=.6) # epsilon will be set automatically to 0.5
sampled <- run_stage(sampled, stage = "adapt",iter = 500, particles = 100, n_cores = 4, pstar=.6) 
sampled <- run_stage(sampled, stage = "sample",iter = 100, particles = 50, n_cores = 4, pstar=.6) 

NOTE: This will take a few minutes to run. Increasing the number of particles or iterations will slow the sampler down. However, we need to reach the posterior (i.e., the good values for the parameters), and to do this we need good information. This means that we may need more particles or iterations. Ideally, for efficiency, we should aim for an acceptance rate between 30 and 80%. Here, pstar helps as I can specify a desired acceptance rate.

Output

Once our sampler is finished, we can check the output.

Checking

First, we should check that we are in the posterior. To do this, we can check the chain plots of the samples. Here we show two plots. The first is the group level parameter chains and the second is the individual subject likelihood chains. For the parameter chains, we’re looking for “stationarity” in the samples. This means that the sampled values will show a flat line with a bit of wiggle (like a hairy caterpillar). For the subject likelihood, we’re looking for flat lines as well, however, we should see an increase in these from the start to the end. These will be less wiggly. See here for more details on this output.

tmp <- sampled
matplot(t(tmp$samples$theta_mu),type="l")

matplot(t(tmp$samples$subj_ll),type="l")

Here we can see that some parameters only reach the posterior in the final sampling stage (x-axis is iterations of sampling). We can see however, that stationarity is reached. Secondly, we can see that the likelihood increases and then stabilizes at the highest value for each restaurant.

Posterior Inference

Next, we can look at the sampled values. Here I show group level and individual restaurant level parameter values. Remember that the sampler generates distributions of possible parameter values, so here I take the median of these values. I also transform these as I did in the likelihood.

groupMedian <- apply(tmp$samples$theta_mu[,tmp$samples$stage=="sample"],1,median)
round(exp(groupMedian),3)
##    calories     cal_fat   total_fat     sat_fat   trans_fat cholesterol 
##       4.842       0.001      23.322       6.782       0.000       0.816 
##      sodium  total_carb       fiber       sugar     protein 
##      12.661      41.485       3.470       6.421      28.770

restaurantMedian <- apply(tmp$samples$alpha[,,tmp$samples$stage=="sample"],1:2, median)
round(exp(restaurantMedian),3)
##             Mcdonalds Chick Fil-A  Sonic  Arbys Burger King Dairy Queen Subway
## calories        5.904       3.870  5.600  4.872       5.672       5.107  4.952
## cal_fat         0.000       0.006  0.000  0.006       0.000       0.003  0.000
## total_fat      31.947      14.475 37.683 26.708      34.755      29.038 18.584
## sat_fat         8.204       3.186 10.982  8.050      10.414      10.415  5.950
## trans_fat       0.000       0.000  0.000  0.004       0.000       0.000  0.002
## cholesterol     1.541       0.596  1.180  0.900       0.884       0.819  0.737
## sodium         14.550      10.735 13.711 15.157      11.673      11.413 12.802
## total_carb     48.858      25.488 47.153 44.797      37.188      38.741 54.857
## fiber           3.394       2.687  2.972  3.022       2.358       3.418  6.606
## sugar          11.041       3.834  6.320  7.477       7.713       6.219 10.094
## protein        40.386      30.841 29.280 29.165      29.285      24.947 30.244
##             Taco Bell
## calories        4.597
## cal_fat         0.000
## total_fat      20.861
## sat_fat         6.409
## trans_fat       0.000
## cholesterol     0.506
## sodium         10.164
## total_carb     46.599
## fiber           5.707
## sugar           3.721
## protein        17.390

We can see the group level values overall, but on their own they’re not overly meaningful. The more meaningful estimates are at the individual restaurant level.
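Because we have full posterior distributions rather than just point estimates, we can also summarise uncertainty. For example, a rough 95% credible interval for each group level mean (back-transformed with exp() as before);

# 2.5%, 50% and 97.5% posterior quantiles of the group level means, data scale
groupCI <- apply(exp(tmp$samples$theta_mu[, tmp$samples$stage == "sample"]), 1,
                 quantile, probs = c(0.025, 0.5, 0.975))
round(t(groupCI), 3)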

We can also look at the distribution of the group parameter values and see how restaurants compare to the group distribution. I’ve only shown two parameters here, but this is a nice example.

library(tidyr)
library(ggplot2)

eg <- apply(tmp$samples$alpha[c(1,10),,tmp$samples$stage=="sample"],1:2, median)
eg<-round(exp(eg),3)
group <- exp(tmp$samples$theta_mu[c(1,10),tmp$samples$stage=="sample"])
group <- as.data.frame(t(group))
eg <- as.data.frame(t(eg))
eg$restaurant <- row.names(eg)
group <- pivot_longer(group, cols = everything(), names_to = "metric", values_to = "value")
eg <- pivot_longer(eg, cols = c("calories","sugar"), names_to = "metric", values_to = "value")

# add vertical mean line to density plot with geom_vline()
ggplot(group, aes(x=value)) +
  geom_density( fill="dodgerblue", alpha=0.5)+
  geom_vline(data=eg,aes(xintercept=value,colour = restaurant), size=.5)+
  geom_text(data=eg,aes(x=value,label = restaurant,y=1), angle=90)+  
  theme_bw()+
  facet_grid(~metric, scales = "free")

Anything unexpected here?

Correlations

Finally, one strength of PMwG is that it assumes a group level multivariate model, with variance given by a covariance matrix. This means we can see correlations between parameters of the model - something that can be very useful with models with multiple components. An example of this is shown below;

cov<-apply(tmp$samples$theta_var[,,tmp$samples$stage=="sample"],1:2, median)
colnames(cov)<-tmp$par_names
rownames(cov)<-tmp$par_names
cor<-cov2cor(cov) #correlation matrix
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor, method="circle", type = "lower", title = "Parameter Correlations", tl.col = "black",mar=c(0,0,2,0))
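The plot above is based on the median covariance matrix. If a particular pair is of interest, we can also look at the posterior distribution of that single correlation rather than a point estimate (indices 1 and 10 are calories and sugar, matching the earlier plot);

# posterior distribution of the calories-sugar correlation across sample-stage iterations
iters <- which(tmp$samples$stage == "sample")
cors <- sapply(iters, function(i) cov2cor(tmp$samples$theta_var[, , i])[1, 10])
hist(cors, main = "calories-sugar correlation", xlab = "correlation")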

Model Comparison

Let’s say we wanted to compare our model (the truncated normal) to something else (standard multivariate normal). How could we do this??

First, let’s write the likelihood function;

ll <- function(x, data, sample = F ){
  x=exp(x)
  if(sample == FALSE){
    n_pars = length(x)
    tmpData = as.matrix(data[,-1])
    out = sum(mvtnorm::dmvnorm(tmpData, x, diag(1, nrow = n_pars), log = T)) # only change is here to the model
    return(out)
  }else{
    n_pars = length(x)
    Restaurant = data[,1]
    tmpData = as.matrix(data[,-1])
    out = mvtnorm::rmvnorm(nrow(tmpData), x, diag(1, nrow = n_pars))
    out <- as.data.frame(cbind(out,Restaurant))
    return(out)
  }
  
}

I ran this model using the exact same code (apart from the likelihood), but I’ve included it below;

rm(list=ls())

# library(pmwg)

source("pmwg/variants/standard.R")
library(TruncatedNormal)
library(openintro)

data <- fastfood[,1:13]
names(data)[1]<-"subject"
data <- data[c(1,3:13)]
data <- na.omit(data)
data[,c(2,3,7,8)]<- data[,c(2,3,7,8)]/100


## plots

par(mfrow=c(4,3))
apply(data[,-1], 2, hist, main = NULL)
dev.off()


### likelihood
ll <- function(x, data, sample = F ){
  x=exp(x)
  if(sample == FALSE){
    n_pars = length(x)
    tmpData = as.matrix(data[,-1])
    out = sum(mvtnorm::dmvnorm(tmpData, x, diag(1, nrow = n_pars), log = T)) # only change is here to the model
    return(out)
  }else{
    n_pars = length(x)
    Restaurant = data[,1]
    tmpData = as.matrix(data[,-1])
    out = mvtnorm::rmvnorm(nrow(tmpData), x, diag(1, nrow = n_pars))
    out <- as.data.frame(cbind(out,Restaurant))
    return(out)
  }
  
}

pars=names(data)[-1]


priors <- list(
  theta_mu_mean = rep(0, length(pars)),
  theta_mu_var = diag(rep(10, length(pars)))) 


# Create the Particle Metropolis within Gibbs sampler object ------------------

sampler <- pmwgs(
  data = data,
  pars = pars,
  prior = priors,
  ll_func = ll
)

sampler <- init(sampler) # i don't use any start points here

sampled <- run_stage(sampler, stage = "burn",iter = 100, particles = 100, n_cores = 4, pstar=.6) # epsilon will be set automatically to 0.5
sampled <- run_stage(sampled, stage = "adapt",iter = 500, particles = 100, n_cores = 4, pstar=.6) 
sampled <- run_stage(sampled, stage = "sample",iter = 200, particles = 50, n_cores = 4, pstar=.6) 

Now we have our two sampled objects, we can do model comparison. For some advanced model comparison methods, see here. For now though, we’re going to compare using an information criterion, where the smaller value indicates the preferred model. The function for the deviance information criterion (DIC) is below;

pmwg.DIC=function(sampled,pD=TRUE){
  nsubj=length(unique(sampled$data$subject))
  
  # the mean likelihood of the overall (sample-stage) model, separately for each subject
  mean.like <- apply(sampled$samples$subj_ll[,sampled$samples$stage=="sample"],1,mean)
  
  # the mean of each parameter across iterations. Keep dimensions for parameters and subjects
  mean.params <- t(apply(sampled$samples$alpha[,,sampled$samples$stage=="sample"],1:2,mean))
  
  # I name mean.params here so it can be used by the likelihood function
  colnames(mean.params)<-sampled$par_names
  
  # log-likelihood for each subject using their mean parameter vector
  mean.params.like <- numeric(nsubj)
  data <- transform(sampled$data, subject=match(subject, unique(subject)))
  for (j in 1:nsubj) {
    mean.params.like[j] <- sampled$ll_func(mean.params[j,], data=data[data$subject==j,], sample=FALSE)
  }
  
  # Effective number of parameters
  pD.est <- sum(-2*mean.like + 2*mean.params.like)
  
  # Deviance Information Criterion
  DIC <- sum(-4*mean.like + 2*mean.params.like)
  
  if (pD){
    return(c("DIC"=DIC,"effective parameters"=pD.est))
  }else{
    return(DIC)
  }
  
}

Now all we need to do is load our two sampled objects and compare!

# First load in the sampled object for the truncated multivariate normal model
load("fastFood_samples_tmvn.RData")
# test
pmwg.DIC(sampled) 
##                  DIC effective parameters 
##            1105994.4             231328.4

# Now load in the sampled object for the standard multivariate normal model
load("fastFood_samples_mvn.RData")
# test
pmwg.DIC(sampled)
##                  DIC effective parameters 
##            1115781.3             231675.7

Evidently, we can see that the smaller DIC value was for our truncated normal model - so we made a good choice picking this one first ;)

Final Word

In this data set, there are only 8 restaurants. Usually for a hierarchical model, we would want at least 20 individuals. Additionally, there are very few observations per restaurant. Similar to the group level, we want a well informed individual level model. The model still works, but because of the low amount of data, there is greater uncertainty.

So it’s not the best data set, but hey, it makes for a nice example. I’ve also skipped over some specifics of the algorithm and some function arguments for brevity (and because they’re generally not super important). If you want to know more about the specifics, please see the bookdown. Hopefully for users outside of psychology (or even outside of decision making modelling), this example can help contextualize the bookdown.

Happy coding!

Reilly
