Publishable Stuff

Rasmus Bååth's Blog


Hello stranger, and welcome! 👋😊
I'm Rasmus Bååth, data scientist, engineering manager, father, husband, tinkerer, tweaker, coffee brewer, tea steeper, and, occasionally, publisher of stuff I find interesting down below👇


The Bayesian Counterpart of Pearson’s Correlation Test

2013-08-19

Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check whether two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work is a more sophisticated analysis attempted (“That p-value is still too big, maybe a mixed-models logistic regression will make it smaller…”). One reason for this is that Pearson’s correlation test is conceptually quite simple and has assumptions that make it applicable in many situations (but it is definitely also used in many situations where the underlying assumptions are violated).

Since I’ve converted to “Bayesianism” I’ve been trying to figure out what Bayesian analyses correspond to the classical analyses. For t tests, chi-square tests and ANOVAs I’ve found Bayesian versions that, at least conceptually, are testing the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an ANOVA, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling can often fill the same role as correlation testing (quantifying relations between continuous variables), or perhaps I’ve been looking in the wrong places.

The aim of this post is to explain how one can run the Bayesian counterpart of Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follow a bivariate normal distribution. That is, if we have a list $x$ of pairs of data points $[[x_{1,1},x_{1,2}],[x_{2,1},x_{2,2}],[x_{3,1},x_{3,2}],…]$ then the $x_{i,1}\text{s}$ and the $x_{i,2}\text{s}$ are each assumed to be normally distributed with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$, which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the Wikipedia page on the multivariate normal distribution:

A bivariate normal distribution
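To give a flavor of where this is heading, below is a minimal sketch of how such a bivariate normal model could be written in JAGS. This is a sketch under stated assumptions (the data arriving as an n × 2 matrix x, and placeholder priors), not necessarily the exact model developed in the full post:

library(rjags)

# A sketch: a bivariate normal likelihood with the correlation rho as a free
# parameter. dmnorm is parameterized by a mean vector and a precision matrix.
model_string <- "model {
  for(i in 1:n) {
    x[i, 1:2] ~ dmnorm(mu[], prec[ , ])
  }
  # The precision matrix is the inverse of the covariance matrix.
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]
  # Vague placeholder priors on the means, the SDs and the correlation.
  rho ~ dunif(-1, 1)
  mu[1] ~ dnorm(0, 0.001)
  mu[2] ~ dnorm(0, 0.001)
  sigma[1] ~ dunif(0, 100)
  sigma[2] ~ dunif(0, 100)
}"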

Read on →

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part three.

2013-08-05

In part one and part two of Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model I developed a model for the number of goals in football matches from five seasons of La Liga, the premier Spanish football league. I’m now reasonably happy with the model and want to use it to rank the teams in La Liga and to predict the outcome of future matches!

Ranking the Teams of La Liga

We’ll start by ranking the teams of La Liga using the estimated skill parameters from the 2012/2013 season. The values of the skill parameters are difficult to interpret as they are relative to the skill of the team that had its skill parameter “anchored” at zero. To put them on a more interpretable scale I’ll first zero-center the skill parameters by subtracting the mean skill of all teams, then add the home baseline and exponentiate the resulting values. These rescaled skill parameters are now on the scale of the expected number of goals when playing as the home team. Below is a caterpillar plot of the median of the rescaled skill parameters together with the 68% and 95% credible intervals. The plot is ordered according to the median skill and thus also gives the ranking of the teams.

# The ranking of the teams for the 2012/13 season.
# Extract the MCMC samples of the skill parameters for season 5 (2012/13).
team_skill <- ms3[, str_detect(string = colnames(ms3), "skill\\[5,")]
# Zero-center the skills and add the home baseline of season 5.
team_skill <- (team_skill - rowMeans(team_skill)) + ms3[, "home_baseline[5]"]
# Exponentiate to get expected home goals, and label the columns by team.
team_skill <- exp(team_skill)
colnames(team_skill) <- teams
# Order the teams by mean skill, best team first.
team_skill <- team_skill[, order(colMeans(team_skill), decreasing = TRUE)]
par(mar = c(2, 0.7, 0.7, 0.7), xaxs = "i")
caterplot(team_skill, labels.loc = "above", val.lim = c(0.7, 3.8))

Caterpillar plot of the rescaled team skill parameters

Two teams are clearly ahead of the rest, FC Barcelona and Real Madrid CF. Let’s look at the credible difference between the two teams.
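One way to do this, sketched below assuming the columns of team_skill are labeled with the team names used above, is to look directly at the posterior distribution of the difference in rescaled skill:

# A sketch (the exact column names are an assumption): the posterior
# difference in rescaled skill between the two top teams.
skill_diff <- team_skill[, "FC Barcelona"] - team_skill[, "Real Madrid CF"]
hist(skill_diff, main = "Posterior difference in home-goal rate")
mean(skill_diff > 0) # posterior probability that Barcelona has the higher rate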

Read on →

CogSci 2013 in Berlin was Hot!

2013-08-03

One month ago I went to UseR 2013 in Albacete, Spain, which was a great conference featuring 33° C and many interesting presentations. Surprisingly, CogSci 2013 in Berlin, which I attended this week, was even warmer; outside it is currently 36° C! There were also many interesting presentations at CogSci 2013, with some of the highlights being the symposia on Music Cognition, including a talk by Edward Large on how to model pitch and harmony perception using dynamical systems.
Read on →

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part two.

2013-07-28

In the last blog post I showed my initial attempt at modeling football results in La Liga using a Bayesian Poisson model, but there was one glaring problem with the model: it did not consider the advantage of being the home team. In this post I will show how to fix this! I will also show a way to deal with the fact that the dataset covers many La Liga seasons and that teams might differ in skill between seasons.

Modeling Match Results: Iteration 2

The only change to the model needed to account for the home advantage is to split the baseline into two components, a home baseline and an away baseline. The following JAGS model implements this change by splitting baseline into home_baseline and away_baseline.

# model 2
m2_string <- "model {
# Likelihood: the goal counts are Poisson distributed.
for(i in 1:n_games) {
  HomeGoals[i] ~ dpois(lambda_home[HomeTeam[i],AwayTeam[i]])
  AwayGoals[i] ~ dpois(lambda_away[HomeTeam[i],AwayTeam[i]])
}

# Separate home and away baselines plus the skill difference set the rates.
for(home_i in 1:n_teams) {
  for(away_i in 1:n_teams) {
    lambda_home[home_i, away_i] <- exp( home_baseline + skill[home_i] - skill[away_i])
    lambda_away[home_i, away_i] <- exp( away_baseline + skill[away_i] - skill[home_i])
  }
}

# The first team's skill is anchored at zero; the rest get a hierarchical prior.
skill[1] <- 0
for(j in 2:n_teams) {
  skill[j] ~ dnorm(group_skill, group_tau)
}

group_skill ~ dnorm(0, 0.0625)
group_tau <- 1/pow(group_sigma, 2)
group_sigma ~ dunif(0, 3)

home_baseline ~ dnorm(0, 0.0625)
away_baseline ~ dnorm(0, 0.0625)
}"

Another cup of coffee while we run the MCMC simulation…

Read on →

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part one.

2013-07-21

Update: This series of posts is now also available as a technical report:
Bååth, R. (2015), Modeling Match Results in Soccer using a Hierarchical Bayesian Poisson Model. LUCS Minor, 18. (ISSN 1104-1609) (pdf)

This is a slightly modified version of my submission to the UseR 2013 Data Analysis Contest, which I had the fortune of winning :) The purpose of the contest was to do something interesting with a dataset consisting of the match results from the last five seasons of La Liga, the premier Spanish football (aka soccer) league. In total there were 1900 rows in the dataset, each with information regarding which were the home and away teams, what these teams scored and which season it was. I decided to develop a Bayesian model of the distribution of the end scores. Here we go…

Ok, first I should come clean and admit that I know nothing about football. Sure, I’ve watched Sweden lose to Germany in the World Cup a couple of times, but that’s it. Nevertheless, here I will show an attempt to model the goal outcomes in the La Liga data set provided as part of the UseR 2013 data analysis contest. My goal is not only to model the outcomes of matches in the data set but also to be able to (a) calculate the odds for possible goal outcomes of future matches and (b) produce a credible ranking of the teams. The model I will be developing is a Bayesian hierarchical model where the goal outcomes will be assumed to be distributed according to a Poisson distribution. I will focus more on showing all the cool things you can easily calculate in R when you have a fully specified Bayesian model and focus less on model comparison and trying to find the model with the highest predictive accuracy (even though I believe my model is pretty good). I really would like to see somebody try to do a similar analysis in SPSS (which most people use in my field, psychology). It would be a pain!

This post assumes some familiarity with Bayesian modeling and Markov chain Monte Carlo. If you’re not into Bayesian statistics you’re missing out on something really great, and a good way to get started is by reading the excellent Doing Bayesian Data Analysis by John Kruschke. The tools I will be using are R (of course), with the model implemented in JAGS and called from R using the rjags package. For plotting the results of the MCMC samples generated by JAGS I’ll use the coda package, the mcmcplots package, and the plotPost function courtesy of John Kruschke. For data manipulation I used the plyr and stringr packages and for general plotting I used ggplot2. This report was written in RStudio using knitr and xtable.

I start by loading libraries, reading in the data and preprocessing it for JAGS. The last 50 matches have unknown outcomes and I create a new data frame d holding only matches with known outcomes. I will come back to the unknown outcomes later when it is time to use my model for prediction.
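A sketch of what that preprocessing might look like is given below; the file name and column names are assumptions for illustration, not necessarily those of the contest dataset:

library(rjags)
library(coda)
library(mcmcplots)
library(stringr)
library(plyr)
library(ggplot2)

# Read the match results (file and column names are assumed here).
laliga <- read.csv("laliga.csv")
# Keep only the matches with known outcomes; the last 50 have NA scores.
d <- na.omit(laliga)
# JAGS wants the teams coded as integer indices rather than strings.
teams <- unique(c(d$HomeTeam, d$AwayTeam))
d$HomeTeam <- as.numeric(factor(d$HomeTeam, levels = teams))
d$AwayTeam <- as.numeric(factor(d$AwayTeam, levels = teams))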

Read on →

useR 2013 was a blast!

2013-07-14

I had a great time at useR 2013 in Albacete, Spain. The food was great, the people were fun and the weather was hot. A pleasant surprise was that I won the useR data analysis contest with my submission “Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model.” It was a fun exercise modeling football scores and I might write a post about it later; for now, you can download my original submission here if you’re interested.
Read on →

Three Ways to Run Bayesian Models in R

2013-06-23

There are different ways of specifying and running Bayesian models from within R. Here I will compare three different methods, two that rely on an external program and one that relies only on R. I won’t go into much detail about the differences in syntax; the idea is more to give a gist of how the different modeling languages look and feel. The model I will be implementing assumes a normal distribution with fairly wide priors:

$$y_i \sim \text{Normal}(\mu,\sigma)$$

$$\mu \sim \text{Normal}(0,100)$$

$$\sigma \sim \text{LogNormal}(0,4)$$

Let’s start by generating some normally distributed data to use as example data in the models.

set.seed(1337)
y <- rnorm(n = 20, mean = 10, sd = 5)
mean(y)
## [1] 12.72
sd(y)
## [1] 5.762

Method 1: JAGS

JAGS (Just Another Gibbs Sampler) is a program that accepts a model string written in an R-like syntax and that compiles it and generates MCMC samples from the model using Gibbs sampling. As in BUGS, the program that inspired JAGS, the exact sampling procedure is chosen by an expert system depending on how your model looks. JAGS is conveniently called from R using the rjags package, and John Myles White has written a nice introduction to using JAGS and rjags. I’ve seen it recommended to put the JAGS model specification into a separate file, but I find it more convenient to keep everything together in one .R file by using the textConnection function. Now, the following R code specifies and runs the model above for 20000 MCMC samples:

library(rjags)

# The model specification
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 0.0001)
  sigma ~ dlnorm(0, 0.0625)
  tau <- 1 / pow(sigma, 2)
}"

# Running the model
model <- jags.model(textConnection(model_string), data = list(y = y), n.chains = 3, n.adapt = 10000)
update(model, 10000) # Burn-in for 10000 samples
mcmc_samples <- coda.samples(model, variable.names = c("mu", "sigma"), n.iter = 20000)
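Once the sampling is done, the resulting mcmc.list can be summarized and inspected using coda’s standard generics:

# Posterior summaries and trace/density plots for mu and sigma.
summary(mcmc_samples)
plot(mcmc_samples)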
Read on →

Bayesian estimation supersedes the t-test - Online

2013-06-20

The t-test is probably the most used and abused statistical test in research [Citation Needed]. The t-test is used when you have two groups of measures and you want to look at the difference between the groups. The original t-test assumes that the measures are approximately normally distributed and that the two groups have equal variance. While this last assumption is relaxed by Welch’s t-test, a large problem remains: The t-test isn’t Bayesian :)
Read on →

Bayesian Modeling of Anscombe’s Quartet

2013-06-20

Anscombe’s quartet is a collection of four datasets that look radically different yet result in the same regression line when using ordinary least squares regression. The graph below shows Anscombe’s quartet with superimposed regression lines (taken from the Wikipedia article).

Anscombe’s Quartet

While least squares regression is a good choice for dataset 1 (upper left plot), it fails to capture the shape of the other three datasets. In a recent post John Kruschke shows how to implement a Bayesian model in JAGS that captures the shape of both datasets 1 and 3 (lower left plot). Here I will expand that model to capture the shape of all four datasets. If that sounds interesting, start out by reading Kruschke’s post and I will continue where that post ended…

Ok, using a wide-tailed t distribution it was possible to down-weight the influence of the outlier in dataset 3 while still capturing the shape of dataset 1. It still fails to capture datasets 2 and 4, however. Looking at dataset 2 (upper right plot) it is clear that we would like to model this as a quadratic curve. What we would like is for the model to include a quadratic term when the data supports it (as dataset 2 does) and to refrain from including a quadratic term when the data supports a linear trend (as in datasets 1 and 3). A solution is to include a quadratic term (b2) with a spike-and-slab prior distribution, which is a mixture of two distributions: one thin (spiky) distribution and one wide (slab) distribution. By centering the spike over zero we introduce a bit of automatic model selection into our model. That is, if there is evidence for a quadratic trend in the data then the slab will allow this trend to be included in the model; however, if there is little evidence for a quadratic trend then the spike will make the estimate of b2 practically equivalent to zero. In JAGS such a prior can be implemented as follows:

# b2_pick switches between the spike (index 1) and the slab (index 2).
b2 <- b2_prior[b2_pick + 1]
b2_prior[1] ~ dnorm(0, 99999999999999) # spike: precision so high that b2 is effectively zero
b2_prior[2] ~ dnorm(0, 0.1) # slab: wide enough to let a quadratic trend through
b2_pick ~ dbern(0.5)
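As a side note, after running a model with this prior, the posterior probability of a quadratic trend could be estimated as the proportion of MCMC samples in which the slab was picked. A minimal sketch, with samps an assumed matrix of MCMC samples:

# samps is assumed to be a matrix of MCMC samples with a b2_pick column.
mean(samps[, "b2_pick"]) # posterior probability of a quadratic trend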

As for the prior itself, the argument to the dbern function indicates our prior probability that the data include a quadratic term; in this case we think the odds are 50/50. The resulting prior looks like this:

Read on →

ChildFreq: a Tool to Explore Word Frequencies in Child Language.

2013-06-20

Have you ever wondered if children prefer bananas over candy or when their fascination for dinosaurs kicks in? These are the kinds of questions you can get answered on my new webpage ChildFreq. Using a huge child language database, ChildFreq shows you what words children use at what age. Let’s look at some queries, starting with banana vs. candy. Seems like Banana starts out as the leader, but then Candy gains speed, passes Banana at around 30 months and finishes as the winner by a good margin. Go Candy!
Read on →