Publishable Stuff

Rasmus Bååth's Blog


Hello stranger, and welcome! 👋😊
I'm Rasmus Bååth, data scientist, engineering manager, father, husband, tinkerer, tweaker, coffee brewer, tea steeper, and, occasionally, publisher of stuff I find interesting down below👇


Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part two.

2013-07-28

In the last blog post I showed my initial attempt at modeling football results in La Liga using a Bayesian Poisson model, but there was one glaring problem with the model: it did not consider the advantage of being the home team. In this post I will show how to fix this! I will also show a way to deal with the fact that the dataset covers many La Liga seasons and that teams might differ in skill between seasons.

Modeling Match Results: Iteration 2

The only change to the model needed to account for the home advantage is to split the baseline into two components, a home baseline and an away baseline. The following JAGS model implements this change by splitting baseline into home_baseline and away_baseline.

# model 2
m2_string <- "model {
# The number of goals scored by the home and the away team in each game
# is Poisson distributed with rates that depend on the two teams playing.
for(i in 1:n_games) {
  HomeGoals[i] ~ dpois(lambda_home[HomeTeam[i],AwayTeam[i]])
  AwayGoals[i] ~ dpois(lambda_away[HomeTeam[i],AwayTeam[i]])
}

# The baseline is split into a home and an away component, which is what
# gives the home team its advantage.
for(home_i in 1:n_teams) {
  for(away_i in 1:n_teams) {
    lambda_home[home_i, away_i] <- exp( home_baseline + skill[home_i] - skill[away_i])
    lambda_away[home_i, away_i] <- exp( away_baseline + skill[away_i] - skill[home_i])
  }
}

# The skill of the first team is fixed at zero as a reference point; the
# skills of the remaining teams are estimated relative to it.
skill[1] <- 0
for(j in 2:n_teams) {
  skill[j] ~ dnorm(group_skill, group_tau)
}

group_skill ~ dnorm(0, 0.0625)
group_tau <- 1/pow(group_sigma, 2)
group_sigma ~ dunif(0, 3)

home_baseline ~ dnorm(0, 0.0625)
away_baseline ~ dnorm(0, 0.0625)
}"

Another cup of coffee while we run the MCMC simulation…

Read on →

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part one.

2013-07-21

Update: This series of posts is now also available as a technical report:
Bååth, R. (2015), Modeling Match Results in Soccer using a Hierarchical Bayesian Poisson Model. LUCS Minor, 18. (ISSN 1104-1609) (pdf)

This is a slightly modified version of my submission to the UseR 2013 Data Analysis Contest, which I had the fortune of winning :) The purpose of the contest was to do something interesting with a dataset consisting of the match results from the last five seasons of La Liga, the premier Spanish football (aka soccer) league. In total there were 1900 rows in the dataset, each with information regarding which were the home and away teams, what these teams scored, and what season it was. I decided to develop a Bayesian model of the distribution of the end scores. Here we go…

Ok, first I should come clean and admit that I know nothing about football. Sure, I’ve watched Sweden lose to Germany in the World Cup a couple of times, but that’s it. Nevertheless, here I will show an attempt to model the goal outcomes in the La Liga data set provided as part of the UseR 2013 data analysis contest. My goal is not only to model the outcomes of matches in the data set but also to be able to (a) calculate the odds for possible goal outcomes of future matches and (b) produce a credible ranking of the teams. The model I will be developing is a Bayesian hierarchical model where the goal outcomes will be assumed to be distributed according to a Poisson distribution. I will focus more on showing all the cool things you can easily calculate in R when you have a fully specified Bayesian model and less on model comparison and trying to find the model with the highest predictive accuracy (even though I believe my model is pretty good). I really would like to see somebody try to do a similar analysis in SPSS (which most people use in my field, psychology). It would be a pain!

This post assumes some familiarity with Bayesian modeling and Markov chain Monte Carlo. If you’re not into Bayesian statistics you’re missing out on something really great, and a good way to get started is by reading the excellent Doing Bayesian Data Analysis by John Kruschke. The tools I will be using are R (of course) with the model implemented in JAGS, called from R using the rjags package. For plotting the results of the MCMC samples generated by JAGS I’ll use the coda package, the mcmcplots package, and the plotPost function courtesy of John Kruschke. For data manipulation I used the plyr and stringr packages, and for general plotting I used ggplot2. This report was written in Rstudio using knitr and xtable.

I start by loading libraries, reading in the data and preprocessing it for JAGS. The last 50 matches have unknown outcomes and I create a new data frame d holding only matches with known outcomes. I will come back to the unknown outcomes later when it is time to use my model for prediction.
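
The full preprocessing code isn't included in this excerpt, but a rough sketch of what it might look like is below; the data frame name laliga, its column names, and the use of NA scores to mark the 50 unplayed matches are my assumptions here, not necessarily the names used in the original script:

library(rjags)
library(coda)
library(mcmcplots)
library(plyr)
library(stringr)
library(ggplot2)

# Assumed: laliga is the contest data frame and the 50 unplayed matches
# have NA in the score columns.
d <- na.omit(laliga)
teams <- unique(c(d$HomeTeam, d$AwayTeam))

# Recode teams as integer indices so they can be used inside the JAGS model.
data_list <- list(
  HomeGoals = d$HomeGoals, AwayGoals = d$AwayGoals,
  HomeTeam = as.numeric(factor(d$HomeTeam, levels = teams)),
  AwayTeam = as.numeric(factor(d$AwayTeam, levels = teams)),
  n_teams = length(teams), n_games = nrow(d))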

Read on →

useR 2013 was a blast!

2013-07-14

I had a great time at useR 2013 in Albacete, Spain. The food was great, the people were fun and the weather was hot. A pleasant surprise was that I won the useR data analysis contest with my submission “Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model.” It was a fun exercise modeling football scores and I might write a post about it later, for now you can download my original submission here if you’re interested.
Read on →

Three Ways to Run Bayesian Models in R

2013-06-23

There are different ways of specifying and running Bayesian models from within R. Here I will compare three different methods: two that rely on an external program and one that relies only on R. I won’t go into much detail about the differences in syntax; the idea is more to give a sense of how the different modeling languages look and feel. The model I will be implementing assumes a normal distribution with fairly wide priors:

$$y_i \sim \text{Normal}(\mu,\sigma)$$

$$\mu \sim \text{Normal}(0,100)$$

$$\sigma \sim \text{LogNormal}(0,4)$$

Let’s start by generating some normally distributed data to use as example data in the models.

set.seed(1337)
y <- rnorm(n = 20, mean = 10, sd = 5)
mean(y)
## [1] 12.72
sd(y)
## [1] 5.762

Method 1: JAGS

JAGS (Just Another Gibbs Sampler) is a program that accepts a model string written in an R-like syntax and that compiles and generates MCMC samples from this model using Gibbs sampling. As in BUGS, the program that inspired JAGS, the exact sampling procedure is chosen by an expert system depending on how your model looks. JAGS is conveniently called from R using the rjags package, and John Myles White has written a nice introduction to using JAGS and rjags. I’ve seen it recommended to put the JAGS model specification into a separate file, but I find it more convenient to keep everything together in one .R file by using the textConnection function, which avoids having to save the model string separately. The following R code specifies the model above and generates 20000 MCMC samples from it:

library(rjags)

# The model specification
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, tau)
  }
  # JAGS parameterizes dnorm and dlnorm with a precision (1/variance), so
  # these lines correspond to the Normal(0, 100) and LogNormal(0, 4) priors above.
  mu ~ dnorm(0, 0.0001)
  sigma ~ dlnorm(0, 0.0625)
  tau <- 1 / pow(sigma, 2)
}"

# Running the model
model <- jags.model(textConnection(model_string), data = list(y = y),
                    n.chains = 3, n.adapt = 10000)
update(model, 10000) # Burn-in for 10000 samples
mcmc_samples <- coda.samples(model, variable.names = c("mu", "sigma"), n.iter = 20000)
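
The resulting mcmc_samples object is a standard coda mcmc.list, so the usual coda tools apply; for example (my addition, not part of the original excerpt):

# Posterior summaries, traceplots and density plots for mu and sigma.
summary(mcmc_samples)
plot(mcmc_samples)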
Read on →

Bayesian estimation supersedes the t-test - Online

2013-06-20

The t-test is probably the most used and abused statistical test in research [Citation Needed]. The t-test is used when you have two groups of measurements and you want to look at the difference between the groups. The original t-test assumes that the measurements are approximately normally distributed and that the two groups have equal variance. While this last assumption is relaxed by Welch’s t-test, a larger problem remains: the t-test isn’t Bayesian :)
Read on →

Bayesian Modeling of Anscombe’s Quartet

2013-06-20

Anscombe’s quartet is a collection of four datasets that look radically different yet result in the same regression line when using ordinary least squares regression. The graph below shows Anscombe’s quartet with imposed regression lines (taken from the Wikipedia article).

Anscombe’s Quartet

While least squares regression is a good choice for dataset 1 (upper left plot), it fails to capture the shape of the other three datasets. In a recent post John Kruschke shows how to implement a Bayesian model in JAGS that captures the shape of both datasets 1 and 3 (lower left plot). Here I will expand that model to capture the shape of all four datasets. If that sounds interesting, start out by reading Kruschke’s post and I will continue where that post ended…

Ok, using a wide-tailed t distribution it was possible to down-weight the influence of the outlier in dataset 3 while still capturing the shape of dataset 1. The model still fails to capture datasets 2 and 4, however. Looking at dataset 2 (upper right plot) it is clear that we would like to model it as a quadratic curve, and what we would like is for the model to include a quadratic term when the data supports it (as dataset 2 does) and to refrain from including a quadratic term when the data supports a linear trend (as in datasets 1 and 3). A solution is to give the quadratic term (b2) a spike-and-slab prior distribution, a mixture between two distributions: one thin (spiky) distribution and one wide (slab) distribution. By centering the spike over zero we introduce a bit of automatic model selection into our model. That is, if there is evidence for a quadratic trend in the data, the slab will allow this trend to be included in the model; if there is little evidence for a quadratic trend, the spike will make the estimate of b2 practically equivalent to zero. In JAGS such a prior can be implemented as follows:

b2 <- b2_prior[b2_pick + 1]
b2_prior[1] ~ dnorm(0, 99999999999999) # spike
b2_prior[2] ~ dnorm(0, 0.1) # slab
b2_pick ~ dbern(0.5)

The argument to the dbern function indicates our prior belief that the data includes a quadratic term; in this case we think the odds are 50/50. The resulting prior looks like this:
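
To get a feel for what this mixture implies, here is a small R sketch (my own addition, not from the original post) that simulates draws from the spike-and-slab prior; since JAGS specifies dnorm with a precision, the precisions above are converted to standard deviations:

# Simulate draws from the spike-and-slab prior on b2.
n_draws <- 100000
pick <- rbinom(n_draws, size = 1, prob = 0.5) # dbern(0.5): spike (0) or slab (1)?
b2 <- ifelse(pick == 0,
             rnorm(n_draws, 0, 1 / sqrt(99999999999999)), # spike: essentially zero
             rnorm(n_draws, 0, 1 / sqrt(0.1)))            # slab: wide
hist(b2, breaks = 100, main = "Spike-and-slab prior on b2")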

Read on →

ChildFreq: a Tool to Explore Word Frequencies in Child Language.

2013-06-20

Have you ever wondered whether children prefer bananas over candy, or when their fascination with dinosaurs kicks in? These are the kinds of questions you can get answered on my new webpage ChildFreq. Using a huge child language database, ChildFreq shows you what words children use at what age. Let’s look at some queries, starting with banana vs. candy. Seems like Banana starts out as the leader, but then Candy gains speed, passes Banana at around 30 months and finishes as the winner with a good margin. Go Candy!
Read on →

Eye Tapping

2013-06-20

A short paper I presented at the International Conference on New Interfaces for Musical Expression (NIME), 30 May – 1 June 2011, Oslo, Norway. This paper was heavily inspired by Hornof, A., & Vessey, K. (2011). Abstract: The aim of this study was to investigate how well subjects beat out a rhythm using eye movements and to establish the most accurate method of doing this. Eighteen subjects participated in an experiment where five different methods were evaluated.
Read on →

Robert MacDougall’s “The Structure of Simple Rhythm forms”, 1903.

2013-06-20

After much googling I finally found a copy of Robert MacDougall’s “The Structure of Simple Rhythm forms” from 1903. It was hidden in Harvard Psychological Studies, Volume 1, now freely available from Project Gutenberg (https://www.gutenberg.org/ebooks/16266). Since it seems the book is now in the public domain, I took the liberty of converting “The Structure of Simple Rhythm forms” into PDF format and posting it here so that it might be more easily found in the future.
Read on →

Subjective Difficulty of Tapping to a Slow Beat

2013-06-20

A short paper I presented at the 12th International Conference on Music Perception and Cognition, Thessaloniki, Greece. A great conference, by the way, except for the heat... Abstract: The current study investigates the slower limit of rhythm perception and participants’ subjective difficulty when tapping to a slow beat. Thirty participants were asked to tap to metronome beats ranging in tempo from 600 ms to 3000 ms between each beat. After each tapping trial the participants rated the difficulty of keeping the beat on a seven-point scale ranging from “very easy” to “very difficult”.
Read on →