#Answers to Bayesian Computation with Stan and Farmer Jöns" *Rasmus Bååth* One thing to note is that the code changes you have to make between questions often are *minimal*. Yet we go from running a simple binomial model to running a pretty advanced linear model. All answers below use wide "sloppy" uniform priors, and these could certainly be shaped up and be made more informative. ## Question I Not much to do here, other than to run it. Here is the graph you would see if everything is working properly. ```{r message=FALSE, echo=FALSE, results='hide'} library(rstan) # The Stan model as a string. model_string <- " data { # Number of data points int n1; int n2; # Number of successes int y1[n1]; int y2[n2]; } parameters { real theta1; real theta2; } model { theta1 ~ beta(1, 1); theta2 ~ beta(1, 1); y1 ~ bernoulli(theta1); y2 ~ bernoulli(theta2); } generated quantities { } " y1 <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0) y2 <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0) data_list <- list(y1 = y1, y2 = y2, n1 = length(y1), n2 = length(y2)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` ## Question 2 ```{r} s <- as.data.frame(stan_samples) mean(abs(s\$theta2 - s\$theta1) < 0.2) ``` ## Question 3 ```{r message=FALSE, results='hide'} cowA <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0) cowB <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0) # Using the same model as in Question 1, just using the new data. data_list <- list(y1 = cowA, y2 = cowB, n1 = length(cowA), n2 = length(cowB)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` Calculate the probability that medicine A is better than medicine B. ```{r} s <- as.data.frame(stan_samples) mean(s\$theta1 > s\$theta2) mean(s[,"theta1"] > s[,"theta2"]) ``` So should probably go with medicine B then... ## Question 4 ```{r message=FALSE, results='hide'} # The Stan model as a string. model_string <- " data { int n1; int n2; vector[n1] y1; vector[n2] y2; } parameters { real mu1; real mu2; real sigma1; real sigma2; } model { mu1 ~ uniform(0, 2000); mu2 ~ uniform(0, 2000); sigma1 ~ uniform(0, 1000); sigma2 ~ uniform(0, 1000); y1 ~ normal(mu1, sigma1); y2 ~ normal(mu2, sigma2); } generated quantities { } " diet_milk <- c(651, 679, 374, 601, 401, 609, 767, 709, 704, 679) normal_milk <- c(798, 1139, 529, 609, 553, 743, 151, 544, 488, 555, 257, 692, 678, 675, 538) data_list <- list(y1 = diet_milk, y2 = normal_milk, n1 = length(diet_milk), n2 = length(normal_milk)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` Is it likely that the diet is going to make the cows produce more milk on average? ```{r} s <- as.data.frame(stan_samples) mu_diff <- s\$mu2 - s\$mu1 hist(mu_diff) mean(mu_diff > 0) mean(mu_diff < 0) ``` It is almost as likely that the diet is better as that the diet is worse. So this experiment does not really support that the diet will result in the cows producing more milk . ## Question 5 ```{r message=FALSE, results='hide'} # The Stan model as a string. model_string <- " data { int n1; int n2; vector[n1] y1; vector[n2] y2; } parameters { real mu1; real mu2; real sigma1; real sigma2; } model { mu1 ~ uniform(0, 2000); mu2 ~ uniform(0, 2000); sigma1 ~ uniform(0, 1000); sigma2 ~ uniform(0, 1000); y1 ~ student_t(3, mu1, sigma1); y2 ~ student_t(3, mu2, sigma2); } generated quantities { } " diet_milk <- c(651, 679, 374, 601, 4000, 401, 609, 767, 3890, 704, 679) normal_milk <- c(798, 1139, 529, 609, 553, 743, 3,151, 544, 488, 15, 257, 692, 678, 675, 538) data_list <- list(y1 = diet_milk, y2 = normal_milk, n1 = length(diet_milk), n2 = length(normal_milk)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` Is it likely that diet is going to make the cows produce more milk on average? ```{r} s <- as.data.frame(stan_samples) mu_diff <- s\$mu2 - s\$mu1 hist(mu_diff) mean(mu_diff > 0) mean(mu_diff < 0) ``` Again there is no strong evidence that the diet is any good (but compare with the result, would you have used the original Normal model!). ## Question 6 ```{r message=FALSE, results='hide'} # The Stan model as a string. model_string <- " data { int n1; int n2; int y1[n1]; int y2[n2]; } parameters { real lambda1; real lambda2; } model { lambda1 ~ uniform(0, 100); lambda2 ~ uniform(0, 100); y1 ~ poisson(lambda1); y2 ~ poisson(lambda2); } generated quantities { } " diet_eggs <- c(6, 4, 2, 3, 4, 3, 0, 4, 0, 6, 3) normal_eggs <- c(4, 2, 1, 1, 2, 1, 2, 1, 3, 2, 1) data_list <- list(y1 = diet_eggs, y2 = normal_eggs, n1 = length(diet_eggs), n2 = length(normal_eggs)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` Is it likely that diet going to make the chickens produce more eggs on average? ```{r} s <- as.data.frame(stan_samples) lambda_diff <- s\$lambda1 - s\$lambda2 hist(lambda_diff) mean(lambda_diff > 0) ``` There is pretty good evidence that the diet is effective and that chickens on the diet produce more eggs on average (that is, lambda1 seems higher than lambda2). Looking at `lambda_diff` a "best guess" is that the diet results in around 1-2 more eggs on average. ## Question 7 This implements the same model as in question 4, but using smarter indexing so that the code is not as redundant and so that it works with the format of the data in the data.frame `d` . ```{r message=FALSE, results='hide'} # The Stan model as a string. model_string <- " data { int n; int n_groups; int x[n]; vector[n] y; } parameters { vector[n_groups] mu; vector[n_groups] sigma; } model { mu ~ uniform(0, 2000); sigma ~ uniform(0, 1000); y ~ normal(mu[x], sigma[x]); } generated quantities { } " d <- data.frame( milk = c(651, 679, 374, 601, 401, 609, 767, 709, 704, 679, 798, 1139, 529, 609, 553, 743, 151, 544, 488, 555, 257, 692, 678, 675, 538), group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)) data_list <- list(y = d\$milk, x = d\$group, n = length(d\$milk), n_groups = max(d\$group)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) ``` This should give you the same result as in question 4. ## Question 8 Amazingly we don't have to change the model at all from question 7, we can just rerun it with the new data. That is, if we were smart with how we defined the priors and instead of writing: ``` mu[1] ~ uniform(0, 2000); mu[2] ~ uniform(0, 2000); ``` simply wrote `mu ~ uniform(0, 2000);`. ```{r, message=FALSE, warning=FALSE, results='hide'} d <- data.frame( milk = c(651, 679, 374, 601, 401, 609, 767, 709, 704, 679, 798, 1139, 529, 609, 553, 743, 151, 544, 488, 555, 257, 692, 678, 675, 538, 1061, 721, 595, 784, 877, 562, 800, 684, 741, 516), group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)) data_list <- list(y = d\$milk, x = d\$group, n = length(d\$milk), n_groups = max(d\$group)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) # Now comparing the tree different groups. s <- as.data.frame(stan_samples) hist(s[,"mu[3]"] - s[,"mu[1]"] ) hist(s[,"mu[3]"] - s[,"mu[2]"] ) mean(s[,"mu[3]"] - s[,"mu[1]"] > 0) mean(s[,"mu[3]"] - s[,"mu[2]"] > 0) ``` So it is pretty likely that diet 2 (`mu[3]`) is better than both diet 1 (`mu[2]`) and using no special diet (`mu[1]`). ## Question 9 So, let's change the model from question 7 into a regression model! ```{r message=FALSE, warning=FALSE, results='hide'} # The Stan model as a string. model_string <- " data { int n; vector[n] x; vector[n] y; } parameters { real beta0; real beta1; real sigma; } model { vector[n] mu; beta0 ~ uniform(-1000, 1000); beta1 ~ uniform(-1000, 1000); sigma ~ uniform(0, 1000); mu = beta0 + beta1 * x; y ~ normal(mu, sigma); } generated quantities { } " d <- data.frame(milk = c(685, 691, 476, 1151, 879, 725, 1190, 1107, 809, 539, 298, 805, 820, 498, 1026, 1217, 1177, 684, 1061, 834), hours = c(3, 7, 6, 10, 6, 5, 10, 11, 9, 3, 6, 6, 3, 5, 8, 11, 12, 9, 5, 5)) data_list <- list(y = d\$milk, x = d\$hours, n = length(d\$milk)) # Compiling and producing posterior samples from the model. stan_samples <- stan(model_code = model_string, data = data_list) ``` ```{r} # Plotting and summarizing the posterior distribution stan_samples plot(stan_samples) plot(d\$hours, d\$milk, xlim=c(0, 13), ylim = c(0, 1300)) # Adding a sample of the posterior draws to the plot in order to visualize the # uncertainty of the regression line. s <- as.data.frame(stan_samples) for(i in sample(nrow(s), size = 20)) { abline(s[i,"beta0"], s[i,"beta1"], col = "gray") } ``` It seems like there is good evidence that an increase in sunshine (or something that co-varies with sunshine perhaps...) results in an increase in milk production.