Publishable Stuff

Rasmus Bååth's Blog


Tidbits from the Books that Defined S (and R)

2014-11-05

Why R? Because S!

R is the open source implementation (and a pun!) of S, a language for statistical computing that was developed at Bell Labs in the late 1970s. After that, the implementation of S underwent a number of major revisions documented in a series of seminal books, often just referred to by the color of their cover: The Brown Book, the Blue Book, the White Book and the Green Book. To satisfy my techno-historical lusts I recently acquired all these books and I thought I would share some tidbits from them, highlighting how S (and thus R) developed into what we today love and cherish. But first, here are the books in chronological order from left to right:

Most of these are out of print, but all can be bought second hand on, for example, Amazon (which is where I got them and where the links above lead).

S: An Interactive Environment for Data Analysis and Graphics (1984) A.K.A. the Brown Book

by Richard A. Becker and John M. Chambers

This book from 1984 describes not the first version of S, but the second (S2) according to the versioning used here by Chambers. It describes a language that is very similar to modern R (but also very different). We recognize friends like c

… and plot:

But note that plot was only for scatter plots and was not a generic function producing different types of plots as in modern R. This is because S didn’t yet have objects and classes. S had, however, state-of-the-art graphing capabilities from the start, implementing the plot types described in Graphical Methods for Data Analysis (1983) (co-written by John M. Chambers, and which I’ve written about here). For example, the very useful pairs function was already there:
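For comparison, here is a minimal sketch of the modern R counterparts of c, plot and pairs. It is not a reproduction of the examples in the book: the data values are made up and the built-in iris data set is used as a stand-in.

```r
# Combine values into vectors with c() (made-up numbers).
x <- c(1, 2, 3, 4, 5)
y <- c(2.2, 4.1, 5.9, 8.3, 9.8)

# Draw to a null PDF device so the sketch also runs non-interactively;
# remove the pdf(NULL) and dev.off() lines to see the plots on screen.
pdf(NULL)
plot(x, y)            # a plain scatter plot
pairs(iris[, 1:4])    # scatter plot matrix of the numeric iris columns
invisible(dev.off())

# In modern R, plot() is a generic: its body just dispatches on class.
uses_dispatch <- any(grepl("UseMethod", deparse(body(plot))))
print(uses_dispatch)
```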

While many things were similar to modern R, not everything was. For one thing, you could not define your own functions! Instead you would have to rely on macros:

Here ?T in the macro is another macro producing a temporary variable name in order to not clash with any global variable name, crazy!

We also find the answer to why some of the peculiarities of modern R exist. Have you ever wondered why many function and parameter names in R are period.separated rather than underscore_separated? Well, because in S2 underscore was an alias for <-!
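As a small modern-day contrast, the sketch below (with a made-up variable name) shows that current R treats the underscore as an ordinary character in names, while the period-separated legacy is still visible in base function names:

```r
# In modern R the underscore is an ordinary name character,
# not an assignment operator as it was in S2.
payoff_total <- 10 + 5          # hypothetical variable name

# The period-separated style lives on in many base R function names.
legacy_names <- c("read.table", "data.frame", "is.numeric")
all_exist <- all(sapply(legacy_names, exists))

print(payoff_total)
print(all_exist)
```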

On to a surprise finding… RStudio are doing great things and for a while it has been possible to make slides using R Markdown in RStudio. Is this great? Sure! Is it new? Nope… :) Slide construction was already easy to do in S anno 1984 using the vu function. This function took a string written in a special markup language…

… and produced slides on the graphic device, such as this:

Unfortunately vu didn’t make it all the way to modern R.

I don’t want to brag, but I’m gonna do it anyway: I recently got my copy of the “Brown book” signed by John Chambers himself at the useR! 2014 conference! :D

S: An Interactive Environment for Data Analysis and Graphics on Amazon

Extending the S System (1985)

by Richard A. Becker and John M. Chambers

This book is not part of the color book canon, but I’ll include it for completeness anyway. Published the year after the Brown book, it describes how to implement new functions in S. However, as S only had support for macros, these functions would have to be written in another language (say FORTRAN) and then connected to S using a special interface language:

While not relevant to modern R, this interface language is the “ancestor” of modern day interfaces such as Rcpp and Rcpp11.

Extending The S System on Amazon

The New S Language: A Programming Environment for Data Analysis and Graphics (1988) A.K.A. the Blue Book

by Richard A. Becker, John M. Chambers and Allan R. Wilks

This book introduces S version three (S3) which was a major revision of S2. While S2 was primarily programmed in FORTRAN, S3 was mainly done in C. The interface language was now gone and instead C functions could be directly invoked from S functions. But what’s more, users could now easily define functions themselves!

Functions were also first-class citizens and could be passed around, thus enabling the modern apply-type functions:
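In modern R the same idea looks roughly like this. This is a minimal sketch with a made-up helper function, not the example from the book:

```r
# A user-defined function (hypothetical example).
square <- function(x) x^2

# Functions are values, so one can be passed as an argument to sapply()...
squares <- sapply(1:5, square)

# ...or an anonymous function can be passed directly to apply(),
# here summing each row of a 2 x 3 matrix.
row_sums <- apply(matrix(1:6, nrow = 2), 1, function(r) sum(r))

print(squares)
print(row_sums)
```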

Computation on the language was also now possible, for example, by using substitute. Some things were still different from modern day R, take a look at the following statement:

Why lottery.number and lottery.payoff instead of lottery$number and lottery$payoff? Because data.frames didn’t yet exist! (Though it would still have been possible to stick two vectors inside a list.)
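Both points can be sketched in modern R. The helper name and the numbers below are made up for illustration and are not taken from the book or from the original lottery data:

```r
# Computation on the language: substitute() captures the unevaluated
# expression that was passed as an argument.
show_expr <- function(x) deparse(substitute(x))   # hypothetical helper
captured <- show_expr(a + b * 2)

# Blue Book style: two free-standing, period-named vectors (made-up values).
lottery.number <- c(810, 156, 140)
lottery.payoff <- c(190.0, 120.5, 285.5)

# Modern style: the same columns collected in one data frame.
lottery <- data.frame(number = lottery.number, payoff = lottery.payoff)

print(captured)
print(mean(lottery$payoff))
```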

The New S Language: A Programming Environment for Data Analysis and Graphics on Amazon

Statistical Models in S (1992) A.K.A. the White Book

edited by John M. Chambers and Trevor J. Hastie

This book “completes” the specification of S3 with three biggies: (1) data frames, (2) formulas…

… and (3) object orientation:
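All three survive more or less unchanged in modern R. Here is a minimal sketch, where the data values and the describe generic are invented for illustration:

```r
# (1) A data frame (made-up values).
d <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# (2) A formula is an unevaluated description of a model.
f <- y ~ x
formula_class <- class(f)
formula_vars <- all.vars(f)

# (3) S3-style object orientation: a generic dispatches on the class
# attribute of its first argument.
describe <- function(obj, ...) UseMethod("describe")   # hypothetical generic
describe.default <- function(obj, ...) "some object"
describe.data.frame <- function(obj, ...) {
  paste("data frame with", nrow(obj), "rows")
}

print(formula_class)
print(formula_vars)
print(describe(d))
print(describe(42))
```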

While the earlier books are more focused on graphics and programming, this book is all about statistical models (the title of the book might be a hint). Here we get introduced to workhorses like glm, gam, nls, tree and, not to forget, lm:
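These functions are called in much the same way today. A minimal sketch, assuming the cars and mtcars data sets that ship with modern R (they are stand-ins here, not the data used in the book):

```r
# Linear model: stopping distance as a function of speed.
fit_lm <- lm(dist ~ speed, data = cars)

# Generalized linear model: logistic regression of transmission type on weight.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

print(coef(fit_lm))
print(class(fit_glm))   # a glm object also inherits from "lm"
```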

There is, however, no mention of the classical *.test functions such as t.test, binom.test and cor.test (Does anybody know when they appeared in S/R?). The focus is also more on prediction and estimation rather than testing, for example, p-values were not reported as part of summary.lm (which they are in modern R):
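In modern R the p-values show up as the last column of the coefficient table. A minimal sketch, assuming the built-in cars data set as a stand-in example:

```r
# Fit a simple linear model on a data set that ships with R.
fit <- lm(dist ~ speed, data = cars)

# The coefficient table from summary.lm is a plain numeric matrix.
coef_table <- coef(summary(fit))

# Its columns are the estimate, standard error, t value and p-value.
print(colnames(coef_table))
print(coef_table[, "Pr(>|t|)"])
```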

Other things that are new are ?, which can now be used to look up help pages, and that there is a new datatype called factor. And already from the start read.table converted all strings to factors by default. :) All in all, this book was interesting to read and is still, I believe, a very good introduction to the formula interface and the lm/glm/gam type functions.
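The string-to-factor conversion can be sketched as follows. Note that since R 4.0.0 the default of stringsAsFactors is FALSE, so the conversion is requested explicitly here. The inline data is made up.

```r
# A tiny whitespace-separated table given inline (made-up data).
raw <- "name score\nanna 1.5\nbo 2.5"

# Old default behaviour: strings become factors.
as_factor <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE)

# Behaviour with the conversion switched off: strings stay character vectors.
as_string <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

print(class(as_factor$name))
print(levels(as_factor$name))
print(class(as_string$name))
```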

Statistical Models in S on Amazon

Programming with Data: A Guide to the S Language (1998) A.K.A. the Green Book

by John M. Chambers

This book describes S version four and focuses almost exclusively on programming and not so much on stats and graphics. A big change from S3 was the introduction of a new, more formal, system for object oriented programming:
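This formal system lives on in R as the S4 classes of the methods package. Here is a minimal sketch, where the Account class and the deposit generic are invented for illustration and are not taken from the book:

```r
library(methods)

# A formal class with typed slots (hypothetical example class).
setClass("Account", representation(owner = "character", balance = "numeric"))

# A formal generic and a method registered for the class.
setGeneric("deposit", function(account, amount) standardGeneric("deposit"))
setMethod("deposit", "Account", function(account, amount) {
  account@balance <- account@balance + amount
  account
})

acc <- new("Account", owner = "Ann", balance = 100)
acc <- deposit(acc, 50)
print(acc@balance)
```

Unlike the S3 system, slots are declared up front, so assigning a value of the wrong type to a slot is an error rather than something that passes silently.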

Other than that there weren’t any eye-catching differences from S version 3. One small thing to note is that = could now be used for assignment instead of <- and is actually used consistently throughout the book:

Programming with Data: A Guide to the S Language on Amazon

That was all I had. If you are further interested in the history of S and R, I also recommend A brief history of S (Becker, 1994), Stages in the Evolution of S (Chambers, 2000) and R: Past and future history (Ihaka, 1998).

All images and quotes included in this review are copyrighted by their respective copyright holders; however, I believe that the inclusion of these quotes and images in this review constitutes fair use.

Posted by Rasmus Bååth | 2014-11-05 | Tags: Statistics, R