Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("https://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
While it doesn't work exactly like Norvig's version, it should result in similar spelling corrections:
correct("piese")
## [1] "piece"
correct("ov")
## [1] "of"
correct("cakke")
## [1] "cake"
So let's deobfuscate the two-liner slightly (the code below might not make much sense if you haven't read Norvig's essay first):
# Read in big.txt, a 6.5 MB collection of different English texts.
raw_text <- paste(readLines("https://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up, creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of occurrences of each distinct word.
word_count <- table(split_text)
# Sort the counts and create a vector of words ordered with the most common words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))
correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
edit_dist <- adist(word, sorted_words)
# Calculate the minimum edit distance to any word that exists in big.txt,
# capped at a limit of two edits.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- sorted_words[edit_dist <= min_edit_dist]
# In case proposals_by_prob is empty, we append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob[1]
}
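Just as a small convenience (not part of Norvig's approach, and the wrapper name below is made up for illustration), the function can be applied to a whole vector of misspelled words in one go:

# Hypothetical convenience wrapper: correct each word in a character vector.
correct_all <- function(words) vapply(words, correct, character(1))
correct_all(c("piese", "ov", "cakke"))
# Should give the same corrections as above: "piece", "of" and "cake".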
Some thoughts:
- The main reason why the R version is so short is that base R includes the adist function. (A one-line spell checker in R is indeed possible using the aspell function; a rough sketch follows after this list :)
- A second reason why the R version is so short is that the many vectorized functions in R make it possible to do a lot of work in one line.
- Indeed, the horrible line creating the sorted_words vector would be a perfect target for some magrittr magic (see the pipeline sketch after this list).
- The R version does not solve the problem in exactly the same way as Norvig's code. He maintains the count of each word in the NWORDS variable in order to be able to extract the most probable matching word. This is not necessary in the R code: since we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
- There are links to implementations in many other languages at the bottom of Norvig's essay. Looking at the Java version reminds me of my dark Java past and madness like HashMap<Integer, String> candidates = new HashMap<Integer, String>();.
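For the curious, here is a rough sketch of what an aspell-based corrector could look like. This is just an illustration, not the one-liner hinted at above: it assumes a spell-checking back end (aspell or hunspell) is installed on the system, and since utils::aspell checks files rather than single strings, the word is written to a temporary file first.

# Sketch of a corrector built on utils::aspell().
# Assumes an aspell/hunspell back end is installed on the system.
correct_aspell <- function(word) {
  tmp <- tempfile(fileext = ".txt")
  writeLines(word, tmp)           # aspell() works on files, not strings
  res <- utils::aspell(tmp)       # data frame of misspellings + suggestions
  if (nrow(res) == 0) {
    word                          # aspell found nothing wrong
  } else {
    res$Suggestions[[1]][1]       # take the first suggestion for the word
  }
}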
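And here is roughly what the line creating sorted_words could look like after some magrittr magic; again just a sketch, assuming the magrittr package is installed:

library(magrittr)
sorted_words <-
  "https://www.norvig.com/big.txt" %>%
  readLines() %>%                 # read the raw text
  paste(collapse = " ") %>%       # collapse it into one long string
  tolower() %>%                   # lowercase everything
  strsplit("[^a-z]+") %>%         # split it into word tokens
  table() %>%                     # count each word
  sort(decreasing = TRUE) %>%     # most common words first
  names()                         # keep only the words themselves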
Posted by Rasmus Bååth | 2014-12-17 | Tags: R