xkcd’s Mirrorboard: A Statistical Language Model Approach
Downloads
Sample bi-gram model - trained using History of the United States, Author: Charles A. Beard and Mary R. Beard (downloaded from Project Gutenberg) and the linux “words” on my system.
Backround
I was back in the midwest over the holidays and due to the temperature hovering near zero and it being the midwest, I had plenty of time to sit in coffee shops and catch up on all sorts of internet time killers. So, when browsing the xkcd blag I came across the idea of using just one hand on the keyboard, but still typing efficiently — sort of. I didn’t like the idea of having to potentially hit CapsLock for every letter, but I started playing with the “cat words | grep -i “^[qp][a’][ru][dk][ei][vn][gh]$”" type command to lookup the list of possible valid words given the ambiguous input as described in Notes: (4). The challenge now is to find the best word or words for the input, which reminded me of some systems I had seen before.
I worked on computational linguistics research when I was in grad school, and one of the topics I became interested in is language model-backed input methods such as Dasher, which allows you to input characters by steering with a mouse through approaching columns of characters. The cool part is that the size of each character is proportional to the probability of it coming next, as computed from the stats making up the language model. This makes it easier to steer because you usually don’t have to move the mouse very much. Now, other so-called “intelligent” input methods are popping up all over the place — especially with so many people using cell phones for email and web browsing, which have limited input abilities.
So, in a highly caffeinated state, I hacked together a proof-of-concept “intelligent guess” program that uses a corpus of text to train a word-level bigram language model. That model is used to output the most likely string of valid words given ‘mirrorboard encoded’ input (tges es a test = this is a test, wwrd wf tge dat = word of the day).
Use
The program is composed of three java classes, TrainBigram, MirrorBoardEncode and OHK. TrainBigram’s main method expects one command line argument - the corpus file. The format of the corpus file is any plain text document, with an optional word list at the end that will add words to the model, but will not be counted in the bigram stats; this allows for more complete coverage of any words not in the corpus. Anything you want to be part of the word list needs to follow a line containing only “<<**WORDLIST**>>”.
I found an easy way to build a corpus was to go to project gutenberg’s site and download a few books in plain text that used the type of writing I wanted to model. Then just cat them together:
cat *.txt > corpus
then add the standard linux word list to the end for a more complete vocabulary:
echo "<<**WORDLIST**>>” >> corpus
cat /usr/share/dict/words >> corpus
After that you are ready to train the bigram model. I find that with a few megabytes worth of text (0.5 - 1 million words) training the model takes more than the default amount of memory available to the JVM, so include “-Xmx256m” to use up to 256m of memory:
java -Xmx256m TrainBigram corpus
this will output a trained bigram file called “corpus.bg”. At this point we are ready to run the main program to decode mirrorboard input. The class OHK expects to have a trained bigram named “ohk.bg” in the same folder as it. Simply rename or link the bigram file output by TrainBigram:
mv corpus.bg ohk.bg
or:
ln -s corpus.bg ohk.bg
As an option for generating phrases to test with, MirrorBoardEncode takes normal input and outputs mirrorboard encoded text:
java MirrorBoardEncode
type a sample sentence here
Now, run OHK:
java -Xmx256m OHK
Input a line of mirrorboard encoded text and press enter. Hopefully, if your training corpus was large and relevant enough, you’ll see what you were trying to type in.
Theoretic Foundation
As far as language models go, this is a pretty homebrew design, but probably still qualifies as a Markov chain. As the training corpus is processed, words encountered for the first time are added to a hashtable with a value of 1.0, on subsequent encounters with the same word, the value is incremented by one. In addition to words being added, bigrams are formed by joining two consecutive words together with a separating character and added to the hashtable in the same manner.
After the corpus has been processed probabilities are computed based on word counts. For individual words, I store (frequency of word)/(total word count) as its probability; this helps me cheat by not smoothing the model. For bigrams, the conditional probability is used; p(word2 | word1) = probability of encountering word2 given word1 = (frequency of word1 and word2)/(frequency of word1). As a way to get by without any kind of smoothing for bigrams with a count of zero, I use the (probability of word1) * (probability of word2) in place of the bigram probability. I can imagine this makes statisticians furious, but to me it is an estimate of the bigram probability because it is literally the probability of picking those two particular words out of the training corpus, but unfortunately doesn’t take into consideration any information about word order.
Implementation details
The model is simply a java.util.Hashtable object, with Strings for keys and Doubles for values. The main program is composed of two main functions, getLegalWords() and getBestSentence().
getLegalWords() does the same thing as the sample grep command from xkcd, it returns all legal words given the input. For performance reasons this was implemented as a recursive function that looks up only the possible combinations of letters corresponding to the input, therefore doesn’t search the entire word list. The caveat with this is that it runs in exponential time, which is usually much faster than an iterative approach, but if 2 ^ (input length) is greater than the number of words in the word list, the iterative approach is faster. For a typical corpus the max length will be about 16.5 characters. For this reason there is also an iterative implementation of getLegalWords().
getBestSentence() is implemented as a recursive function as well, more for ease of programming than performance, though. getBestSentence() builds every possible sentence given the list of possible alternative words. It scores each sentence by multiplying all bigram probabilities together (substituting estimations as described above, if needed) and returns the sentence with the maximum probability.
There are three interactive commands you can enter in OHK. Typing “\d” on a line toggles debugging output, which will print out all possible word expansions and all possible sentences and their scores for each line of input. Typing “\t” toggles the option to print the time in milliseconds it takes for both the call to getLegalWords() and the call to getBestSentence(). Typing “\q” quits the program.
Future Improvements
This was done mostly to relieve boredom and kill time so I didn’t spend much time contemplating the best strategy, but now that I’ve finished I do see that a different approach would probably make a much more useful and powerful product.
The fundamental design element I would change, if I were doing this again, but for real this time, would be to use character n-grams instead of word n-grams, and to change model to be a true HMM (hidden Markov model). There are quite a few reasons why I think this would be an improvement, beside that fact that the problem is essentially one of interpreting characters not words.
Using word-based n-grams requires a relatively large amount of input (several words) to get reasonable accuracy; using character-based n-grams would achieve reasonable accuracy with much less input (several characters). Also you would be getting much more out of your corpora by couting each character as a data point, instead of just each word. Of course, it would really help to use 3-grams or 4-grams to further boost accuracy and prevent over-training of very common character combinations.
To make the model an HMM the training corpus would first have to contain the mirrorboard-encoded text along with the normal text. The encoded-text n-grams would be the states and all the resulting decoded n-grams would be hidden variables. I would explain here more how HMM’s work, but this post is running long as it is so I’ll direct you to an old paper I wrote a while back on the subject, or better yet you can read all about it on wikipedia.


