xkcd’s Mirrorboard: A Statistical Language Model Approach

Downloads

Java classes and source

Sample bi-gram model - trained using History of the United States, Author: Charles A. Beard and Mary R. Beard (downloaded from Project Gutenberg) and the linux “words” on my system.

Backround

I was back in the midwest over the holidays and due to the temperature hovering near zero and it being the midwest, I had plenty of time to sit in coffee shops and catch up on all sorts of internet time killers. So, when browsing the xkcd blag I came across the idea of using just one hand on the keyboard, but still typing efficiently — sort of. I didn’t like the idea of having to potentially hit CapsLock for every letter, but I started playing with the “cat words | grep -i “^[qp][a’][ru][dk][ei][vn][gh]$”" type command to lookup the list of possible valid words given the ambiguous input as described in Notes: (4). The challenge now is to find the best word or words for the input, which reminded me of some systems I had seen before.

I worked on computational linguistics research when I was in grad school, and one of the topics I became interested in is language model-backed input methods such as Dasher, which allows you to input characters by steering with a mouse through approaching columns of characters. The cool part is that the size of each character is proportional to the probability of it coming next, as computed from the stats making up the language model. This makes it easier to steer because you usually don’t have to move the mouse very much. Now, other so-called “intelligent” input methods are popping up all over the place — especially with so many people using cell phones for email and web browsing, which have limited input abilities.

So, in a highly caffeinated state, I hacked together a proof-of-concept “intelligent guess” program that uses a corpus of text to train a word-level bigram language model. That model is used to output the most likely string of valid words given ‘mirrorboard encoded’ input (tges es a test = this is a test, wwrd wf tge dat = word of the day).

Use

The program is composed of three java classes, TrainBigram, MirrorBoardEncode and OHK. TrainBigram’s main method expects one command line argument - the corpus file. The format of the corpus file is any plain text document, with an optional word list at the end that will add words to the model, but will not be counted in the bigram stats; this allows for more complete coverage of any words not in the corpus. Anything you want to be part of the word list needs to follow a line containing only “<<**WORDLIST**>>”.

I found an easy way to build a corpus was to go to project gutenberg’s site and download a few books in plain text that used the type of writing I wanted to model. Then just cat them together:
cat *.txt > corpus
then add the standard linux word list to the end for a more complete vocabulary:
echo "<<**WORDLIST**>>” >> corpus
cat /usr/share/dict/words >> corpus
After that you are ready to train the bigram model. I find that with a few megabytes worth of text (0.5 - 1 million words) training the model takes more than the default amount of memory available to the JVM, so include “-Xmx256m” to use up to 256m of memory:
java -Xmx256m TrainBigram corpus
this will output a trained bigram file called “corpus.bg”. At this point we are ready to run the main program to decode mirrorboard input. The class OHK expects to have a trained bigram named “ohk.bg” in the same folder as it. Simply rename or link the bigram file output by TrainBigram:
mv corpus.bg ohk.bg
or:
ln -s corpus.bg ohk.bg

As an option for generating phrases to test with, MirrorBoardEncode takes normal input and outputs mirrorboard encoded text:
java MirrorBoardEncode
type a sample sentence here

Now, run OHK:
java -Xmx256m OHK

Input a line of mirrorboard encoded text and press enter. Hopefully, if your training corpus was large and relevant enough, you’ll see what you were trying to type in.

Theoretic Foundation

As far as language models go, this is a pretty homebrew design, but probably still qualifies as a Markov chain. As the training corpus is processed, words encountered for the first time are added to a hashtable with a value of 1.0, on subsequent encounters with the same word, the value is incremented by one. In addition to words being added, bigrams are formed by joining two consecutive words together with a separating character and added to the hashtable in the same manner.

After the corpus has been processed probabilities are computed based on word counts. For individual words, I store (frequency of word)/(total word count) as its probability; this helps me cheat by not smoothing the model. For bigrams, the conditional probability is used; p(word2 | word1) = probability of encountering word2 given word1 = (frequency of word1 and word2)/(frequency of word1). As a way to get by without any kind of smoothing for bigrams with a count of zero, I use the (probability of word1) * (probability of word2) in place of the bigram probability. I can imagine this makes statisticians furious, but to me it is an estimate of the bigram probability because it is literally the probability of picking those two particular words out of the training corpus, but unfortunately doesn’t take into consideration any information about word order.

Implementation details

The model is simply a java.util.Hashtable object, with Strings for keys and Doubles for values. The main program is composed of two main functions, getLegalWords() and getBestSentence().

getLegalWords() does the same thing as the sample grep command from xkcd, it returns all legal words given the input. For performance reasons this was implemented as a recursive function that looks up only the possible combinations of letters corresponding to the input, therefore doesn’t search the entire word list. The caveat with this is that it runs in exponential time, which is usually much faster than an iterative approach, but if 2 ^ (input length) is greater than the number of words in the word list, the iterative approach is faster. For a typical corpus the max length will be about 16.5 characters. For this reason there is also an iterative implementation of getLegalWords().

getBestSentence() is implemented as a recursive function as well, more for ease of programming than performance, though. getBestSentence() builds every possible sentence given the list of possible alternative words. It scores each sentence by multiplying all bigram probabilities together (substituting estimations as described above, if needed) and returns the sentence with the maximum probability.

There are three interactive commands you can enter in OHK. Typing “\d” on a line toggles debugging output, which will print out all possible word expansions and all possible sentences and their scores for each line of input. Typing “\t” toggles the option to print the time in milliseconds it takes for both the call to getLegalWords() and the call to getBestSentence(). Typing “\q” quits the program.

Future Improvements

This was done mostly to relieve boredom and kill time so I didn’t spend much time contemplating the best strategy, but now that I’ve finished I do see that a different approach would probably make a much more useful and powerful product.

The fundamental design element I would change, if I were doing this again, but for real this time, would be to use character n-grams instead of word n-grams, and to change model to be a true HMM (hidden Markov model). There are quite a few reasons why I think this would be an improvement, beside that fact that the problem is essentially one of interpreting characters not words.

Using word-based n-grams requires a relatively large amount of input (several words) to get reasonable accuracy; using character-based n-grams would achieve reasonable accuracy with much less input (several characters). Also you would be getting much more out of your corpora by couting each character as a data point, instead of just each word. Of course, it would really help to use 3-grams or 4-grams to further boost accuracy and prevent over-training of very common character combinations.

To make the model an HMM the training corpus would first have to contain the mirrorboard-encoded text along with the normal text. The encoded-text n-grams would be the states and all the resulting decoded n-grams would be hidden variables. I would explain here more how HMM’s work, but this post is running long as it is so I’ll direct you to an old paper I wrote a while back on the subject, or better yet you can read all about it on wikipedia.

Comments

Gaggia Espresso Mod Internals

Comments (1)

Gaggia Espresso PID Arduino Mod

Modding the Gaggia Espresso:

Adding a PID temperature controller and computer interface.

gaggia espresso pid mod
This really isn’t about aesthetics, but I’ll probably clean up the install a bit.

Goal

The goal of this project is to create a computer interface for my Gaggia Espresso machine replacing the factory thermal switches with a PID controller and the “steam” and “pump” toggle switches with a LCD/button menu system.

Motivation

After seeing the various instructions and kits geared toward modding a home espresso machine with a PID, and their costs and features compared to something like the PID PIC MOD, I decided I would start from scratch and build my own PID. Then I could easily extend it to control the rest of the espresso machine, such as adding a timer to monitor or control the shot time and adding temperature presets.

Hardware

Parts list:

  • 1x Arduino Decimillia
  • 1x 25A zero-cross SSR (for pump, could be much smaller, > 5A)
  • 1x 40A zero-cross SSR (for heating element, could be 25A)
  • 1x AD595 thermocouple conditioner/interface IC
  • 1x Type-K thermocouple (could use type T or possibly other)
  • 1x 16×2 character LCD with HD44780 parallel interface
  • 3x SPST normally-open momentary push-button switch
  • 10AWG stranded wire for 125V wiring (12AWG is probably good enough)
  • 22AWG solid wire for 5V wiring
  • Power supply for Arduino
  • Various resistors, capacitors, quick-disconnect connectors for 10AWG wire, etc.

I’m using an Arduino Decimillia micro-controller which utilizes the AVR AtMega168. The hookup of the components is pretty straight-forward; The LCD uses 6 digital IO pins - 4 for a 4 bit parallel data bus, one for Register Select and one for Operation (data read/write) enable signal. The SSRs take one digital IO pin each. The AD595 takes one 10 bit AD converter pin. The three buttons share one external interrupt pin and use two digital IO pins to indicate which button has been pressed; with this scheme, one interrupt can be used with 2^n buttons taking only n+1 pins. Various resistors, capacitors and other components are used, but the details are rather tedious and boring.

Software

The software was written in the Arduino Programming Language, which is based on Processing and is similar to C++ with a few java-like syntactical niceties thrown in.

The main loop is executed at about 100Hz. During each iteration of the loop the AD595 is read and summed 100 times and after 100 iterations the average is taken and the current temp is updated. I started by oversampling 100x but found that that resulted in poor accuracy and provided many more readings than I was using. After some experimentation I found that oversampling 10,000 times gave great accuracy (about +/- 0.1 deg. Farenheit) and didn’t introduce any negative side effects due to excessive smoothing of the temperature curve. Also, since the PID period is one second, oversampling 10,000 times, giving me one reading per second, is just enough for the PID to have an updated reading at each iteration.

The PID algorithm is based off of this article: PID without a PHD. One of the bigger problems I had was tuning the PID, as seems to be the case for many people. What gave me such a hard time was all the instructions on the internet about how to tune a PID. Most say to start with Ki and Kd set to 0 and adjust Kp first. They go on to say that Ki can then be set to remove any static offset from the setpoint. Kd is said to be tricky to set and unnecessary for most applications.

After having trouble with that tuning technique and observing the system in operation a bit I realized that those instuctions wouldn’t work for me. The problem with relying on Kp to correct most of the error is that you have to be below the setpoint before any correction occurs (unless you add a constant positive error term to each reading). Because the system takes a few seconds to respond, the temp dips pretty low before it starts to reverse. Likewise, as the temp is climbing towards the setpoint, proportional error correction is applied right up until hitting the setpoint, at which point too much heat has been added and the temp overshoots the setpoint considerably. Adding ANY Ki just makes things worse. It dawned on me that the system was responding way too late, so the PID needed to anticipate the future state of the system in order to be effective. The future temp depends on two things, the current temp and the rate of change of temp. After observing that the temp would climb about 15 deg. after shutting off the boiler at maximum rate of change and doing some simple math to figure out what some reasonable values for Kd and Kp should be, I came up with settings which hold the temp to within +/- 1 deg. farenheit of the setpoint, and not more than 4 or 5 deg. of over/undershoot when turning on or recovering from steam mode. Intra-shot stability is also quite good usually dipping no more than a couple of deg. farenhiet and ending back at setpoint by the end of a 30 sec. 1.75 - 2 oz. shot. This is with Kd set to an aggressive 5x Kp and Ki set to 0. I believe this can be improved with further tuning of the PID parameters.

Because the boiler is either off or on, I use a PWM (pulse width modulation) technique to control the boiler. I treat the PID output as a percentage and keep the boiler on for that percent of a second, updating every second. Anything < 0% is treated as 0% and anything > 100% is treated as 100%. I round the output to the tens so that I limit the minimum on-off cycle time of the boiler to 0.1 seconds. This is for several reasons; anything faster is really unnecessary, it wears the SSR less and it limits the amount of EM interference caused by switching 1300 watts on and off that quickly.

Results

The machine works great! I get better stability and control than I was expecting, and it came together pretty quickly with no major problems. Additionally, I still need to finish fine-tuning the PID parameters to maximize stability.

As for the Espresso, that is a whole other subject that I will dedicate a whole post to, but for now I will say that I’m getting some great results with my local San Francisco favorites — Blue Bottle Espresso Temescal, Blue Bottle Roman Espresso and Coffee to the People’s Beatnik Espresso. I’ll be picking up some Blue Bottle Hayes Valley Espresso and some Blue Bottle Retrofit Espresso shortly.

What I have left to do:

  • Fine-tune the PID parameters
  • Improve the accuracy of the temperature reading by using an equation to adjust for the slight non-linearity of the AD595
  • Fix the sometimes flaky button circuit
  • Save settings to EEPROM instead of reverting to defaults upon power cycle.
  • Add persistent settings profiles that can be chosen via the setup menu.
  • Calibrate the temp circuit
  • Correlate the boiler temp to group head temp or add secondary thermometer to group head
  • Show average error in +/- deg. F/C
  • ???

The program is still pretty beta, with some features requiring cleaning up and others needing to be implemented. When it’s a bit more finished I will post the GPL’d source code, if there is any interest.

If you thought this was interesting, check out the PID PIC MOD. It’s a surprisingly similar project with a Rancilio Sylvia and gave me some high-level ideas for my project.
Nash Lincoln

Comments (8)

« Previous entries ·