I previously wrote a post pointing to an article by Peter Norvig (Director of Research at Google) on how to write a statistical spelling corrector. My post noted that Peter’s article didn’t explain how the spelling suggestions offered by search engines (such as Google) learn from query logs. Well, it turns out that Silviu Cucerzan and Eric Brill over at Microsoft had already published a great paper at EMNLP 2004 called “Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users” (pdf). It explains some of the unique challenges of spelling correction in search queries. For example, new and unusual (but correct) terms show up in queries all the time, so one can’t just construct a dictionary of correct spellings. Simple frequency counting doesn’t work either, because certain misspelled queries (“britny spears”) occur very often. And a misspelled query may be composed entirely of correctly spelled terms (“golf war”), so checking each word in isolation isn’t enough. Fortunately, Silviu and Eric show how clever use of the query log can overcome these and other problems.
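To make the "iterative" idea a bit more concrete, here is a toy sketch in Python. It is not the algorithm from the paper, just my own minimal illustration under some loud assumptions: the query log is a plain dict of made-up counts (toy_log), whole queries are compared with unweighted edit distance, and the names edit_distance and correct are hypothetical. At each step the query moves to a more frequent logged query that is at most one edit away, and it stops when nothing nearby is more frequent.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain (unweighted) Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def correct(query: str, log_counts: dict, max_dist: int = 1, max_iters: int = 5) -> str:
    """Repeatedly move to a strictly more frequent logged query within max_dist edits."""
    current = query
    for _ in range(max_iters):
        best, best_count = current, log_counts.get(current, 0)
        for cand, count in log_counts.items():
            if count > best_count and edit_distance(current, cand) <= max_dist:
                best, best_count = cand, count
        if best == current:  # fixed point: no more frequent neighbor
            break
        current = best
    return current


# Made-up counts, purely for illustration (not real query-log data).
toy_log = {
    "britney spears": 1000,
    "britny spears": 300,   # a frequent misspelling
    "gulf war": 400,
    "golf war": 20,         # every word is spelled correctly
    "amd processors": 50,   # unusual but correct, no frequent neighbor
}

for q in ["britny spers", "golf war", "amd processors"]:
    print(q, "->", correct(q, toy_log))
```

With these toy counts, "britny spers" needs two steps (first to the frequent misspelling "britny spears", then to "britney spears"), "golf war" is pulled to "gulf war" even though both of its words are in any dictionary, and "amd processors" is left alone because no nearby query is more frequent. A real system needs much more than this, for example to avoid "correcting" a legitimate golf query, and the paper describes a far more careful model than my sketch.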
September 6, 2007
Spelling corrector that learns from query logs