March 14, 2008

Seeing Netflix data as more than just a bunch of numbers

It’s a truism among dataminers that analyzing certain data can help us understand people. However, dataminers rarely see that psychology, the discipline of understanding people, can help get more value out of such data. It was recently reported that one of the top ten contestants in the Netflix Prize approached the challenge from a psychologist’s point of view rather than from a computer scientist’s.

For example, people often don’t give their “true” rating of a movie. Instead, they can be biased by anchoring. That is, their rating of a movie is influenced by the ratings they have just given to other movies. Adjusting for biases such as this is how Gavin Potter, aka “Just a guy in a garage,” got to be number 9 on the Netflix Prize leaderboard.

November 14, 2007

reCAPTCHA gone awry…

I think reCAPTCHA is a very clever idea to layer data collection on top of an authentication system. However, sometimes the security check is just a bit too puzzling. I came across this today on Facebook. How am I supposed to type the answer in? 😉


October 20, 2007

Statistical models beating experts

Ian Ayres published an edited excerpt from his book ‘Super Crunchers: How Anything Can Be Predicted’ in the Financial Times back in August. The piece revolves around the idea that “Since the 1950s, social scientists have been comparing the predictive accuracies of number crunchers and traditional experts – and finding that statistical models consistently outpredict experts.” This is hardly news to anyone who has studied pattern recognition. While statistical models are much worse than the average person on “simple” tasks (e.g. speech recognition), they generally outperform “experts” on “intelligent” tasks (e.g. medical diagnosis).

However, Ian still manages to cite examples that are interesting but not well known (at least to me). Six years ago, two political scientists, Andrew Martin and Kevin Quinn, developed a system that uses “just a few variables concerning the politics of the case” to predict how the US Supreme Court justices would vote. As a friendly contest, that system was pitted against a panel of 83 “legal experts – esteemed law professors, practitioners and pundits who would be called upon to predict the justices’ votes for cases in their areas of expertise.” The task “was to predict in advance the votes of the individual justices for every case that was argued in the Supreme Court’s 2002 term.” The statistical system won, with 75% accuracy versus 59.1% for the experts.

Another interesting example is that of Orley Ashenfelter, an economist at Princeton University. He devised a formula for predicting wine quality: Wine quality = 12.145 + 0.00117 winter rainfall + 0.0614 average growing season temperature – 0.00386 harvest rainfall. It’s not clear whether that equation outperforms experts. However, the equation seems good enough that it ruffled the feathers of quite a few wine snobs 🙂
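To get a feel for how the terms in that equation trade off, here is a minimal Python sketch that simply evaluates it. The function name, argument names, and sample inputs are mine, purely for illustration; the excerpt doesn’t state the units, so treat the inputs as being in whatever units the original formula assumes.

```python
# Coefficients exactly as quoted in the excerpt.
INTERCEPT = 12.145
WINTER_RAINFALL_COEF = 0.00117
GROWING_SEASON_TEMP_COEF = 0.0614
HARVEST_RAINFALL_COEF = 0.00386  # subtracted: rain at harvest lowers the score


def wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    """Evaluate the quoted wine quality equation for one vintage.

    Units are not given in the excerpt; the caller is assumed to supply
    values in the units the original formula expects.
    """
    return (
        INTERCEPT
        + WINTER_RAINFALL_COEF * winter_rainfall
        + GROWING_SEASON_TEMP_COEF * growing_season_temp
        - HARVEST_RAINFALL_COEF * harvest_rainfall
    )


if __name__ == "__main__":
    # Hypothetical inputs, not real vintage data.
    wet_harvest = wine_quality(600, 17, 200)
    dry_harvest = wine_quality(600, 17, 100)
    print(f"wet harvest: {wet_harvest:.4f}")
    print(f"dry harvest: {dry_harvest:.4f}")
```

With everything else held fixed, the drier harvest scores higher and a warmer growing season would too, which is all the signs of the coefficients are saying.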

October 14, 2007

Interesting job posting for a click fraud analyst

This is one of the most interesting job posts I’ve come across in a while. It’s a datamining contest in which contestants with interesting approaches get to apply for an analyst position. No word on whether the contest itself has any other prize 🙂

Data Mining Contest: Uncover Criminal Activity in a Real Fraud Case

The purpose of the contest is to identify a particular type of highly sophisticated fraud that appeared in a pay-per-click advertising program. The fraud targeted one advertiser only. Participants providing an interesting answer will be invited to apply for a Sr. Fraud Analyst position with Authenticlick. The small spreadsheet with the fraudulent click data can be downloaded here. Your answers should be emailed to vlg @

The two questions are

  • Why are these clicks fraudulent?
  • What type of click fraud is it?

You can check references such as our click scoring blog to help you answer the questions.

August 17, 2007

Fight crime with datamining

Some police departments are using datamining to predict where and when crimes are more likely to occur.

Using some sophisticated software and hardware [the Richmond, Va. police service] started overlaying crime reports with other data, such as weather, traffic, sports events and paydays for large employers. The data was analyzed three times a day and something interesting emerged: Robberies spiked on paydays near cheque cashing storefronts in specific neighbourhoods. Other clusters also became apparent, and pretty soon police were deploying resources in advance and predicting where crime was most likely to occur.

Coupled with some other technological advancement, such as surveillance videos wirelessly transmitted to patrol cars, major crime rates dropped 21 per cent from 2005 to 2006. In 2007, major crime is down another 19 per cent.

August 11, 2007

David Heckerman interview

A little over a month ago CNet published an interview with David Heckerman, lead researcher of Microsoft’s Machine Learning and Applied Statistics Group. I haven’t heard much about David lately, and apparently he’s been busy developing open source analytical tools for HIV research.

Back in 1990, David won the ACM Doctoral Dissertation Award for his thesis “Probabilistic Similarity Networks”. He was a major influence in the ’90s in establishing Bayesian networks and Bayesian methodologies as practical tools in AI and machine learning. I remember my adviser lending me a copy of David’s thesis when I first started my PhD study. He told me to read it and to strive for a thesis of similar caliber. (Yes… the first couple of years of graduate study tend to be filled with optimism and ambition.) David’s thesis showed me the potential of quality research.

In the last five years or so, I haven’t come across any publication by David Heckerman. It’s good to learn that he’s still doing great work, just now in a slightly different field.

July 11, 2007

Speech recognition market exploding

There’s an article in today’s Wall Street Journal (subscription required) on the growing market in speech recognition technologies. According to Datamonitor, sales of such technologies are growing at 22% a year. Nuance Communications dominates this market, and they’re not likely to lose this edge any time soon. The reason?

Perfecting speech-recognition technology requires collecting and studying large amounts of utterances and accents. Bill Meisel, editor of Speech Strategy News, an industry publication, estimates that it would take competitors some time to catch up with Nuance in more-advanced versions. “They certainly have more real-world data than anyone else,” he says.

So while data alone didn’t build this $3B company, it certainly was a big part of its strategy.

