Data Strategy

December 22, 2007

More example uses of Hadoop

Filed under: Infrastructure — chucklam @ 12:59 am

I was extremely busy the last couple of weeks, so there’s a lot of stuff I missed blogging about. Need to catch up now…

BusinessWeek just ran an issue whose cover story was about Google and its ability to process large datasets. It’s largely a PR piece for Google, but a side story had some interesting information about Hadoop. I had previously blogged about how Hadoop was gaining momentum in the technical community. The BusinessWeek article mentioned a couple of examples of actual businesses using Hadoop for their computing needs. “Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site, says Hadoop founder Doug Cutting.” In addition, “the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.”

Since Hadoop follows the lineage of Nutch and Lucene, the NYT case of document indexing sounds like a canonical application of Hadoop. Using Hadoop with Amazon’s cloud at such a scale is quite unusual though. I wonder if NYT is also using the Amazon cloud for other tasks. 11 million articles may add up to around half a terabyte of data. Having to upload all that for a one-time use is quite inefficient in terms of both time and money (as Amazon does charge for bandwidth). It wouldn’t surprise me if NYT is in fact using Amazon S3 for its archival storage too.
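To make the back-of-envelope reasoning above concrete, here is a rough sketch of the arithmetic. Only the article count (from the BusinessWeek quote) and the half-terabyte guess come from this post. The 100 Mbit/s link speed and the per-gigabyte transfer rate are placeholder assumptions for illustration, not quoted Amazon prices or anything NYT has reported.

```python
# Back-of-envelope check of the size, upload time, and bandwidth cost.
ARTICLES = 11_000_000        # from the BusinessWeek quote
TOTAL_BYTES = 0.5 * 10**12   # "around half a terabyte" (decimal), this post's guess


def avg_article_kb(total_bytes=TOTAL_BYTES, articles=ARTICLES):
    """Average size per article in kilobytes implied by the two figures."""
    return total_bytes / articles / 1000


def upload_hours(total_bytes=TOTAL_BYTES, mbit_per_s=100.0):
    """Hours to upload at a sustained rate (100 Mbit/s is an assumed link speed)."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return total_bytes / bytes_per_s / 3600


def transfer_cost_usd(total_bytes=TOTAL_BYTES, usd_per_gb=0.10):
    """Bandwidth cost; usd_per_gb is a placeholder rate, not a quoted price."""
    return total_bytes / 10**9 * usd_per_gb


if __name__ == "__main__":
    print(f"average article size: {avg_article_kb():.1f} KB")
    print(f"upload time:          {upload_hours():.1f} hours")
    print(f"transfer cost:        ${transfer_cost_usd():.2f}")
```

Under these assumptions the half-terabyte guess works out to roughly 45 KB per article. Swap in your own link speed and rate to see how the upload time and cost move.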
