While it’s not exactly a secret that Yahoo supports the open source project Hadoop, I think it’s only in the last week that I’ve seen an official description of Yahoo’s contribution. For those who don’t know, Hadoop is an open source implementation of Google’s MapReduce system. It provides a programming framework for processing large datasets using clusters of machines. By writing to the Hadoop framework, programmers get scalability and reliability on a distributed processing platform essentially for free.
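To make the framework idea concrete, here is a toy, single-process simulation of the map/shuffle/reduce pattern that Hadoop runs at cluster scale, using the classic word-count example. All names here are illustrative, not the Hadoop API; the programmer writes only the map and reduce functions, and the framework handles grouping (and, on a real cluster, distribution and retries).

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all partial counts for one word.
    return (word, sum(counts))

def mapreduce(lines, map_fn, reduce_fn):
    # Shuffle phase: group mapper output by key, as the framework would
    # across machines; then hand each key's group to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

corpus = ["the quick brown fox", "the lazy dog", "the fox"]
counts = mapreduce(corpus, map_fn, reduce_fn)
print(counts["the"], counts["fox"])  # 3 2
```

The point of the abstraction is that nothing in `map_fn` or `reduce_fn` knows how many machines are involved, which is what lets the same user code scale from one node to thousands.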
Besides supporting the basic project, Yahoo (through its Yahoo Research division) has also developed an abstraction interface layer on top of Hadoop called Pig. Pig allows one to express data analysis tasks in a relational-algebra style, giving them some resemblance to SQL.
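A Pig script chains relational operators such as LOAD, FILTER, GROUP, and FOREACH/GENERATE, which Pig then compiles down to MapReduce jobs. The sketch below mimics that dataflow in plain Python over a made-up list of (user, url) visit tuples, purely to show the kind of query Pig expresses; the actual Pig Latin syntax differs.

```python
from collections import defaultdict

# Made-up site-visit records: (user, url) tuples.
visits = [
    ("alice", "/home"), ("bob", "/home"),
    ("alice", "/search"), ("carol", "/home"),
]

# FILTER: keep only visits to /home.
home_visits = [(user, url) for user, url in visits if url == "/home"]

# GROUP BY url, then GENERATE (url, COUNT) for each group.
groups = defaultdict(list)
for user, url in home_visits:
    groups[url].append(user)
counts = {url: len(users) for url, users in groups.items()}
print(counts)  # {'/home': 3}
```

Expressing the job this way, rather than as raw map and reduce functions, is the whole appeal: the analyst thinks in filters, groups, and aggregates, and the system worries about turning that into distributed passes over the data.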
Another extension to Hadoop is HBase, a clone of Google’s Bigtable used to store structured data over a distributed system. And in case you don’t have clusters of machines lying around, Hadoop has also been extended with the ability to run on Amazon’s EC2.
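The Bigtable data model that HBase borrows is essentially a sparse map from (row key, column, timestamp) to a value, with reads returning the newest version by default. Here is a toy in-memory sketch of that model; the class and method names are illustrative and are not the HBase API.

```python
class ToyTable:
    """A toy Bigtable-style store: (row, column) -> {timestamp: value}."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        # Each cell keeps multiple timestamped versions of its value.
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        # Return the most recent version, mirroring the default read behavior.
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None

t = ToyTable()
t.put("com.example/index", "contents:html", 1, "<html>v1</html>")
t.put("com.example/index", "contents:html", 2, "<html>v2</html>")
print(t.get("com.example/index", "contents:html"))  # <html>v2</html>
```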
Furthermore, some Stanford researchers have recently published a paper called Map-Reduce for Machine Learning on Multicore (pdf) that demonstrates how many popular machine learning algorithms (such as logistic regression, naive Bayes, SVMs, neural networks, and others) can easily be adapted to the MapReduce framework for learning from large datasets. The paper is not specific to Hadoop, but it certainly points to how Hadoop can be further extended for large-scale machine learning projects.
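The paper’s central observation is that when an algorithm’s update is a sum over the training examples, mappers can compute partial sums over shards of the data and a reducer adds them, giving the same answer as a single machine. The sketch below shows this for one batch gradient step of logistic regression; the data, shard layout, and function names are all illustrative, not code from the paper.

```python
import math

def partial_gradient(shard, w):
    # Map step: partial gradient of the logistic loss over one data shard.
    g = [0.0] * len(w)
    for x, y in shard:
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for j, xj in enumerate(x):
            g[j] += (p - y) * xj
    return g

def gradient_step(shards, w, lr=0.1):
    # Reduce step: element-wise sum of the per-shard gradients,
    # then one gradient-descent update.
    total = [sum(parts) for parts in zip(*(partial_gradient(s, w) for s in shards))]
    return [wi - lr * gi for wi, gi in zip(w, total)]

# Two "machines", each holding a shard of (features, label) examples.
shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    [([1.0, 1.0], 1)],
]
w = gradient_step(shards, [0.0, 0.0])
```

Because addition is associative, splitting the sum across shards changes nothing about the result, which is exactly what makes this family of algorithms a natural fit for MapReduce.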