Data Strategy

July 30, 2007

Hadoop gaining momentum

Filed under: Infrastructure — chucklam @ 2:06 pm

While it’s not exactly a secret that Yahoo supports the open source project Hadoop, I think it’s only in the last week that I’ve seen an official description of Yahoo’s contribution. For those who don’t know, Hadoop is an open source implementation of Google’s MapReduce system. It provides a programming framework for processing large datasets using clusters of machines. By following the Hadoop framework, programmers are automatically given scalability and reliability on a distributed processing platform.
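To make the programming model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python, using the usual word-count example. It is not Hadoop's API (Hadoop itself is written in Java); the function names map_phase, shuffle, reduce_phase, and word_count are my own illustrative choices. In a real cluster the framework would run the map and reduce calls on different machines and handle the grouping and failure recovery for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical toy input, just for illustration.
counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

The point of the structure is that map_phase only ever sees one document and reduce_phase only ever sees one key, which is what lets a framework spread the work across machines without the programmer writing any distribution logic.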

Besides supporting the basic project, Yahoo (through its Yahoo Research division) has also developed an abstraction layer on top of Hadoop called Pig. Pig allows one to express data analysis tasks in relational algebra, giving them some resemblance to SQL.

Another extension to Hadoop is HBase, a clone of Google’s Bigtable for storing structured data across a distributed system. And in case you don’t have clusters of machines lying around, Hadoop has also been extended with the ability to run on Amazon’s EC2.

Furthermore, some Stanford researchers have recently published a paper called Map-Reduce for Machine Learning on Multicore (pdf) that demonstrates how many popular machine learning algorithms (such as logistic regression, naive Bayes, SVM, neural networks, and others) can easily be adapted to the MapReduce framework for learning from large datasets. The paper is not specific to Hadoop, but it certainly points to how Hadoop can be further extended for large-scale machine learning projects.
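As I read it, the adaptation works because these algorithms can be written as sums over the training examples, so each mapper can compute a partial sum on its own slice of the data and a reducer only has to add the partial sums together. Below is a rough Python sketch of that idea for logistic regression trained by batch gradient ascent. This is my own toy illustration and not code from the paper: the names map_gradient, reduce_gradients, and train, the learning rate, and the tiny dataset are all assumptions, and the "shards" are just lists processed one after another on one machine.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_gradient(shard, weights):
    """Mapper: partial log-likelihood gradient summed over one shard of examples."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (label - pred) * x
    return grad


def reduce_gradients(a, b):
    """Reducer: add two partial gradients element by element."""
    return [x + y for x, y in zip(a, b)]


def train(shards, n_features, lr=0.1, epochs=500):
    """Batch gradient ascent; each epoch is one map step plus one reduce step."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        partials = [map_gradient(shard, weights) for shard in shards]  # map
        total = reduce(reduce_gradients, partials)                     # reduce
        weights = [w + lr * g for w, g in zip(weights, total)]
    return weights


# Hypothetical toy data: features are [bias, x], label is 1 when x is positive.
shards = [
    [([1.0, -2.0], 0), ([1.0, -1.0], 0)],
    [([1.0, 1.0], 1), ([1.0, 2.0], 1)],
]
weights = train(shards, n_features=2)
```

Because the gradient is a plain sum, splitting the data into shards changes nothing about the result (up to floating-point rounding); it only changes where the arithmetic happens. On a cluster or a multicore machine, the list comprehension in train is the part that would run in parallel.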



  1. Hadoop and Pig look very interesting. It is great that they are developing the tools to allow everyone to use the MapReduce paradigm.
    I think this architecture is going to be used more and more by businesses outside of search as the data we are working on becomes bigger and bigger.

    As for the multicore MapReduce, yes, it is interesting, but how many of us have more than 2 cores? This is something that may be useful in the future, when we all have those 80-core Intel chips!

    Comment by Shane — July 30, 2007 @ 5:35 pm

  2. The multicore angle of the Stanford paper was a little intriguing to me as well. My guess is that they’ve chosen that positioning due to limited research time and resources. For the multicore approach, they can implement a much lighter version of MapReduce, not having to worry about unreliable CPUs or networks (see Sect. 3 of their paper). They also don’t need to build a whole compute cluster for experimentation. Finally, they can test using readily available datasets. MapReduce on compute clusters is pretty pointless unless you have truly large datasets, which may be too inconvenient for these researchers to get hold of.

    Comment by chucklam — July 31, 2007 @ 4:11 am

  3. […] (For more background on Hadoop and its extensions, see my blog post here.) […]

    Pingback by Yahoo! Announces Distributed Computing Academic Program « Data Strategy — November 14, 2007 @ 2:07 am

  4. […] but a side story had some interesting information about Hadoop. I had previously blogged about how Hadoop was gaining momentum among the technical community. The BusinessWeek article mentioned a couple examples of actual businesses using Hadoop for their […]

    Pingback by More example uses of Hadoop « Data Strategy — December 22, 2007 @ 12:59 am
