Data Strategy

April 2, 2009

Amazon Elastic MapReduce, and other stuff I don’t have time to grok yet

Filed under: Infrastructure, Uncategorized — chucklam @ 4:54 am

Lots of good stuff has been coming to my attention lately.

  • Amazon just announced their Amazon Elastic MapReduce program. Sounds like the main point of this service is to simplify setting up a Hadoop cluster in the cloud, and Amazon charges you a little extra above the normal EC2 and S3 costs for it. It’s not clear to me yet why people would pay the extra cost instead of running their own instance of Hadoop on EC2. I mean, you can just read Chapter 4 of my book and do this all by yourself easily 😉 I hope to look more into this service over the weekend. At the very least, it’s a sign that a meaningful number of Amazon Web Services’ customers are running Hadoop on the EC2 cloud, and so Amazon has decided to focus on making that easier.
  • The March issue of the IEEE Data Engineering Bulletin is a special issue on data management on cloud computing platforms. It has papers written by academics as well as from Yahoo and IBM. Haven’t had time to read it yet, but it looks like Hadoop and Amazon EC2 are mentioned a lot.
  • Just heard about the open source Sector-Sphere project, which is a system for distributed storage and computation using commodity computers. In other words, it’s an alternative framework to Hadoop but it has a lot of architectural differences. It seems to be just the work of a few academics so far. I hope to play around with it… when I can find time from work and writing the book…
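For readers who haven’t touched Hadoop yet, here’s a toy, in-memory sketch of the MapReduce model in Python — this is not Hadoop’s real Java API, just the classic map/shuffle/reduce word-count shape that these frameworks all implement:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop on ec2", "hadoop in the cloud"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["hadoop"])  # 2
```

In a real cluster, the map and reduce functions run in parallel on many machines and the shuffle moves data over the network, but the logical contract is exactly this.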

October 13, 2008

Google Technology RoundTable: MapReduce

Filed under: Datamining, Infrastructure — chucklam @ 2:27 am

Google has released a series of YouTube interviews with their lead engineers. Embedded below is one about MapReduce. The four engineers interviewed include the inventors of MapReduce. Some quotes:

6:17 – If we hadn’t had to deal with [machine] failures… we would have probably never implemented MapReduce. Because without having to support failures, the rest of the machine code is just not that complicated.

7:20 – (Interviewer) What do you feel the technology [MapReduce] isn’t applicable for?… (Sanjay Ghemawat, Google Fellow) you can always squint at [a problem] in the right way… you can usually find a way to express it as a MapReduce…, but sometimes you have to squint at things in a pretty strange way to do this… For example, suppose you want to compute the cross correlation of every single pair of web pages in terms of saying what is the similarity… I can run a pass where I just sort of magnify the input into the cross product of the inputs and then I can apply a function on each pair in there saying how similar it is. Your intermediate data will be quadratic in the size of the input, so you probably don’t want to do it that way. So you’ll have to think a bit more carefully what your intermediate data is in that case… There’s a lot of thinking at the application level if you want to use MapReduce in that scenario. [Emphasis mine]
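Sanjay’s point is easy to see in miniature: if the map phase “magnifies” n inputs into their cross product, you have n² intermediate records before any reduce even runs. A toy Python sketch (the similarity function here is a placeholder of my own, not anything from the interview):

```python
from itertools import product

def naive_pairwise(pages, similarity):
    # The "magnify into the cross product" approach: every pair of pages
    # becomes an intermediate record, so intermediate data is O(n^2).
    return {(a, b): similarity(a, b) for a, b in product(pages, repeat=2)}

def overlap(a, b):
    # Placeholder similarity: count of words the two page texts share.
    return len(set(a.split()) & set(b.split()))

pages = ["big data tools", "data mining tools", "cloud infrastructure"]
scores = naive_pairwise(pages, overlap)
print(len(scores))  # 9 intermediate pairs for just 3 pages
```

At 3 pages the cross product is 9 pairs; at a billion web pages it is 10¹⁸, which is why you have to rethink what the intermediate data should be.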

18:14 – (Matt Austern, SW engr) One of the core implementation issues in MapReduce is how you get the intermediate data from the Mappers to the Reducers. Every Mapper writes to every Reducer, and so it ends up making very heavy use of the network… (Interviewer) If you really want to provide a lot of computing, it’s very easy, one would think, to just buy lots more microprocessors… but the issue is communication between them… (Jerry Zhao, SW engr) Communication is not the only limit. How to coordinate the communication channel itself is also an interesting problem.
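The mapper-to-reducer fan-out Austern describes comes from the partitioning step: each intermediate key is hashed to pick its reducer, so any one mapper’s output can land on every reducer. A rough sketch, using crc32 as a deterministic stand-in for Hadoop’s default hash partitioner:

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    # Deterministic stand-in for Hadoop's default HashPartitioner:
    # reducer index = hash(key) mod R.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# One mapper's intermediate keys typically scatter across all reducers,
# which is why the shuffle makes such heavy use of the network.
keys = ["apple", "banana", "cherry", "date", "elderberry", "fig"]
by_reducer = defaultdict(list)
for k in keys:
    by_reducer[partition(k, 3)].append(k)
print(dict(by_reducer))
```

With M mappers and R reducers, the shuffle is potentially M×R network flows, which is the coordination problem Jerry Zhao alludes to.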

20:17 – MapReduce was originally designed as a batch processing system for large quantities of data. But we see our users using MapReduce for relatively small sets of data that have very strict latency requirements.

This is probably beside the point, but everyone in the video except maybe Sanjay sounds really scripted and robotic…

July 10, 2008

Conference on Cloud Computing (July 19)

Filed under: Infrastructure — chucklam @ 4:22 am

Cloud Computing – the New Face of Computing – Promises and Challenges

9th IEEE/NATEA Annual Conference
2008 New Frontiers in Computing Technology

Date: July 19, 2008 (Saturday)
Location: Cubberley Auditorium at Stanford University

IEEE Computer Society – Santa Clara Valley Chapter
IEEE Stanford – Student Chapter
North America Taiwanese Engineers’ Association

Registration site:
The registration fee is $65 for regular attendees, $60 for members, and $30 for students/unemployed.

Cloud Computing denotes the latest trend in application development for Internet services, relying on clouds of servers to handle tasks that used to be managed by individual machines. With Cloud Computing, developers take important services, such as email, calendars, and word processing, and host them entirely online, powered by a vast array (or cloud) of interdependent commodity servers. Cloud Computing presents advantages for organizations seeking to centralize the management of software and data storage, with guarantees on reliability and security for their users. Recently, we have seen many efforts to commercialize the cloud, such as Amazon’s EC2/S3/SimpleDB, Google’s App Engine, Microsoft’s SQL Server data services, and IBM’s “Blue Cloud” service. At the same time, open source projects such as Hadoop and ZooKeeper offer various software components that are essential for building a cloud infrastructure. We hope to bring together eminent researchers and practitioners from key research labs, companies, and open source communities to give us a quick overview of cloud computing. In addition, these speakers will present their views on the opportunities and challenges of cloud computing, from either a technology or a business perspective.

Hamid Pirahesh, IBM Almaden Research, Keynote Talk, “Impact of Cloud Computing on Emerging Software System Architecture and Solutions”
Jimmy Lin, University of Maryland at College Park, “Scalable Text Processing with MapReduce”
Jim Rivera, “Platform as a Service: Changing the Economics of Innovation”
Joydeep Sen Sarma and Ashish Thusoo, Facebook, “Hive: Datawarehousing and Analytics on Hadoop”
Hairong Kuang, Yahoo, “Take an internal look at Hadoop”
Mano Marks, Google, “App Engine: Building a Scalable Web Application on Google’s infrastructure”
Kevin Beyer, IBM Almaden Research, “Jaql: Querying JSON data on Hadoop”
Mihai Budiu, Microsoft Research in Silicon Valley, “DryadLINQ – a language for data-parallel computation on computer clusters”
Jinesh Varia, Evangelist, Amazon Web Services, “Cloud Architectures – New way to design architectures by building it in the cloud”

If you have questions about this event, contact Howard Ho or Eric Louie.

IEEE Computer Society
Santa Clara Chapter
Eric Louie

April 19, 2008

Amazon CTO to give talks on Internet infrastructure

Filed under: Infrastructure — chucklam @ 6:15 pm

Werner Vogels,’s CTO, is giving two talks at Stanford on Tuesday and Wednesday. I hope to attend the Wednesday talk on lessons learned from building Amazon’s infrastructure. The talk’s “focus will be on state management which is one of the dominating factors in the scalability, reliability, performance and cost-effectiveness of the overall system.” More details below:

Stanford EE Computer Systems Colloquium

4:15PM, Wednesday, April 23, 2008
HP Auditorium, Gates Computer Science Building B01

A Head in the Cloud – The Power of Infrastructure as a Service

Werner Vogels

About the talk: Building the right infrastructure that can scale up or down at a moment’s notice can be a complicated and expensive task, but it’s essential in today’s business landscape. This applies to an enterprise trying to cut costs, a young business unexpectedly saturated with customer demand, or a start-up looking to launch. There are many challenges in building a reliable, flexible architecture that can manage the unpredictable behaviors of today’s internet businesses. This presentation will review some of the lessons learned from building one of the world’s largest distributed systems. The focus will be on state management, which is one of the dominating factors in the scalability, reliability, performance, and cost-effectiveness of the overall system.

Also of interest

Werner Vogels will also speak on “Distributed Cloud Computing” in the Clean Slate Seminar on April 22, 2008, 4-5PM in Packard 101. The Clean Slate Seminar (CS541) is part of the Clean Slate Internet Design research program, which is aimed at addressing two broad and ambitious research questions: “With what we know today and if we were to start again with a clean slate, how would we design a global communications infrastructure?” and “How should the Internet look in 15 years?”

February 20, 2008

First Hadoop Summit to be held at Yahoo

Filed under: Infrastructure — chucklam @ 4:00 pm

On March 25. See announcement here. This follows an announcement by Yahoo that they’ve deployed a Hadoop application that runs on a 10,000-core Linux cluster.

January 15, 2008 – for people who love large data sets

Filed under: Data Collection, Datamining, Infrastructure — chucklam @ 4:50 pm

Aaron Swartz just announced a new Web site he created for people who love large data sets, “the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.”

The site is very spartan right now, but it can certainly become very interesting if it attracts the right contributors. I hope he succeeds in building a community around it.

December 22, 2007

More example uses of Hadoop

Filed under: Infrastructure — chucklam @ 12:59 am

I was extremely busy the last couple weeks, so there’s a lot of stuff I missed blogging about. Need to catch up now…

BusinessWeek just had an issue where the cover story was about Google and its ability to process large datasets. It’s largely a PR piece for Google, but a side story had some interesting information about Hadoop. I had previously blogged about how Hadoop was gaining momentum among the technical community. The BusinessWeek article mentioned a couple examples of actual businesses using Hadoop for their computing needs. “Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site, says Hadoop founder Doug Cutting.” In addition, “the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.”

Since Hadoop follows the lineage of Nutch and Lucene, the NYT case of document indexing sounds like a canonical application of Hadoop. Using Hadoop with Amazon’s cloud at such a scale is unusual, though. I wonder if NYT is also using the Amazon cloud for other tasks. 11 million articles may add up to around half a terabyte of data. Having to upload all that for a one-time use is quite inefficient in terms of both time and money (as Amazon does charge for bandwidth). It wouldn’t surprise me if NYT is in fact using Amazon S3 for its archival storage too.
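The half-terabyte figure is just a back-of-envelope guess. Assuming an average of roughly 50 KB per digitized article (my assumption for illustration, not a published number):

```python
# Back-of-envelope estimate of the NYT archive size.
# The 50 KB average article size is an assumption, not a reported figure.
articles = 11_000_000
avg_bytes = 50 * 1024                      # ~50 KB per article
total_tb = articles * avg_bytes / 1024**4  # bytes -> tebibytes
print(f"{total_tb:.2f} TB")                # roughly half a terabyte
```

At a different average article size the total scales linearly, so anywhere from 20 KB to 100 KB per article still lands in the hundreds-of-gigabytes to one-terabyte range.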

November 14, 2007

Yahoo! Announces Distributed Computing Academic Program

Filed under: Infrastructure — chucklam @ 2:06 am

Story via Read/WriteWeb.

Yahoo!… announced an academic research partnership with Carnegie Mellon University that will give students access to Hadoop and other open source tools running in a supercomputing-class data center. The data center… is a 4,000-processor cluster supercomputer with 3 terabytes of memory and 1.5 petabytes of disk space… CMU and Yahoo! also plan to hold a Hadoop Summit in the first half of 2008…

This is so cool. I just hope the summit will be out here in Silicon Valley rather than at CMU.

(For more background on Hadoop and its extensions, see my blog post here.)

October 26, 2007

Microsoft rethinking data centers – the presentation

Filed under: Infrastructure — chucklam @ 12:43 am

Chuck Thacker from Microsoft (project lead for the Xerox Alto as well as co-inventor of Ethernet LAN) gave a talk today at Stanford about rethinking the design of datacenters. The PowerPoint presentation is here.

The presentation starts with the problem that today’s datacenters are not designed from a systems point of view. The “packaging” of datacenters is suboptimal, and he praised the approach of Sun’s black box (i.e., a self-contained data center in a standardized shipping container; see slide #5). I was a bit surprised by that since, well… I just didn’t quite expect Microsoft to speak well of Sun. Furthermore, I asked him what he thought of Sun’s Niagara CPU, and he said it’s also a good re-thinking of CPU design, but with the caveat that not many people are still using the Solaris OS.

Chuck went on to talk about using custom designs for power, computers, management, and networking that are optimized for today’s datacenters. Half the talk is about networking since that’s the audience’s interest. Surprisingly there’s only one slide on datacenter management. He claims that Microsoft is “doing pretty well here” and “opex isn’t overwhelming.” Management can be better with more sensors and use of machine learning to predict failures. He mentioned that Microsoft is approaching one admin per 5,000 machines. At that scale I do suppose further improvements may not have much financial impact.

October 21, 2007

Microsoft rethinking data centers

Filed under: Infrastructure — chucklam @ 9:51 pm

Here’s an upcoming talk at Stanford titled “Rethinking Data Centers” by Charles Thacker of Microsoft. Many pundits have claimed that the ability to build sophisticated data centers is an important strategic advantage in the future. It’ll be interesting to get Microsoft’s insight on this. And the free lunch doesn’t hurt either… 🙂

Title:  Rethinking Data Centers
Speaker: Charles Thacker,
Technical Fellow, Microsoft Research

When:  12:15 PM, Thursday October 25th, 2007
Where:  Room 101, Packard Building

Lunch will be available at 11:45AM.


Microsoft builds several data centers each year, with each center containing in excess of fifty thousand servers.  This is quite expensive, and we’d like to reduce the capital and operating expenses for these centers. This talk suggests that an effective way to do this is to treat the data center as a system, rather than focusing on the individual components such as networking and servers.  I’ll describe one way this could work, with a particular focus on the networking infrastructure.
