Data Strategy

October 31, 2008

Stanford engineering classes online

Filed under: Datamining, Information Retrieval — chucklam @ 3:06 am

Just found out that some of my favorite Stanford engineering classes have published their videos online for the general public. Readers of this blog may be especially interested in Chris Manning’s class on Natural Language Processing and Andrew Ng’s class on Machine Learning. In addition, having been an EE person myself, I can highly recommend Steve Boyd’s classes on Linear Systems for those who can deal with more advanced math. He teaches Introduction to Linear Dynamical Systems and Convex Optimization I/II.

A few more classes are also available. Go to the course listing page here.

October 13, 2008

Google Technology RoundTable: MapReduce

Filed under: Datamining, Infrastructure — Tags: — chucklam @ 2:27 am

Google has released a series of YouTube interviews with their lead engineers. Embedded below is one about MapReduce. The four engineers interviewed include the inventors of MapReduce. SomeĀ  quotes:

6:17 – If we haven’t had to deal with [machine] failures… we would have probably never implemented MapReduce. Because without having to support failures, the rest of the machine code is just not that complicated.

7:20 – (Interviewer) What do you feel the technology [MapReduce] isn’t applicable for?… (Sanjay Ghemawat, Google Fellow) you can always squint at [a problem] at the right way… you can usually find a way to express it as a MapReduce…, but sometimes you have to squint at things in a pretty strange way to do this… For example, suppose you want to compute the cross correlation of every single pair of web pages in terms of saying what is the similiarity… I can run a pass where I just sort of magnify the input into the cross product of the inputs and then I can apply a function on each pair in there saying how similar it is. You intermediate data will be quadratic in the size of the input, so you probably don’t want to do it that way. So you’ll have to think a bit more carefully what your intermediate data is in that case… There’s a lot of thinking at the application level if you want to use MapReduce in that scenario. [Emphasis mine]

18:14 – (Matt Austern, SW engr) One of the core implementation issues in MapReduce is how you get the intermediate data from the Mappers to the Reducers. Every Mapper writes to every Reducer, and so it ends up making very heavy use of the network… (Interviewer) If you really want to provide a lot of computing, its very easy, one would think, to just buy lots more microprocessor… but the issue is communication between them… (Jerry Zhao, SW engr) Communication is not only the limit. How to coordinate the communication channel itself is also an interesting problem.

20:17 – MapReduce was originally designed as a batch processing system for large quantity of data. But we see our users are using MapReduce for relatively small set of data but have very strict latency requirement.

This is probably besides the point, but everyone in the video except maybe Sanjay sounds really scripted and robotic…

Create a free website or blog at