Data Strategy

June 27, 2009

Netflix prize has been won!!

Filed under: Uncategorized — chucklam @ 1:31 am

News via Geeking with Greg. Team Bellkor’s Pragmatic Chaos has achieved greater than 10% improvement on the Netflix Prize.
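For readers new to the contest: the Netflix Prize measured improvement as the relative reduction in root mean squared error (RMSE) against Netflix's own Cinematch system. The sketch below shows that arithmetic. The ratings and predictions are made up for illustration and are not contest data, and the function names are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def improvement_pct(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction over the baseline, in percent."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

# Made-up ratings (1-5 stars) and two made-up sets of predictions.
actual = [4, 3, 5, 2, 1]
baseline_predictions = [3, 3, 4, 3, 2]
candidate_predictions = [4, 3, 4, 2, 2]

baseline_rmse = rmse(baseline_predictions, actual)
candidate_rmse = rmse(candidate_predictions, actual)
gain = improvement_pct(baseline_rmse, candidate_rmse)
print(f"baseline={baseline_rmse:.4f} candidate={candidate_rmse:.4f} gain={gain:.2f}%")
```

On real contest data the gains were far smaller than in this toy example, which is why crossing 10% took years.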

April 2, 2009

Amazon Elastic MapReduce, and other stuff I don’t have time to grok yet

Filed under: Infrastructure, Uncategorized — chucklam @ 4:54 am

Lots of good stuff has been coming to my attention lately.

  • Amazon just announced their Amazon Elastic MapReduce program. Sounds like the main point of this service is to simplify setting up a Hadoop cluster in the cloud, and Amazon charges you a little extra above the normal EC2 and S3 costs for this service. Not clear to me yet why people will pay the extra cost instead of running their own instance of Hadoop on EC2. I mean, you can just read Chapter 4 of my book and do this all by yourself easily 😉 I hope to look more into this service over the weekend. At the very least this is a sign that a meaningful number of Amazon Web Services’ customers are using the EC2 cloud to run Hadoop, and so Amazon has decided to focus on making it easier.
  • The March issue of the IEEE Data Engineering Bulletin is a special issue on data management on cloud computing platforms. It has papers written by academics as well as from Yahoo and IBM. Haven’t had time to read it yet, but it looks like Hadoop and Amazon EC2 are mentioned a lot.
  • Just heard about the open source Sector-Sphere project, which is a system for distributed storage and computation using commodity computers. In other words, it’s an alternative framework to Hadoop but it has a lot of architectural differences. It seems to be just the work of a few academics so far. I hope to play around with it… when I can find time from work and writing the book…

March 6, 2009

Quick notes

Filed under: Uncategorized — chucklam @ 3:22 am

Stephen Wolfram, creator of Mathematica (which I used to use a lot) and author of A New Kind of Science, blogged about his current project, Wolfram|Alpha. The web site won’t fully launch till May, and his blog post is lacking in details, but the project seems to be some combination of search engine, natural language processing, and expert-curated semantic models. Definitely something to watch.

Yahoo Developer Network has a new video on how they use Hadoop to analyze and filter spam. This is part 1 of a series, and it’s just background info on the size of their spam problem and how Hadoop is more scalable than their previous DB solution. Hopefully future episodes will have more meat.

February 12, 2009

My book on Hadoop

Filed under: Uncategorized — chucklam @ 4:08 am

Posting here has been light for a while. Lately my writing time has gone to a new book on Hadoop. It will be published by Manning with the title Hadoop in Action. Yesterday Manning released it in their early access program. You can check it out and pre-order it at

January 17, 2009

Discount to Predictive Analytics World

Filed under: Datamining, People and Data — chucklam @ 4:07 pm

The first Predictive Analytics World is coming to San Francisco on Feb. 18 and 19. It has a great line-up of accomplished speakers. Usama Fayyad (of Yahoo) and Andreas Weigend (formerly of Amazon and a good friend of mine) are keynote speakers. The Netflix Progress Prize winner “BellKor in BigChaos,” which achieved a 9.63% improvement over Netflix’s baseline system, will speak about their system in a case study.

Readers of this blog can get a 15% discount to the conference if you use the registration code: datastrategypaw09. The conference promises to be a good overview of the state of the industry in data analytics.

The conference organizers are also having an online survey to examine trends in the data analytic tools and industry. You can take the survey at


November 30, 2008

Have a big, open dataset? Make it available on Amazon!

Filed under: Data Collection — chucklam @ 6:18 pm

I was using the Amazon cloud services and came across this new feature that they haven’t publicized much yet. It’s the AWS Hosted Public Data Sets. They basically host public datasets for free that any AWS EC2 instance can access. Right now they have some datasets provided by The US Census Bureau and the Bureau of Labor Statistics. An exciting one they’ll be adding soon is the annotated human genome. If you have a large dataset in the public domain, scroll to the bottom of this page to let Amazon know that you want to contribute. In fact, you can already publish your dataset to Amazon’s EBS storage yourself and just make it public. The downside is that you’ll have to pay for the hosting cost, which is strictly dependent on the size of your data but is generally cheap. You’ll also have to promote the dataset’s availability yourself, as it won’t be listed on Amazon’s Web site.

What I wish to see in the short term is that various government agencies make their data available on Amazon. For example, the USPTO should release all its patent data. These data are already paid for by the taxpayers and should be freely available to the public. If you know anyone working in government agencies, tell them to look into this Amazon service.

Of course, such datasets should also just be available openly on the Web, so users don’t have to pay Amazon just to get access to the data. Storage is cheap and downloading over the Web is easy. I’m still amazed by some organizations that insist on sending their data on DVDs (yes… I’m thinking of the USPTO again…)

On a more philosophical level, I’m thinking about how Amazon’s business has expanded from selling books to publishing books and now they’re publishing data too.

October 31, 2008

Stanford engineering classes online

Filed under: Datamining, Information Retrieval — chucklam @ 3:06 am

Just found out that some of my favorite Stanford engineering classes have published their videos online for the general public. Readers of this blog may be especially interested in Chris Manning’s class on Natural Language Processing and Andrew Ng’s class on Machine Learning. In addition, having been an EE person myself, I can highly recommend Steve Boyd’s classes on Linear Systems for those who can deal with more advanced math. He teaches Introduction to Linear Dynamical Systems and Convex Optimization I/II.

A few more classes are also available. Go to the course listing page here.

October 13, 2008

Google Technology RoundTable: MapReduce

Filed under: Datamining, Infrastructure — Tags: — chucklam @ 2:27 am

Google has released a series of YouTube interviews with their lead engineers. Embedded below is one about MapReduce. The four engineers interviewed include the inventors of MapReduce. Some quotes:

6:17 – If we haven’t had to deal with [machine] failures… we would have probably never implemented MapReduce. Because without having to support failures, the rest of the machine code is just not that complicated.

7:20 – (Interviewer) What do you feel the technology [MapReduce] isn’t applicable for?… (Sanjay Ghemawat, Google Fellow) you can always squint at [a problem] in the right way… you can usually find a way to express it as a MapReduce…, but sometimes you have to squint at things in a pretty strange way to do this… For example, suppose you want to compute the cross correlation of every single pair of web pages in terms of saying what is the similarity… I can run a pass where I just sort of magnify the input into the cross product of the inputs and then I can apply a function on each pair in there saying how similar it is. Your intermediate data will be quadratic in the size of the input, so you probably don’t want to do it that way. So you’ll have to think a bit more carefully what your intermediate data is in that case… There’s a lot of thinking at the application level if you want to use MapReduce in that scenario. [Emphasis mine]
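To make the quadratic blow-up concrete, here is a small single-process sketch in plain Python (no Hadoop involved). It compares the cross-product pass described in the quote with one commonly used alternative: build an inverted index first and only emit pairs of documents that share a term. The alternative is an illustration, not something described in the interview. The toy documents, the Jaccard similarity measure, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Toy "web pages", each reduced to a set of terms (made-up data).
docs = {
    "a": {"hadoop", "mapreduce"},
    "b": {"mapreduce", "google"},
    "c": {"netflix", "prize"},
    "d": {"hadoop", "amazon"},
}

def jaccard(x, y):
    """Set overlap divided by set union."""
    return len(x & y) / len(x | y)

def naive_pairs(docs):
    """Cross-product pass: one intermediate record per pair of documents."""
    return {
        (p, q): jaccard(docs[p], docs[q])
        for p, q in combinations(sorted(docs), 2)
    }

def indexed_pairs(docs):
    """Two passes that only materialize pairs sharing at least one term."""
    # Pass 1 (map + shuffle): group document ids by term.
    postings = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].append(doc_id)
    # Pass 1 reduce / pass 2 map: emit ((p, q), 1) for each shared term.
    shared = defaultdict(int)
    for ids in postings.values():
        for p, q in combinations(sorted(ids), 2):
            shared[(p, q)] += 1
    # Pass 2 reduce: turn shared-term counts into Jaccard similarity.
    return {
        (p, q): n / (len(docs[p]) + len(docs[q]) - n)
        for (p, q), n in shared.items()
    }

naive = naive_pairs(docs)
indexed = indexed_pairs(docs)
print(len(naive), "records from the cross product;", len(indexed), "from the index")
```

The indexed version is not a free lunch: a term that appears in almost every document still produces a near-quadratic number of pairs, so very common terms usually need to be dropped or handled separately. That is the kind of application-level thinking the quote is pointing at.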

18:14 – (Matt Austern, SW engr) One of the core implementation issues in MapReduce is how you get the intermediate data from the Mappers to the Reducers. Every Mapper writes to every Reducer, and so it ends up making very heavy use of the network… (Interviewer) If you really want to provide a lot of computing, it’s very easy, one would think, to just buy lots more microprocessor… but the issue is communication between them… (Jerry Zhao, SW engr) Communication is not only the limit. How to coordinate the communication channel itself is also an interesting problem.
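The "every Mapper writes to every Reducer" point can be sketched in a few lines. The toy shuffle below routes each mapper's records to a reducer by hashing the key, and records which mapper-to-reducer transfers were needed. It is a simplified single-process illustration, not Google's or Hadoop's implementation. The CRC32-based partitioner and the sample records are assumptions chosen so the routing is deterministic.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key; every mapper must use the same function."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route (key, value) records from each mapper to its reducer.

    Returns the per-reducer grouped inputs and the set of
    (mapper, reducer) transfers that carried at least one record.
    """
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    transfers = set()
    for mapper_id, records in enumerate(mapper_outputs):
        for key, value in records:
            reducer_id = partition(key, num_reducers)
            reducer_inputs[reducer_id][key].append(value)
            transfers.add((mapper_id, reducer_id))
    return reducer_inputs, transfers

# Made-up word-count output from three mappers.
mapper_outputs = [
    [("hadoop", 1), ("google", 1)],
    [("hadoop", 1), ("amazon", 1)],
    [("google", 1), ("amazon", 1)],
]

reducer_inputs, transfers = shuffle(mapper_outputs, num_reducers=2)
print(len(transfers), "of", len(mapper_outputs) * 2, "possible transfers used")
```

With M mappers and R reducers there can be up to M × R transfers, since any mapper may hold keys destined for any reducer. That product is what makes the shuffle so heavy on the network as clusters grow.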

20:17 – MapReduce was originally designed as a batch processing system for large quantity of data. But we see our users are using MapReduce for relatively small set of data but have very strict latency requirement.

This is probably beside the point, but everyone in the video except maybe Sanjay sounds really scripted and robotic…

August 29, 2008

OpenCalais: Semantic Processing as a Web Service

Filed under: Uncategorized — chucklam @ 2:57 am

I’ve recently discovered OpenCalais and found its concept to be really interesting. OpenCalais is a web service created by Thomson Reuters for extracting semantic entities from natural language text. The quickest way to understand it is to check out the demo app here. You can copy-and-paste some text into the entry box and see OpenCalais do its semantic processing on the text. For example, I pasted this sentence I just read in the New York Times, “Judy Estrin, who has built several Silicon Valley companies and was the chief technology officer of Cisco Systems, says Silicon Valley is in trouble.” The demo app picks out the references to a “Person” named “Judy Estrin” as well as a “Company” named “Cisco Systems.” In addition, it picks out a “Quotation” by person “Judy Estrin” with quote “Silicon Valley is in trouble.” It also picks out a “Person Professional Past” relationship between a person of “Judy Estrin”, a position of “chief technology officer”, and a company of “Cisco Systems.”

Now, just imagine that kind of natural language processing capability available as a Web service API, and that is OpenCalais.

The OpenCalais team will be presenting at various events in September, being in Palo Alto on the 3rd and San Francisco on the 4th.

August 1, 2008

“Collaborative filtering” helps drive Digg usage

Filed under: Datamining, Information Retrieval, People and Data, Personalization — chucklam @ 3:25 am

Digg released their “collaborative filtering” system a month ago. Now they’ve blogged about some of the initial results. While it’s an obviously biased point of view, things look really good in general.

  • “Digging activity is up significantly: the total number of Diggs increased 40% after launch.”
  • “Friend activity/friends added is up 24%.”
  • “Commenting is up 11% since launch.”

What I find particularly interesting here is the historical road that “collaborative filtering” has taken. The term “collaborative filtering” was first coined by Xerox PARC more than 15 years ago. Researchers at PARC had a system called Tapestry. It allowed users to “collaborate to help one another perform filtering by recording their reactions to documents they read.” This, in fact, was a precursor to today’s Digg and Delicious.

Soon after PARC created Tapestry, automated collaborative filtering (ACF) was invented. The emphasis was to automate everything and make its usage effortless. Votes were implied by purchasing or other behavior, and recommendation was computed in a “people like you have also bought” style. This style of recommendation was so successful at the time that it has completely taken over the term “collaborative filtering” ever since.
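A minimal sketch of that “people like you have also bought” idea, assuming purchases are the only signal: treat each purchase as an implied vote, measure how alike two users are by the number of items they share, and recommend what similar users own that the target user does not. The data and the `recommend` function are hypothetical, and real systems use more careful similarity measures.

```python
from collections import Counter

# Made-up purchase histories; each purchase counts as an implied vote.
purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "lamp"},
    "cat": {"book", "mug"},
    "dan": {"pen", "mug"},
}

def recommend(user, purchases, top_n=3):
    """Score items the user lacks by the overlap with each other user."""
    mine = purchases[user]
    scores = Counter()
    for other, basket in purchases.items():
        if other == user:
            continue
        overlap = len(mine & basket)  # crude "people like you" similarity
        if overlap == 0:
            continue  # users with nothing in common contribute nothing
        for item in basket - mine:
            scores[item] += overlap
    return scores.most_common(top_n)

print(recommend("bob", purchases))
```

Nothing here asks the user to rate or annotate anything, which is the contrast with the Tapestry style of explicit collaboration.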

In the Web 2.0 wave, companies like Digg and Delicious revived the Tapestry-style of collaborative filtering. (Although I’d be surprised if those companies had done so as a conscious effort.) They were in a sense stripped-down versions of Tapestry, blown up to web scale, and made extremely easy to use. (The original Tapestry required one to write database-like queries.)

Now Digg, which one can think of as Tapestry 2.0, is adding ACF back into its style of recommendation and getting extremely positive results. Everything seems to have moved forward, and at the same time it seems to have come full circle.
