Data Strategy

October 28, 2007

Orkut on Orkut

Filed under: Network analysis, People and Data — chucklam @ 10:57 pm

Orkut Buyukkokten, founder of, will talk at Stanford today about “Who do you know: The Social Network Revolution”. The talk will be in Gates 498 at 5pm.


Online social networks fundamentally change the way we get connected.
The people we cross paths with have the biggest influence in our
lives. Now it’s easier to cross paths than ever as we are much closer
and so much more connected. In this talk I will discuss the
motivation behind the development of, touch on the social
and technical aspects of implementing and maintaining a system that
has over 60 million users and reflect on the lessons we learned.


Orkut Buyukkokten is a software engineer and product manager at
Google. He received his PhD in Computer Science from Stanford in
2002. He has been building and working on online communities for the past six years. His interests include social networks, interface design
and mobile applications.

October 26, 2007

Microsoft rethinking data centers – the presentation

Filed under: Infrastructure — chucklam @ 12:43 am

Chuck Thacker from Microsoft (project lead for the Xerox Alto as well as co-inventor of Ethernet LAN) gave a talk today at Stanford about rethinking the design of datacenters. The PowerPoint presentation is here.

The presentation starts with the problem of today’s datacenters not being designed from a systems point of view. The “packaging” of datacenters is suboptimal, and he praises the approach of Sun’s black box (i.e. self-contained data center in a standardized shipping container, see slide #5). I was a bit surprised by that since well… I just didn’t quite expect Microsoft to speak well of Sun. Furthermore, I asked him what he thought of Sun’s Niagara CPU and he also said it’s a good re-thinking of CPU design, but with the caveat that not many people left are using Solaris OS.

Chuck went on to talk about using custom designs for power, computers, management, and networking that are optimized for today’s datacenters. Half the talk is about networking since that’s the audience’s interest. Surprisingly there’s only one slide on datacenter management. He claims that Microsoft is “doing pretty well here” and “opex isn’t overwhelming.” Management can be better with more sensors and use of machine learning to predict failures. He mentioned that Microsoft is approaching one admin per 5,000 machines. At that scale I do suppose further improvements may not have much financial impact.

October 21, 2007

Microsoft rethinking data centers

Filed under: Infrastructure — chucklam @ 9:51 pm

Here’s an upcoming talk at Stanford titled “Rethinking Data Centers” by Charles Thacker of Microsoft. Many pundits have claimed that the ability to build sophisticated data centers is an important strategic advantage in the future. It’ll be interesting to get Microsoft’s insight on this. And the free lunch doesn’t hurt either… 🙂

Title:  Rethinking Data Centers
Speaker: Charles Thacker,
Technical Fellow, Microsoft Research

When:  12:15 PM, Thursday October 25th, 2007
Where:  Room 101, Packard Building

Lunch will be available at 11:45AM.


Microsoft builds several data centers each year, with each center containing in excess of fifty thousand servers.  This is quite expensive, and we’d like to reduce the capital and operating expenses for these centers. This talk suggests that an effective way to do this is to treat the data center as a system, rather than focusing on the individual components such as networking and servers.  I’ll describe one way this could work, with a particular focus on the networking infrastructure.

October 20, 2007

Statistical models beating experts

Filed under: Datamining, Pattern recognition — chucklam @ 9:55 pm

Ian Ayres published an edited excerpt from his book ‘Super Crunchers: How Anything Can Be Predicted’ in the Financial Times back in August. The piece revolves around the idea that “Since the 1950s, social scientists have been comparing the predictive accuracies of number crunchers and traditional experts – and finding that statistical models consistently outpredict experts.” This is hardly news to anyone who has studied pattern recognition. While statistical models are much worse than the average person on “simple” tasks (e.g. speech recognition), they generally outperform “experts” on “intelligent” tasks (e.g. medical diagnosis).

However, Ian still manages to cite examples that are interesting but not well known (at least to me). Six years ago, two political scientists, Andrew Martin and Kevin Quinn, developed a system that uses “just a few variables concerning the politics of the case” to predict how the US Supreme Court justices would vote. As a friendly contest, that system was pitted against a panel of 83 “legal experts – esteemed law professors, practitioners and pundits who would be called upon to predict the justices’ votes for cases in their areas of expertise.” The task “was to predict in advance the votes of the individual justices for every case that was argued in the Supreme Court’s 2002 term.” The statistical system won. It had a 75% accuracy versus 59.1% from the experts.

Another interesting example is that of Orley Ashenfelter, an economist at Princeton University. He devised a formula for predicting wine quality: Wine quality = 12.145 + 0.00117 winter rainfall + 0.0614 average growing season temperature – 0.00386 harvest rainfall. It’s not clear whether that equation outperforms experts. However, the equation seems good enough that it ruffled the feathers of quite a few wine snobs 🙂
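
To make the formula concrete, here is a minimal Python sketch that just evaluates it. The coefficients come straight from the excerpt. The function name, the parameter names, and the units (millimeters of rain, degrees Celsius) are my assumptions, since the excerpt doesn't state units.

```python
def wine_quality(winter_rainfall, avg_growing_season_temp, harvest_rainfall):
    """Evaluate the wine quality formula quoted in the excerpt.

    Assumed units (not stated in the excerpt): rainfall in millimeters,
    temperature in degrees Celsius.
    """
    return (
        12.145
        + 0.00117 * winter_rainfall          # wet winters help
        + 0.0614 * avg_growing_season_temp   # warm growing seasons help
        - 0.00386 * harvest_rainfall         # rain at harvest hurts
    )


if __name__ == "__main__":
    # Made-up inputs for illustration, not real vintage data.
    print(wine_quality(600, 17, 100))
```

The signs tell the story: a wet winter and a warm growing season raise the score, while rain at harvest lowers it.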

October 18, 2007

How Acxiom uses offline data for behavioral targeting

Filed under: Advertising, Data Collection, Datamining, People and Data, Personalization — chucklam @ 12:37 am

A really fascinating piece at the WSJ yesterday: Firm Mines Offline Data To Target Online Ads (subscription req.). There’s a particular side-bar that reveals how Acxiom uses offline data for behavioral targeting:

How Acxiom delivers personalized online ads:

  1. Acxiom has accumulated a database of about 133 million households and divided it into 70 demographic and lifestyle clusters based on information available from… public sources.
  2. A person gives one of Acxiom’s Web partners his address by buying something, filling out a survey or completing a contest form on one of the sites.
  3. In an eyeblink, Acxiom checks the address against its database and places a “cookie,” or small piece of tracking software, embedded with a code for that person’s demographic and behavioral cluster on his computer hard drive.
  4. When the person visits an Acxiom partner site in the future, Acxiom can use that code to determine which ads to show…
  5. Through another cookie, Acxiom tracks what consumers do on partner Web sites…

It’s an interesting approach. While Acxiom has offline information on “133 million households,” it’s not clear how many households actually have gotten the offline-tracking cookie.
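
The flow in the side-bar can be sketched in a few lines of Python. Everything here is hypothetical: the tiny lookup tables, the function names, and the cookie layout are stand-ins I made up for illustration, not anything the article describes about Acxiom's actual system.

```python
from typing import Optional

# Hypothetical stand-in for the household database:
# normalized address -> cluster code (the article mentions 70 clusters).
HOUSEHOLD_CLUSTERS = {
    "123 main st, springfield": 12,
    "9 elm ave, riverton": 47,
}

# Hypothetical mapping from cluster code to the ad to show.
ADS_BY_CLUSTER = {12: "minivan ad", 47: "golf resort ad"}
DEFAULT_AD = "generic ad"


def normalize(address: str) -> str:
    """Lowercase the address and collapse runs of whitespace."""
    return " ".join(address.lower().split())


def cluster_cookie(address: str) -> Optional[dict]:
    """Steps 2-3: match an address and build a cookie carrying the cluster code."""
    code = HOUSEHOLD_CLUSTERS.get(normalize(address))
    if code is None:
        return None  # household not in the database, so no cookie is set
    return {"cluster": code}


def choose_ad(cookie: Optional[dict]) -> str:
    """Step 4: on a later visit, pick an ad from the cluster code in the cookie."""
    if not cookie:
        return DEFAULT_AD
    return ADS_BY_CLUSTER.get(cookie["cluster"], DEFAULT_AD)


if __name__ == "__main__":
    cookie = cluster_cookie("123  Main St, Springfield")
    print(choose_ad(cookie))
```

Note that the cookie only carries a cluster code, not the address itself, which matches the article's description of a code "for that person's demographic and behavioral cluster."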

One can imagine Google really taking advantage of an approach like this. The widespread use of Google Analytics across web sites already gives Google the potential to track your surfing habits. Specifying your home address when you use Google Maps or Checkout allows Google to match you with offline marketing databases. And we haven’t even talked about your query data and your emails yet…

October 16, 2007

On building the Facebook platform

Filed under: Network analysis, People and Data — chucklam @ 4:12 am

An upcoming presentation on the design decisions behind the Facebook platform.

Building the Facebook Platform

6:30 PM – 9:00 PM October 23, 2007
Tibco Software Inc.
3301 Hillview Avenue, Building #2
Palo Alto, CA

Presentation Title: Building the Facebook Platform
Presentation Summary: In building Facebook Platform, we tried to design a system that was incredibly powerful while still easy for developers to use. We also needed to protect our users’ privacy and ensure that the site remained valuable and enjoyable for users. In this talk, we’ll go over the technical decisions we made to accomplish these goals. We’ll talk about the design of the API, FQL, FBML, and the numerous ways in which applications can integrate with the Facebook site.

– Ari Steinberg, Facebook
– Charlie Cheever, Facebook

October 14, 2007

Interesting job posting for a click fraud analyst

Filed under: Advertising, Datamining, Pattern recognition — chucklam @ 2:09 am

This is one of the most interesting job posts I’ve come across in a while. It’s a datamining contest in which contestants with interesting approaches get to apply for an analyst position. No word on whether the contest itself has any other prize 🙂

Data Mining Contest: Uncover Criminal Activity in a Real Fraud Case

The purpose of the contest is to identify a particular type of highly sophisticated fraud that appeared in a pay-per-click advertising program. The fraud targeted one advertiser only. Participants providing an interesting answer will be invited to apply for a Sr. Fraud Analyst position with Authenticlick. The small spreadsheet with the fraudulent click data can be downloaded here. Your answers should be emailed to vlg @

The two questions are

  • Why are these clicks fraudulent?
  • What type of click fraud is it?

You can check references such as our click scoring blog to help you answer the questions.

October 12, 2007

Controlled experiments on the Web

Filed under: Datamining, People and Data, Statistical experimentation — chucklam @ 2:49 am

Ronny Kohavi of Microsoft (previously Amazon) presented a paper this year at KDD called Practical Guide to Controlled Experiments on the Web. As far as I know, it’s the first “academic” paper on what’s often called A/B testing. I say “academic” in quotes because the paper is relatively lightweight and is geared towards an audience of industry practitioners.

Most people who work on A/B testing are computer scientists who know more about systems and databases than statistics, and unfortunately this paper doesn’t do much to correct that. (And by statistics I mean a specific body of knowledge that has been accumulated over the last couple hundred years, not pseudo-scientific Web 2.0 marketese like “long tail” or gratuitous name dropping involving Gauss, Bayes, and Pareto.) However, the paper does point out some system design and usability issues that Amazon and others have learned from their experience. For example, to maintain a consistent user experience, each user must be assigned to the same experimental group on multiple visits to the site. Since maintaining state under distributed servers introduces scaling and performance issues, group assignment based on user ID hashing is the preferred approach.
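
Here is a minimal sketch of what hash-based group assignment might look like in Python. The choice of SHA-256, the idea of salting the hash with an experiment name, and the default 50/50 split are my assumptions for illustration. They are not details taken from the paper.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically map a user to "treatment" or "control".

    The same (user_id, experiment) pair always hashes to the same bucket,
    so any server can compute the assignment without shared state.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_fraction else "control"


if __name__ == "__main__":
    # Hypothetical user ID and experiment name.
    print(assign_group("user-42", "new-checkout-button"))
```

Because the assignment is a pure function of the user ID, a returning user lands in the same group on every visit and on every server. Salting with the experiment name keeps a user's group in one experiment from determining their group in another.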

Given the lack of good technical publications on doing controlled experiments on the Web, this paper is certainly a welcome start.

October 8, 2007

Jimmy Wales on “Free Culture and the Future of Search”

Filed under: Collective wisdom, Information Retrieval, Open Data, Search — chucklam @ 2:10 am

Just got word that Jimmy Wales, founder of Wikipedia and Wikia, will be speaking at The Stanford Law and Technology Association this Thursday (Oct. 11) on the topic of “Free Culture and the Future of Search.” The talk will be in Room 190, Stanford Law School, at 12:45pm. Worth checking out if you’re in the area.

October 4, 2007

has interesting analyses of Facebook and MySpace

Filed under: Network analysis, People and Data — chucklam @ 6:02 pm

I just discovered a bunch of interesting posts on analyzing Facebook and MySpace. has a proprietary collection of Internet traffic data, so much of their analysis is quite unique. Links and notes of the posts I’ve read:

  • 14 million people interacted with Facebook Applications in August
    That’s out of 22 million visitors to Facebook. In terms of activity, picture browsing (16M) and profile browsing (21M) have more visitors. The post also shows stats like average time spent per visit.
  • MySpace vs. Facebook: The Party Starter Showdown
    “…in terms of traffic, Facebook is where MySpace was a good two years ago.” The post also had an interesting breakdown of early MySpace and Facebook users.


    Very few (1%) of the early MySpace users have “abandoned” it for Facebook. In fact, as a percentage, more early Facebook users have abandoned Facebook for MySpace (6%). This contradicts the general Silicon Valley anecdote that “everyone” is leaving MySpace for Facebook.

    Granted, the charts above were made in May, before Facebook opened up their platform. But still, while Facebook has attracted a lot of developers, have those developers developed apps that attract non-Facebook users to Facebook?

  • Top Social Networks: Facebook grows while MySpace slows
    This post provided data comparing growth rates between Facebook and MySpace. That Facebook has a higher growth rate is well reported, and honestly, not surprising. After all, they’re in different stages of growth. The interesting info from this post is the plot of Facebook usage by state. It’s surprisingly dense on the East Coast.


Having access to interesting data, the way has traffic data from ISPs and toolbars, enables a lot of interesting analysis. However, clever joining of public data can also give interesting results, as my correlation of Facebook usage with high school quality shows.
