Data Strategy

July 10, 2008

Conference on Cloud Computing (July 19)

Filed under: Infrastructure — chucklam @ 4:22 am

Cloud Computing-the New Face of Computing-Promises and Challenges

9th IEEE/NATEA Annual Conference
2008 New Frontiers in Computing Technology

Date: July 19, 2008 (Saturday)
Location: Cubberley Auditorium at Stanford University

IEEE Computer Society – Santa Clara Valley Chapter
IEEE Stanford – Student Chapter
North America Taiwanese Engineers’ Association

Registration site:
The registration fee is $65 for regular, $60 for members, and $30 for students/unemployed

Cloud Computing denotes the latest trend in application development for Internet services, relying on clouds of servers to handle tasks that used to be managed by individual machines. With Cloud Computing, developers take important services, such as email, calendars, and word processing, and host them entirely online, powered by a vast array (or cloud) of interdependent commodity servers. Cloud Computing presents advantages for organizations seeking to centralize the management of software and data storage, with guarantees on reliability and security for their users. Recently, we have seen many efforts to commercialize the cloud, such as Amazon’s EC2/S3/SimpleDB, Google’s App Engine, Microsoft’s SQL Server data services and IBM’s “Blue Cloud” service. At the same time, open source projects such as Hadoop and ZooKeeper offer various software components that are essential for building a cloud infrastructure. We hope to bring together eminent researchers and practitioners from key research labs, companies, and open source communities to give us a quick overview of cloud computing. In addition, these speakers will present their views on the opportunities and challenges of cloud computing, from either a technology or a business perspective.

Hamid Pirahesh, IBM Almaden Research, Keynote Talk, “Impact of Cloud Computing on Emerging Software System Architecture and Solutions”
Jimmy Lin, University of Maryland at College Park, “Scalable Text Processing with MapReduce”
Jim Rivera, “Platform as a Service: Changing the Economics of Innovation”
Joydeep Sen Sarma and Ashish Thusoo, Facebook, “Hive: Datawarehousing and Analytics on Hadoop”
Hairong Kuang, Yahoo, “Take an internal look at Hadoop”
Mano Marks, Google, “App Engine: Building a Scalable Web Application on Google’s infrastructure”
Kevin Beyer, IBM Almaden Research, “Jaql: Querying JSON data on Hadoop”
Mihai Budiu, Microsoft Research in Silicon Valley, “DryadLINQ – a language for data-parallel computation on computer clusters”
Jinesh Varia, Evangelist, Amazon Web Services, “Cloud Architectures – New way to design architectures by building it in the cloud”

If you have questions about this event, contact Howard Ho or Eric Louie.

IEEE Computer Society
Santa Clara Chapter
Eric Louie


June 26, 2008

Subtlety in measuring Myspace vs. Facebook int’l traffic

Filed under: People and Data — chucklam @ 2:58 pm

Lately I’ve been seeing different blog posts about how Facebook has overtaken Myspace in international traffic. For example, last week Andrew Chen used Google Trends to find that Facebook is more popular than Myspace in a number of major countries. As an example, he plotted a graph for Australia and showed that Facebook beat Myspace around Oct. ’07.

I tried to create a similar graph for China, and sure enough, Facebook has been beating Myspace, right from the beginning.

The only problem is that the analysis is wrong in this case.

I was at a conference in China last summer, and while I was chilling with a Tsingtao in the hotel room, I noticed that Myspace was advertising heavily on the equivalent of MTV.

And the thing was, they weren’t promoting They were promoting

As for Facebook, as far as I know, they don’t have a Chinese version. returns an error while just forwards to

Redoing the analysis, we compare the traffic of versus We see that Myspace had actually pulled away from Facebook around Oct ’07 and had been maintaining the lead since.

Furthermore, Facebook’s China users are concentrated around Shanghai and Beijing, and they search for things like ‘shanghai map’, ‘beijing airport’, and ‘beijing map’. In other words, they’re expats.

Basically, Myspace had created a site in Chinese and was trying to build up a domestic user base. Facebook, on the other hand, is relying on the spread of American influence in other countries.

My impression is that Facebook is getting all its traffic at, but Myspace is more inconsistent. I took a quick look for Australia and it seems like is the official homepage. In that case, Andrew’s chart for Australia is accurate. I haven’t looked at other countries, but my point is that this kind of analysis is pretty subtle. Just looking at Google charts without knowing the context can be misleading sometimes.

One last note. The battle between Myspace and Facebook is moot when you factor in the really big social networks in China. This is what happens when is included in the graph.

April 19, 2008

Amazon CTO to give talks on Internet infrastructure

Filed under: Infrastructure — chucklam @ 6:15 pm

Werner Vogels, Amazon’s CTO, is giving two talks at Stanford on Tuesday and Wednesday. I hope to attend the Wednesday talk on lessons learned from building Amazon’s infrastructure. The talk’s “focus will be on state management which is one of the dominating factors in the scalability, reliability, performance and cost-effectiveness of the overall system.” More details below:

Stanford EE Computer Systems Colloquium

4:15PM, Wednesday, April 23, 2008
HP Auditorium, Gates Computer Science Building B01

A Head in the Cloud – The Power of Infrastructure as a Service

Werner Vogels

About the talk: Building the right infrastructure that can scale up or down at a moment’s notice can be a complicated and expensive task, but it’s essential in today’s business landscape. This applies to an enterprise trying to cut costs, a young business unexpectedly saturated with customer demand, or a start-up looking to launch. There are many challenges when building a reliable, flexible architecture that can manage unpredictable behaviors of today’s internet business. This presentation will review some of the lessons learned from building one of the world’s largest distributed systems. The focus will be on state management which is one of the dominating factors in the scalability, reliability, performance and cost-effectiveness of the overall system.

Also of interest

Werner Vogels will also speak on “Distributed Cloud Computing” in the Clean Slate Seminar on April 22, 2008, 4-5PM in Packard 101. The Clean Slate Seminar (CS541) is part of the Clean Slate Internet Design research program, which is aimed at addressing two broad and ambitious research questions: “With what we know today and if we were to start again with a clean-slate, how would we design a global communications infrastructure?” and “How should the Internet look in 15 years?”

March 14, 2008

Seeing Netflix data as more than just a bunch of numbers

It’s a truism among dataminers that analyzing certain data can help us understand people. However, dataminers rarely see that psychology, the discipline of understanding people, can help get more value out of such data. It was recently reported that one of the top ten contestants in the Netflix Prize approached the challenge from a psychologist’s point of view rather than from a computer scientist’s.

For example, people don’t often give their “true” rating on movies. Instead, they can be biased by anchoring. That is, their rating of a movie is influenced by the ratings they had just given earlier for other movies. Adjusting for biases such as this is how Gavin Potter, aka “Just a guy in a garage,” got to be number 9 on the Netflix Prize leaderboard.
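To make the anchoring idea concrete, here is a minimal sketch in Python. It is not Potter's actual method, which the report doesn't spell out. It simply assumes that a user's ratings are available in the order they were entered, estimates how strongly each rating tracks the one just before it, and subtracts that estimated pull. The function names and the one-step linear model are my own assumptions for illustration.

```python
from statistics import mean


def anchoring_coefficient(ratings):
    """Estimate how much each rating follows the previous one.

    Least-squares slope (no intercept) of each rating's deviation from the
    user's mean on the previous rating's deviation. Hypothetical model:
    a positive value means ratings drift toward the one just given.
    """
    mu = mean(ratings)
    dev = [r - mu for r in ratings]
    prev, curr = dev[:-1], dev[1:]
    denom = sum(p * p for p in prev)
    if denom == 0:
        # No variation in the preceding ratings, so nothing to estimate.
        return 0.0
    return sum(p * c for p, c in zip(prev, curr)) / denom


def adjust_for_anchoring(ratings):
    """Return ratings with the estimated pull of the previous rating removed."""
    mu = mean(ratings)
    beta = anchoring_coefficient(ratings)
    # The first rating has no predecessor, so it is left unchanged.
    adjusted = [float(ratings[0])]
    for prev, curr in zip(ratings, ratings[1:]):
        adjusted.append(curr - beta * (prev - mu))
    return adjusted


if __name__ == "__main__":
    session = [5, 5, 4, 2, 1, 1]  # made-up ratings, in the order entered
    print(anchoring_coefficient(session))
    print(adjust_for_anchoring(session))
```

A real system would estimate the effect across many users and would have to separate anchoring from the simple fact that people tend to rate similar movies in batches. The point is only that the order of the ratings carries information that a plain user-by-movie matrix throws away.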

February 20, 2008

First Hadoop Summit to be held at Yahoo

Filed under: Infrastructure — chucklam @ 4:00 pm

On March 25. See announcement here. This follows an announcement by Yahoo that they’ve deployed a Hadoop application that runs on a 10,000-core Linux cluster.

January 20, 2008

Datamining voters

Filed under: Data Collection, Datamining, People and Data, Privacy — chucklam @ 11:33 pm

Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine such data to improve sales and profitability. However, it’s much less publicized how politicians also collect and mine voter information to help win elections.

A recent article in Vanity Fair called Big Brother Inc. digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”

“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”

The article also makes an interesting point about hardball tactics in politics: although lists of registered voters are technically public information, they are sometimes not easy to obtain. Local party bosses can make it difficult or expensive for opposing candidates to get hold of those lists, which puts those opponents at a disadvantage in getting their message out to the right audience.

January 17, 2008

The Google Online Marketing Challenge

Filed under: Advertising — chucklam @ 11:39 pm

Google is encouraging universities to teach their students “online” marketing (i.e. how to use AdWords) by hosting an Online Marketing Challenge:

Here’s how the Challenge works: Your students will receive US$200 in Google ads to drive traffic to a business website of their choosing. Students will compete with groups from their institution along with student teams from all over the world.

This is not a simulation; students gain real-world experience with a real client. You have a great student project; clients gain free advertising and Internet consulting. Google provides US$200 in vouchers, teaching materials and other resources.

Encryption versus the Fifth Amendment

Filed under: People and Data — chucklam @ 3:54 am

As usual, Nick Carr has spotted some interesting news:

As the Washington Post reports today, the encryption conflict is now coming to a head. [Sebastien Boucher], accused of storing child pornography on his computer, has refused to provide police with the password required to unlock the encrypted files on his hard drive. He claims that disclosing the password would violate his Fifth Amendment right to avoid self-incrimination. A judge backed his claim, and the government is now appealing that ruling in the federal courts.

The judge’s reasoning, as cited in the original Washington Post article, is quite fascinating:

In his ruling, [Judge] Niedermeier said forcing Boucher to enter his password would be like asking him to reveal the combination to a safe. The government can force a person to give up the key to a safe because a key is physical, not in a person’s mind. But a person cannot be compelled to give up a safe combination because that would “convey the contents of one’s mind,” which is a “testimonial” act protected by the Fifth Amendment, Niedermeier said.

The judge also said that “If Boucher does know the password, he would be faced with the forbidden trilemma: incriminate himself, lie under oath, or find himself in contempt of court.”

I’ve heard of many dilemmas, but trilemma… that’s a new one for me.

January 15, 2008

Jon Kleinberg to speak at Stanford tomorrow

Filed under: Datamining, Network analysis, People and Data, Privacy — chucklam @ 5:13 pm

Date & time: Wednesday, January 16 4:35-5:45 pm
Location: 380-380C
Speaker: Jon Kleinberg, Cornell University

Title: Computational Perspectives on Large-Scale Social Network Data


The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.

We consider two lines of research within this general theme. The first is concerned with modeling the flow of information through a large network: the spread of new ideas, technologies, opinions, fads, and rumors can be viewed as unfolding with the dynamics of an epidemic, cascading from one individual to another through the network. This suggests a basis for computational models of such phenomena, with the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving.

The second line of research we consider is concerned with the privacy implications of large network datasets. An increasing amount of social network research focuses on datasets obtained by measuring the interactions among individuals who have strong expectations of privacy. To preserve privacy in such instances, the datasets are typically anonymized — the names are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed. Unfortunately, there are fundamental limitations on the power of network anonymization to preserve privacy; we will discuss some of these limitations (formulated in joint work with Lars Backstrom and Cynthia Dwork) and some of their broader implications.

Speaker bio:

Jon Kleinberg is a Professor in the Department of Computer Science at Cornell University. His research interests are centered around issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a Fellow of the American Academy of Arts and Sciences, and the recipient of MacArthur, Packard, and Sloan Foundation Fellowships, the Nevanlinna Prize from the International Mathematical Union, and the National Academy of Sciences Award for Initiatives in Research.

– for people who love large data sets

Filed under: Data Collection, Datamining, Infrastructure — chucklam @ 4:50 pm

Aaron Swartz just announced a new Web site he created for people who love large data sets, “the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.”

The site is very spartan right now, but it can certainly become very interesting if it attracts the right contributors. I hope he succeeds in building a community around it.
