Data Strategy

January 17, 2009

Discount to Predictive Analytics World

Filed under: Datamining, People and Data — chucklam @ 4:07 pm

The first Predictive Analytics World is coming to San Francisco on Feb. 18 and 19. It has a great line-up of accomplished speakers. Usama Fayyad (of Yahoo) and Andreas Weigend (formerly of Amazon and a good friend of mine) are keynote speakers. The Netflix Progress Prize winner “BellKor in BigChaos,” which achieved a 9.63% improvement over Netflix’s baseline system, will speak about their system in a case study.

Readers of this blog can get a 15% discount on the conference by using the registration code: datastrategypaw09. The conference promises to be a good overview of the state of the industry in data analytics.

The conference organizers are also running an online survey to examine trends in data analytics tools and the industry. You can take the survey at


August 1, 2008

“Collaborative filtering” helps drive Digg usage

Filed under: Datamining, Information Retrieval, People and Data, Personalization — chucklam @ 3:25 am

Digg released their “collaborative filtering” system a month ago. Now they’ve blogged about some of the initial results. While it’s an obviously biased point of view, things look really good in general.

  • “Digging activity is up significantly: the total number of Diggs increased 40% after launch.”
  • “Friend activity/friends added is up 24%.”
  • “Commenting is up 11% since launch.”

What I find particularly interesting here is the historical road that “collaborative filtering” has taken. The term “collaborative filtering” was first coined by Xerox PARC more than 15 years ago. Researchers at PARC had a system called Tapestry. It allowed users to “collaborate to help one another perform filtering by recording their reactions to documents they read.” This, in fact, was a precursor to today’s Digg and Delicious.

Soon after PARC created Tapestry, automated collaborative filtering (ACF) was invented. The emphasis was on automating everything and making usage effortless. Votes were implied by purchases or other behavior, and recommendations were computed in a “people like you have also bought” style. This style of recommendation was so successful that it has completely taken over the term “collaborative filtering” ever since.
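To make the “people like you have also bought” idea concrete, here is a minimal sketch of one common way to do it: counting how often items show up together in the same purchase history. The purchase data and the function names (build_cooccurrence, also_bought) are made up for illustration, and real ACF systems are considerably more elaborate than this.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same user's history."""
    co = defaultdict(Counter)
    for items in baskets.values():
        # Use a set so repeat purchases by one user count only once.
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_bought(co, item, k=3):
    """Return up to k items most often bought alongside `item`."""
    return [other for other, _ in co[item].most_common(k)]


# Hypothetical purchase histories (user -> items bought); purchases act as implied votes.
baskets = {
    "ann": ["camera", "tripod", "sd_card"],
    "bob": ["camera", "sd_card"],
    "cat": ["camera", "tripod", "sd_card", "bag"],
    "dan": ["tripod", "bag"],
}

co = build_cooccurrence(baskets)
print(also_bought(co, "camera", k=2))  # ['sd_card', 'tripod']
```

Notice that nobody had to vote on anything explicitly. The recommendation falls out of behavior alone, which is exactly what made this style effortless for users.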

In the Web 2.0 wave, companies like Digg and Delicious revived the Tapestry style of collaborative filtering. (Although I’d be surprised if those companies had done so as a conscious effort.) They were in a sense stripped-down versions of Tapestry, blown up to web scale, and made extremely easy to use. (The original Tapestry required one to write database-like queries.)

Now Digg, which one can think of as Tapestry 2.0, is adding ACF back into its style of recommendation and getting extremely positive results. Everything seems to have moved forward, and at the same time it seems to have come full circle.

June 26, 2008

Subtlety in measuring Myspace vs. Facebook int’l traffic

Filed under: People and Data — chucklam @ 2:58 pm

Lately I’ve been seeing different blog posts about how Facebook has overtaken Myspace in international traffic. For example, last week Andrew Chen used Google Trends to find that Facebook is more popular than Myspace in a number of major countries. As an example, he plotted a graph for Australia and showed that Facebook beat Myspace around Oct. ’07.

I tried to create a similar graph for China, and sure enough, Facebook has been beating Myspace, right from the beginning.

The only problem is that the analysis is wrong in this case.

I was at a conference in China last summer, and while I was chilling with a Tsingtao in the hotel room, I noticed that Myspace was advertising heavily on the equivalent of MTV.

And the thing was, they weren’t promoting They were promoting

As for Facebook, as far as I know, they don’t have a Chinese version. returns an error while just forwards to

Redoing the analysis, we compare the traffic of versus We see that Myspace had actually pulled away from Facebook around Oct ’07 and had been maintaining the lead since.

Furthermore, Facebook’s China users are concentrated around Shanghai and Beijing, and they search for things like ‘shanghai map’, ‘beijing airport’, and ‘beijing map’. In other words, they’re expats.

Basically, Myspace had created a site in Chinese and was trying to build up a domestic user base. Facebook, on the other hand, is relying on the spread of American influence in other countries.

My impression is that Facebook is getting all its traffic at, but Myspace is more inconsistent. I took a quick look for Australia and it seems like is the official homepage. In that case, Andrew’s chart for Australia is accurate. I haven’t looked at other countries, but my point is that this kind of analysis is pretty subtle. Just looking at Google charts without knowing the context can be misleading sometimes.

One last note. The battle between Myspace and Facebook is moot when you factor in the really big social networks in China. This is what happens when is included in the graph.

March 14, 2008

Seeing Netflix data as more than just a bunch of numbers

It’s a truism among dataminers that analyzing certain data can help us understand people. However, dataminers rarely see that psychology, the discipline of understanding people, can help get more value out of such data. It was recently reported that one of the top ten contestants in the Netflix Prize approached the challenge from a psychologist’s point of view rather than from a computer scientist’s.

For example, people don’t often give their “true” rating on movies. Instead, they can be biased by anchoring. That is, their rating of a movie is influenced by the ratings they have just given to other movies. Adjusting for biases such as this is how Gavin Potter, aka “Just a guy in a garage,” got to be number 9 on the Netflix Prize leaderboard.
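As a rough illustration of what adjusting for anchoring could look like, here is a small sketch. To be clear, this is not Gavin Potter’s actual method, which the report doesn’t detail. It just assumes the simplest possible model: each rating is pulled toward the previous one by some fraction (beta) of how far the previous rating was from the user’s average. The function names and the sample ratings are made up.

```python
def anchoring_coefficient(ratings):
    """Least-squares estimate of how much each rating follows the previous one.

    Assumed model: (r_t - mean) = beta * (r_{t-1} - mean) + noise.
    """
    if len(ratings) < 2:
        return 0.0
    mean = sum(ratings) / len(ratings)
    prev = [r - mean for r in ratings[:-1]]
    curr = [r - mean for r in ratings[1:]]
    denom = sum(p * p for p in prev)
    if denom == 0:
        return 0.0
    return sum(p * c for p, c in zip(prev, curr)) / denom


def remove_anchoring(ratings):
    """Subtract the estimated pull of the previous rating from each rating."""
    if not ratings:
        return 0.0, []
    mean = sum(ratings) / len(ratings)
    beta = anchoring_coefficient(ratings)
    adjusted = [float(ratings[0])]  # the first rating has nothing to anchor on
    for prev, curr in zip(ratings, ratings[1:]):
        adjusted.append(curr - beta * (prev - mean))
    return beta, adjusted


# Hypothetical ratings from one user, in the order they were entered.
session = [5, 5, 4, 4, 2, 2, 1, 1]
beta, adjusted = remove_anchoring(session)
print(round(beta, 4), [round(a, 2) for a in adjusted])
```

A positive beta here only says that consecutive ratings move together. Part of that could be real (people often rate similar movies back to back), so a serious model would have to separate the two effects.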

January 20, 2008

Datamining voters

Filed under: Data Collection, Datamining, People and Data, Privacy — chucklam @ 11:33 pm

Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine such data to improve sales and profitability. However, it’s much less publicized how politicians also collect and mine voter information to help win elections.

A recent article in Vanity Fair called Big Brother Inc. digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”

“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”

Also, an interesting point in the article about hardball tactics in politics: Although lists of registered voters are technically public information, it is sometimes not easy to obtain them. Local party bosses can make it difficult or expensive for opposing candidates to get a hold of those lists, and thus the political opponents will be at a disadvantage in getting their message out to the right audience.

January 17, 2008

Encryption versus the Fifth Amendment

Filed under: People and Data — chucklam @ 3:54 am

As usual, Nick Carr has spotted some interesting news:

As the Washington Post reports today, the encryption conflict is now coming to a head. [Sebastien Boucher], accused of storing child pornography on his computer, has refused to provide police with the password required to unlock the encrypted files on his hard drive. He claims that disclosing the password would violate his Fifth Amendment right to avoid self-incrimination. A judge backed his claim, and the government is now appealing that ruling in the federal courts.

The judge’s reasoning, as cited in the original Washington Post article, is quite fascinating:

In his ruling, [Judge] Niedermeier said forcing Boucher to enter his password would be like asking him to reveal the combination to a safe. The government can force a person to give up the key to a safe because a key is physical, not in a person’s mind. But a person cannot be compelled to give up a safe combination because that would “convey the contents of one’s mind,” which is a “testimonial” act protected by the Fifth Amendment, Niedermeier said.

The judge also said that “If Boucher does know the password, he would be faced with the forbidden trilemma: incriminate himself, lie under oath, or find himself in contempt of court.”

I’ve heard of many dilemmas, but trilemma… that’s a new one for me.

January 15, 2008

Jon Kleinberg to speak at Stanford tomorrow

Filed under: Datamining, Network analysis, People and Data, Privacy — chucklam @ 5:13 pm

Date & time: Wednesday, January 16 4:35-5:45 pm
Location: 380-380C
Speaker: Jon Kleinberg, Cornell University

Title: Computational Perspectives on Large-Scale Social Network Data


The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.

We consider two lines of research within this general theme. The first is concerned with modeling the flow of information through a large network: the spread of new ideas, technologies, opinions, fads, and rumors can be viewed as unfolding with the dynamics of an epidemic, cascading from one individual to another through the network. This suggests a basis for computational models of such phenomena, with the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving.

The second line of research we consider is concerned with the privacy implications of large network datasets. An increasing amount of social network research focuses on datasets obtained by measuring the interactions among individuals who have strong expectations of privacy. To preserve privacy in such instances, the datasets are typically anonymized — the names are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed. Unfortunately, there are fundamental limitations on the power of network anonymization to preserve privacy; we will discuss some of these limitations (formulated in joint work with Lars Backstrom and Cynthia Dwork) and some of their broader implications.
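(An aside from me, not part of the abstract: the anonymization step described above is easy to sketch. The code below replaces names with shuffled integer IDs and shows that a structural fingerprint, the degree sequence, survives untouched. The edge list and function names are made up, and this only illustrates that structure is preserved. It is not the analysis from the joint work mentioned in the abstract.)

```python
import random


def anonymize_edges(edges, seed=None):
    """Replace node names with meaningless unique integer IDs."""
    rng = random.Random(seed)
    names = sorted({name for edge in edges for name in edge})
    ids = list(range(len(names)))
    rng.shuffle(ids)  # so an ID reveals nothing about alphabetical order
    mapping = dict(zip(names, ids))
    return [(mapping[a], mapping[b]) for a, b in edges], mapping


def degree_sequence(edges):
    """Sorted list of how many connections each node has."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree.values(), reverse=True)


# Hypothetical "who talks to whom" edges.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
    ("dave", "erin"),
]

anon_edges, mapping = anonymize_edges(edges, seed=7)
print(anon_edges)
print(degree_sequence(edges) == degree_sequence(anon_edges))  # True
```

The names are gone, but anyone who knows that a particular person has an unusual pattern of connections still has something to look for in the anonymized graph.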

Speaker bio:

Jon Kleinberg is a Professor in the Department of Computer Science at Cornell University. His research interests are centered around issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a Fellow of the American Academy of Arts and Sciences, and the recipient of MacArthur, Packard, and Sloan Foundation Fellowships, the Nevanlinna Prize from the International Mathematical Union, and the National Academy of Sciences Award for Initiatives in Research.

MetaWeb receives $42M investment

Filed under: Data Collection, People and Data, Search — chucklam @ 2:45 am

VentureBeat just reported that MetaWeb Technologies received a $42.4M second round investment. I haven’t had time to fully play around with its Freebase database yet, but they certainly seem to be building a war chest for something.

January 5, 2008

How ‘free’ should data be?

Filed under: Data Collection, Open Data, People and Data, Privacy — chucklam @ 2:25 pm

I just read an article in Wired called “Should Web Giants Let Startups Use the Information They Have About You?” The article is really more about issues surrounding Web scraping and data APIs than what the title suggests. (There’s a lot more data out there than just personal information.)

There are many ways to make data more valuable. One can aggregate them. One can add structure/metadata to them. One can filter them. One can join/link different data types together. It’s unrealistic to think that any one organization has gotten all possible value out of any particular data. Someone can always come along and process the data further to generate additional value.

Now, if that’s the case, then it’s economically inefficient (for society in general) if an organization forbids further use of its data. Unfortunately, the debate today tends to be polarized into the free-use versus no-use camps. One hopes that some kind of market mechanism will emerge as a reasonable middle ground. Today we know so little about the dynamics and the economics of data that it’s not clear what that market will look like and what rules are needed to keep it functioning healthily, but these are issues we need to address rather than throwing around rhetoric about ‘freedom’.

November 28, 2007

UC Irvine to offer a certificate program in web analytics

Filed under: Datamining, People and Data — chucklam @ 3:48 am

Looks like web analytics is growing to be a big enough field for universities to start offering a certificate program in it. A brief description of the UC Irvine program is below. The full press release is here.

The program provides a foundational knowledge of practical Web analysis to experienced marketing and business professionals in light of the changing face of e-commerce. The impetus for the program is a growing need for Web site evaluation spurred by companies challenged in their search for new customers due primarily to an explosive growth in Web-based sales and marketing.
