Data Strategy

August 1, 2008

“Collaborative filtering” helps drive Digg usage

Filed under: Datamining, Information Retrieval, People and Data, Personalization — chucklam @ 3:25 am

Digg released their “collaborative filtering” system a month ago. Now they’ve blogged about some of the initial results. While it’s an obviously biased point of view, things look really good in general.

  • “Digging activity is up significantly: the total number of Diggs increased 40% after launch.”
  • “Friend activity/friends added is up 24%.”
  • “Commenting is up 11% since launch.”

What I find particularly interesting here is the historical road that “collaborative filtering” has taken. The term “collaborative filtering” was first coined by Xerox PARC more than 15 years ago. Researchers at PARC had a system called Tapestry. It allowed users to “collaborate to help one another perform filtering by recording their reactions to documents they read.” This, in fact, was a precursor to today’s Digg and Delicious.

Soon after PARC created Tapestry, automated collaborative filtering (ACF) was invented. The emphasis was to automate everything and make its usage effortless. Votes were implied by purchasing or other behavior, and recommendations were computed in a “people like you have also bought” style. This style of recommendation was so successful at the time that it has completely taken over the term “collaborative filtering” ever since.
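To make the “people like you have also bought” idea concrete, here is a minimal sketch that treats each purchase as an implicit vote and recommends the items most often bought by the same users. The purchase data and the function names (`build_cooccurrence`, `also_bought`) are made up for illustration, and simple co-purchase counting is only one of many ways ACF has been done; it is not the method of any particular system mentioned here.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """For every item, count how often each other item was bought by the same user."""
    counts = defaultdict(Counter)
    for items in baskets.values():
        # Each ordered pair of distinct items in one user's history is one implicit "vote".
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def also_bought(counts, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in counts[item].most_common(k)]


# Hypothetical purchase histories: user -> items bought.
purchases = {
    "alice": ["book", "lamp", "pen"],
    "bob": ["book", "lamp"],
    "carol": ["book", "pen", "mug"],
    "dave": ["lamp", "mug"],
}

counts = build_cooccurrence(purchases)
print(also_bought(counts, "book", k=2))
```

Note that nobody had to rate anything explicitly; the recommendations fall out of behavior that was recorded anyway, which is what made this style effortless for users.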

In the Web 2.0 wave, companies like Digg and Delicious revived the Tapestry style of collaborative filtering. (Although I’d be surprised if those companies had done so as a conscious effort.) They were in a sense stripped-down versions of Tapestry, blown up to web scale, and made extremely easy to use. (The original Tapestry required one to write database-like queries.)

Now Digg, which one can think of as Tapestry 2.0, is adding ACF back into its style of recommendation and getting extremely positive results. Everything seems to have moved forward, and at the same time it seems to have come full circle.


November 19, 2007

IEEE Computer special issue on search

Filed under: Advertising, Collective wisdom, People and Data, Personalization, Search — chucklam @ 6:08 pm

I’m quite behind on a lot of my reading, so I only got around to reading the IEEE Computer’s (August) special issue on search this weekend. (Abstracts are free but actual PDFs require an expensive subscription or an expensive purchase.) It includes the following articles:

  • Search Engines that Learn from Implicit Feedback
  • A Community-Based Approach to Personalizing Web Search
  • Sponsored Search: Is Money a Motivator for Providing Relevant Results?
  • Deciphering Trends in Mobile Search
  • Toward a PeopleWeb

The articles were written by a mix of university academics and researchers from Google and Yahoo. They seem targeted at giving the general practitioner a sampling of some of the current research, rather than being comprehensive in any specific domain or deep in a particular research area.

For me, the most interesting article is “Search Engines that Learn from Implicit Feedback” by Thorsten Joachims and Filip Radlinski of Cornell University. It’s a very accessible summary of the research those two have been doing in the last few years. To start off their research, they used eye-tracking experiments to characterize how people react to search engine rankings. They found that the ranking order strongly biases what people view and therefore click on. A result in the top ranking will be clicked on more often than a better result in the second or third ranking, as some users may not even have looked at the results beyond the first ranking. A straightforward assumption that a click is the equivalent of a positive vote is therefore naive. Instead, they examine results that were not clicked on even though results ranked below them were. For example, if results at ranking 3 and 4 are clicked on, but not the result at ranking 2, then one can be sure that the result at ranking 2 is worse than the ones at ranking 3 and 4 and can use that knowledge to improve the search engine. Note that if the result at ranking 1 was clicked on, nothing new is learned. People are so biased towards clicking the first result that only if it was not clicked on would that be considered informative.

Under that model, they can interleave the results from two different search engines (or algorithms) and evaluate which one is better based on users’ clickthroughs. This insight led them to develop a ranking SVM model to learn search engine rankings. The new algorithm was shown to create a better meta-search engine as well as a better domain-specific search engine.
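The click-interpretation rule described above (a clicked result beats any unclicked result ranked above it) can be sketched in a few lines. This is only an illustration of the rule as summarized in this post, not the authors’ code; the function name `preferences_from_clicks` and the result ids are assumptions made up for the example.

```python
def preferences_from_clicks(ranking, clicked):
    """Derive (preferred, less_preferred) pairs from one page of results.

    ranking: result ids in the order they were shown (index 0 is ranking 1).
    clicked: set of result ids the user clicked on.
    """
    pairs = []
    for pos, result in enumerate(ranking):
        if result not in clicked:
            continue
        # Every unclicked result shown above a clicked one is judged worse than it.
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                pairs.append((result, skipped))
    return pairs


ranking = ["r1", "r2", "r3", "r4", "r5"]

# Rankings 3 and 4 clicked, ranking 2 skipped: r2 is judged worse than both.
print(preferences_from_clicks(ranking, {"r1", "r3", "r4"}))

# Only the top result clicked: nothing new is learned.
print(preferences_from_clicks(ranking, {"r1"}))
```

Pairs like these are relative judgments rather than absolute votes, which is the kind of input a pairwise learner such as a ranking SVM can be trained on.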

November 9, 2007

List of accepted papers for WSDM’08

The first ACM conference on Web Search and Data Mining (WSDM), to be held at Stanford on Feb. 11-12, has released its list of accepted papers. A total of 24 papers will be presented. The following ones already sound interesting based on their titles.

  • An Empirical Analysis of Sponsored Search Performance in Search Engine Advertising – Anindya Ghose and Sha Yang
  • Ranking Web Sites with Real User Traffic – Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini and Alessandro Vespignani
  • Identifying the Influential Bloggers – Nitin Agarwal, Huan Liu, Lei Tang and Philip Yu
  • Can Social Bookmarks Improve Web Search? – Paul Heymann, Georgia Koutrika and Hector Garcia-Molina
  • Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? – Qiaozhu Mei and Kenneth Church

October 18, 2007

How Acxiom uses offline data for behavioral targeting

Filed under: Advertising, Data Collection, Datamining, People and Data, Personalization — chucklam @ 12:37 am

A really fascinating piece at the WSJ yesterday: Firm Mines Offline Data To Target Online Ads (subscription req.). There’s a particular side-bar that reveals how Acxiom uses offline data for behavioral targeting:

How Acxiom delivers personalized online ads:

  1. Acxiom has accumulated a database of about 133 million households and divided it into 70 demographic and lifestyle clusters based on information available from… public sources.
  2. A person gives one of Acxiom’s Web partners his address by buying something, filling out a survey or completing a contest form on one of the sites.
  3. In an eyeblink, Acxiom checks the address against its database and places a “cookie,” or small piece of tracking software, embedded with a code for that person’s demographic and behavioral cluster on his computer hard drive.
  4. When the person visits an Acxiom partner site in the future, Acxiom can use that code to determine which ads to show…
  5. Through another cookie, Acxiom tracks what consumers do on partner Web sites…
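Steps 2 through 4 amount to a lookup followed by a cookie write and a later cookie read. Here is a rough sketch of that flow. Everything in it is hypothetical: the addresses, the cluster codes, the ad names, and the function names are invented for illustration, and nothing here reflects how Acxiom actually implements it.

```python
# All data and names below are hypothetical illustrations.
HOUSEHOLD_CLUSTERS = {
    # normalized address -> cluster code (the article mentions 70 clusters)
    "12 oak st, springfield": 17,
    "9 elm ave, riverton": 42,
}
ADS_BY_CLUSTER = {17: "minivan ad", 42: "luxury travel ad"}
DEFAULT_AD = "generic ad"


def normalize(address):
    """Lowercase and collapse whitespace so lookups tolerate sloppy form input."""
    return " ".join(address.lower().split())


def cookie_for_address(address):
    """Steps 2-3: match the address against the database and build a cookie payload.

    Returns None when the household is not in the database.
    """
    cluster = HOUSEHOLD_CLUSTERS.get(normalize(address))
    return None if cluster is None else {"cluster": cluster}


def choose_ad(cookie):
    """Step 4: pick an ad using only the cluster code stored in the cookie."""
    if not cookie:
        return DEFAULT_AD
    return ADS_BY_CLUSTER.get(cookie["cluster"], DEFAULT_AD)


cookie = cookie_for_address("12  Oak St, Springfield")
print(choose_ad(cookie))
print(choose_ad(cookie_for_address("1 Unknown Rd, Nowhere")))
```

In this sketch the cookie carries only a cluster code, so the partner site serving the ad never needs to see the address itself.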

It’s an interesting approach. While Acxiom has offline information on “133 million households,” it’s not clear how many households actually have gotten the offline-tracking cookie.

One can imagine Google really taking advantage of an approach like this. The widespread use of Google Analytics across web sites already gives Google the potential to track your surfing habits. Specifying your home address when you use Google Maps or Checkout allows Google to match you with offline marketing databases. And we haven’t even talked about your query data and your emails yet…

September 25, 2007

Social networks turning to targeted advertising

Filed under: Advertising, Personalization — chucklam @ 5:04 am

I saw a number of articles in the last couple months on how Myspace and Facebook are turning to personalized ad targeting. Just thought to note them here.

September 21, 2007

Google to enable access to its social graph data

According to a TechCrunch post:

Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.

Not much info yet, but can’t wait…

September 8, 2007

AdSense versus just begging your users for money

Filed under: Advertising, Personalization — chucklam @ 4:46 pm

Last December I built a little web site for fun. The site is a simple way of automatically generating HTML/CSS code for creating rounded corners on a web page design. In addition to being a fun project, I also used the site as an opportunity to learn about several advertising/monetization schemes.

The site gets between 400 and 900 pageviews per day, with an average of 740 and a grand total of 205,000 to date. (As far as monetization is concerned, the site is really just a one-page site.) I have various monetization schemes on the page including Google AdSense, Amazon Omakase, some affiliate marketing run by Commission Junction, and just plain begging for a $5 “fee.” (I ran into trouble with the Google Checkout police when I called it a “contribution,” but that’s another story.) The affiliate marketing is chosen by me so it’s highly relevant, but it’s gotten no revenue at all in the last 10 months 😦 The Amazon Omakase program is Amazon’s contextual advertising/referral program. You can think of it as AdSense with “behavioral targeting,” except that it only promotes products on Amazon and you get paid by referrals instead of clicks. Personally I’m impressed with how well targeted the Amazon ads are, but they’re quite poor in terms of monetization. People don’t seem to be clicking on them much and so far I’ve only had one who ended up purchasing something. I don’t know why such well targeted ads can do so poorly. My hope is that it’s just specific to my site’s audience. Maybe they already got their web design book or whatever they need from Amazon and have no need to purchase more…

The surprising thing is that AdSense has so far only made a few more dollars than begging. Begging has gotten me $120 in the last 10 months, and the users have to scroll all the way down the web page and read the details to even know that I’m begging for $5. A straightforward calculation would say that I’ll lose half my revenue if I take out all the ads, but my sense is that a lot more users will feel more comfortable paying the $5 “fee” when they don’t see any ads on the page. Will that completely compensate for the loss of AdSense revenue? I don’t know. Maybe, maybe not, but I do feel that users’ contributions are a better “quality” income than advertising revenue. In fact, I think AdSense advertising is so bad on my site that I’m surprised anyone has clicked on the ads at all, even though somehow there were 1500 clicks. Of course, in the grand scheme of things, the financial impact is so little that it’s not worth strategizing over.

Anyways, that’s one data point from me. Anyone else want to share their experience?

August 8, 2007

Early results on Google’s personalized search

Filed under: Information Retrieval, Personalization, Privacy, Search — chucklam @ 4:29 pm

Read/WriteWeb has a couple of interesting posts on Google’s personalized search. One is a primer written by Greg Linden. In the primer, Greg explains what personalized search is, the motivation for it, Google’s personalization technology from its Kaltix acquisition, and other approaches to personalization. I agree with one of the commenters that Kaltix’s technology probably plays only a small role in Google’s personalized search today, as that technology was invented many years ago.

The other post on Read/WriteWeb discusses the findings of a poll in which R/WW asked its readers how effective Google’s personalized search is. Their conclusion is that it is quite underwhelming. 48% “haven’t noticed any difference.” Although 12% claim their “search results have definitely improved,” 9% think their search results “have gotten worse.”

I would venture to guess that R/WW readers are more Web savvy and use search engines more often than the average person. Thus Google would have a lot more data about them and give them the most personalized results. (Yes, it’s only a conjecture. They may in fact be so savvy that they use anonymizing proxies, multiple search engines, etc.) If that’s the case, then even more than 48% of the general population wouldn’t “notice any difference” as Google would not be as aggressive in giving them personalized results.

Google has historically justified the storage of personal information as a necessity for providing personalized services. If they’re unable to demonstrate significant benefits from personalization, then there’s more ammunition for privacy advocates seeking to restrict Google’s data collection practices.

As a user I’m not impressed with Google’s personalized search, and I think the use cases for personalized search that Google and others have talked about are usually too contrived anyways, but I believe it’s still too early to jump to conclusions. Like most R/WW readers, I’m quite savvy with search. I’m pretty good at specifying queries to find what I’m looking for. (And search results that I’m not happy with are almost never due to a lack of personalization.) I know I should search for ‘java indonesia’ if I want to find out about the island and ‘java sdk’ if I’m looking for the Java development kit. If I live in Texas, I won’t necessarily be impressed if searching for ‘paris’ gives me results for Paris, Texas instead of Paris in France. Of course, that’s just me. Other people may be different… or maybe they’re not.

August 7, 2007

Advertising’s digital future

Filed under: Advertising, Datamining, Personalization, Statistical experimentation — chucklam @ 12:23 pm

The New York Times yesterday had an article on advertising’s digital future. It mostly discussed the view of David W. Kenny, chairman and chief executive of Digitas, the advertising agency in Boston that was acquired by the Publicis Groupe for $1.3 billion six months ago.

The plan is to build a global digital ad network that uses offshore labor to create thousands of versions of ads. Then, using data about consumers and computer algorithms, the network will decide which advertising message to show at which moment to every person who turns on a computer, cellphone or — eventually — a television.

“Our intention with Digitas and Publicis is to build the global platform that everybody uses to match data with advertising messages,” Mr. Kenny said.

That is, advertising in the future will be much more data driven. Now, if we take that vision for granted, then the interesting question will be: who will end up controlling what data? No doubt Mr. Kenny would love to see advertising agencies being the central gateway, if not the outright owner, of all such data. However, privacy advocates, media companies, new “intermediaries”, and search engines like Google all have different ideas about their ownership of data and their place in this advertising future. It’s too early to tell how things will turn out, and everyone is making educated guesses.

“How do we see Google, Yahoo and Microsoft? It’s important to see that our industry is changing and the borders are blurring, so it’s clear the three of those companies will have a huge share of revenues which will come from advertising,” said Maurice Lévy, chairman and chief executive of the Publicis Groupe.

“But they will have to make a choice between being a medium or being an ad agency, and I believe that their interest will be to be a medium,” he added. “We will partner with them as we do partner with CBS, ABC, Time Warner or any other media group.”

I wonder if Mr. Lévy has considered the possibility that in this digital future, Google may in fact be CBS, ABC, and Time Warner combined.

August 1, 2007

E-commerce benefiting from new recommendation systems

Filed under: Collaborative filtering, People and Data, Personalization — chucklam @ 2:33 pm

WSJ had an article yesterday, “We Know What You Ought To Be Watching This Summer” (subscription required), on how e-commerce sites are benefiting from deploying a new generation of recommendation systems. These new systems try to recommend products based on a sense of your “taste” that may not be obvious from statistics alone. These systems are working and providing concrete business results.

Since adding the software, Blockbuster says it has lost fewer customers, in percentage terms, to rival services, and the number of movies in the average customer’s “to watch” list has grown by almost 50%. is using these new systems in its email advertising to its customers.

The targeted emails have increased the rate at which email recipients go on to make a purchase between 25% and 50%, says [ CEO] Mr. Byrne.

Sucharita Mulpuru, a senior analyst at Forrester Research, claims that companies that implement these recommendation systems usually see at least a 10% bump in sales.
