Data Strategy

June 30, 2007

Measuring similarity: consider invariance

Filed under: Datamining, Information Retrieval, Search — chucklam @ 6:56 pm

In the last couple of days I happened to talk with different people about automated ways of finding similar content. One was thinking about finding similar documents in a certain text domain (that I can’t reveal yet). Another was just thinking out loud about finding similar images (in a domain that… um… let’s just say makes a lot of money and is consistently an early adopter of video technologies…).

The interesting thing is that both people have some computer science background but no specialty in pattern recognition or statistics. They’ve heard of cases where finding similar documents or images has worked, so they imagine that the general problem has been solved and that applying it to their specific application shouldn’t be difficult.

The reality is much more nuanced than this. It’s quite easy to invent some heuristic measure that calculates the similarity of two documents. You can apply it to your document set, and in fact it does return documents that are similar in some way.

Now, if the heuristic turns out to measure similarity in exactly the way you’ve intended it, then life is great. You give the heuristic some fancy name to impress your boss or the media, and you’re done.

Unfortunately, all too often there’s more than one way to consider similarity, and your measure ends up being influenced by all these other ways. Take text documents for example. People tend to think of similarity as topical similarity, but documents can in fact be similar for other reasons. They may have similar length, and they may have similar style if they’re from the same author, publisher, or genre (news, blogs, academic papers, etc.). Take facial images as another example: two images can be considered similar if they’re taken under similar lighting conditions, even for completely different subjects. Humans have a natural bias to ignore those kinds of similarity, but objectively they’re as legitimate as any other way of considering similarity.

Most of the development work will actually be in tweaking the similarity measure so that it’s invariant to all these other kinds of similarity. You may need to normalize the input (against document length or lighting condition), and you may have to consider a totally different set of input features that are more robust. Successful cases in restricted domains (e.g., finding similar articles within a news site) are certainly good places to start your heuristic, but don’t underestimate the potential amount of work needed to adapt to a related problem domain.
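To make the document-length point concrete, here is a minimal sketch. It assumes a plain bag-of-words representation; the function names and the toy documents are made up for illustration. The raw dot product of term counts grows with document length, while cosine similarity divides that length effect out.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag-of-words term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def dot_product(a, b):
    """Raw overlap score. It grows as either document gets longer."""
    return sum(count * b[term] for term, count in a.items())

def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so scaling a document has no effect."""
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)

# Hypothetical documents: long_doc is short_doc repeated ten times,
# so it has the same topic mix but ten times the length.
short_doc = term_counts("stocks fell as markets closed")
long_doc = term_counts("stocks fell as markets closed " * 10)
query = term_counts("markets and stocks")

print(dot_product(query, short_doc), dot_product(query, long_doc))
print(cosine_similarity(query, short_doc), cosine_similarity(query, long_doc))
```

The unnormalized score prefers the longer document even though it says nothing new; the cosine scores agree up to floating-point rounding. This handles only one kind of unwanted similarity. Style, genre, or lighting would each need their own normalization or a different set of features.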


June 27, 2007

Social tagging and voting was invented at Xerox PARC… 15 years ago

Filed under: Collaborative filtering, Personalization — chucklam @ 4:56 pm

It’s easy to believe that social tagging started with and social voting started with digg. However, Xerox PARC had developed such functions in a system called Tapestry more than 15 years ago. The system was described in a 1992 Communications of the ACM article. (Official ACM link here. A “publicly” available pdf version here. A slide presentation here.) From the article:

“The Tapestry system was designed and built to support collaborative filtering. Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. Such reactions may be that a document was particularly interesting (or particularly uninteresting). These reactions, more generally called annotations, can be accessed by others’ filters.” (Emphasis theirs.)

This paper, in fact, was the first to coin the term ‘collaborative filtering,’ which over the years has evolved to mean the special case of automated recommendation using implicit feedback (i.e. Amazon style). Tapestry was architected for the general case. It assumed that “some annotations are themselves complex objects, and those annotations are more simply stored as separate records with pointers back to the document they annotate.” This design would sound familiar to anyone who has implemented a “modern” social tagging and voting system. See, for example, the design of Askeet.
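The "separate records with pointers back to the document" idea can be sketched in a few lines. This is only an illustration of the general pattern, not Tapestry’s or Askeet’s actual schema; every class and field name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: int
    title: str

@dataclass(frozen=True)
class Annotation:
    doc_id: int   # pointer back to the annotated document
    user: str
    kind: str     # hypothetical convention: "tag" or "vote"
    value: str    # tag text, or "+1" / "-1" for a vote

class AnnotationStore:
    """Keeps annotations apart from documents, indexed by document id."""

    def __init__(self):
        self._by_doc = {}

    def add(self, annotation):
        self._by_doc.setdefault(annotation.doc_id, []).append(annotation)

    def tags_for(self, doc_id):
        annotations = self._by_doc.get(doc_id, [])
        return sorted({a.value for a in annotations if a.kind == "tag"})

    def score_for(self, doc_id):
        annotations = self._by_doc.get(doc_id, [])
        return sum(int(a.value) for a in annotations if a.kind == "vote")

doc = Document(doc_id=1, title="Collaborative filtering at PARC")
store = AnnotationStore()
store.add(Annotation(doc.doc_id, "alice", "tag", "filtering"))
store.add(Annotation(doc.doc_id, "bob", "tag", "parc"))
store.add(Annotation(doc.doc_id, "alice", "vote", "+1"))
store.add(Annotation(doc.doc_id, "bob", "vote", "+1"))
store.add(Annotation(doc.doc_id, "carol", "vote", "-1"))

print(store.tags_for(doc.doc_id), store.score_for(doc.doc_id))
```

Because annotations live in their own records, other users’ filters can query them without touching the documents themselves, which is the property the quoted passage describes.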

It’s interesting to read a paper from 15 years ago and get some historical perspective. Xerox PARC had gotten the skeleton design of Web 2.0 functions before there was Web 1.0! It’s always amusing to read things like this too:

“Filtering on incoming documents is a very computationally intensive task. Imagine a Tapestry system with hundreds of users, each with dozens of filter queries, running on a document stream of tens of documents per minute.”

Yeah… We all need to thank the electrical engineers that make Moore’s law a reality…

June 25, 2007

Real Good: ISP Deleted Advertising

Filed under: Advertising — chucklam @ 10:45 pm

TechCrunch had a post a couple days ago discussing a Texas ISP that inserts advertising into pages its customers visit. That post is called Real Evil: ISP Inserted Advertising. Needless to say, TechCrunch is judging the practice to be “real evil.”

As a thought exercise, what if an ISP does the exact opposite and offers ad blocking as part of its service? Would that make it “good”? This can be done either on the client side with software (e.g. a Firefox plug-in) that comes with an ISP’s installation CD or on the server side. Functionally, the first option would really be no different from the pop-up blockers already included in IE and Firefox. Besides, such an ad-blocker plug-in for Firefox already exists and is free to download on the Web. There’s little incremental cost to the ISP and it can attract more customers that way. This benefits the readers at the expense of publishers. Right now not enough people use ad blockers to worry publishers too much, but what happens when a major ISP supports it? What if it’s part of IE? Microsoft can ostensibly add the feature as a benefit to consumers, but with the real intention of destroying Google’s advertising revenue. They will piss off a lot of other people in the Google ecosystem too, but if they really want to…

June 22, 2007

Getting dumber faster

Filed under: Uncategorized — chucklam @ 4:38 pm

Jeff Jonas has just added a post to his blog on Why Faster Systems Can Make Organizations Dumber Faster. Basically he’s saying that if an organization is getting data faster than it can make sense of it, then the organization is getting dumber relative to its potential. I’m not sure if “dumber” is the right term to describe this phenomenon objectively, but people definitely do perceive organizations this way:

“They had all the information to identify the terrorist, but they still missed him!”

“They already know my mailing address, why do they ask me again?!”

Calling an organization “dumb” is more than just an assessment of its intelligence. It’s also an assignment of blame. When an organization has all the data necessary to make a good decision, people feel that the organization should make a good decision. When an organization does not have enough data, the blame can go to all kinds of external sources.

Just published: 2nd edition of Finn Jensen’s Bayesian Networks and Decision Graphs

Filed under: Bayesian networks, Datamining — chucklam @ 12:15 am

The second edition of Finn Jensen’s Bayesian Networks and Decision Graphs was just published this month. I haven’t read it yet, but I’m a fan of the first edition and Finn’s other book: An Introduction to Bayesian Networks. The books are very accessible and work for self-study.

Looking at the table of contents, the major additions to the 2nd edition are chapters on learning from data, both for parameter estimation and structure learning. Not dealing with learning was a major hole in the first edition. It made the book less useful for people from the machine learning community. Fixing that in the 2nd edition should make this a welcome introduction to Bayesian networks for all practitioners.

More info at their website.

June 21, 2007

Tools for visualizing and analyzing data collaboratively

Filed under: Datamining, Visualization — chucklam @ 12:40 am

Most data analysis, especially in the business world, is done through graphs: people load data into Excel and graph it, looking for interesting and unusual patterns. Swivel is a web site that focuses on doing just that kind of data analysis online. Unlike Excel, it leverages some of the power of the web. For example, people can leave comments on each graph. People can digg the ones they like. Data with a geographic component can be plotted onto online maps. Etc.

However, what got me really excited today about this topic was seeing something else, a screencast of (screencast here and homepage here). It’s always hard to describe visualization tools, so I suggest you just check out the screencast to see what it is.

If you have ever done collaborative data analysis, you’ll know the following drill. You have some data, see some patterns in it after a little exploration, and you create some summary graphs that highlight those patterns. You show those graphs around to either convince your colleagues of your hypothesis or to solicit some ideas. As always, other people will offer alternate hypotheses, give you contextual information that you didn’t know before (“A bug was on that sensor…”), raise new questions, and some may even want to look at the data themselves. So you go back to the data, re-do the graphs/analysis based on the feedback, and iterate the process. This can be time consuming. It’s kind of like collaborating by emailing a document around.

What’s needed is a data graphing tool that’s designed from the ground up to support the collaborative analysis of data, and this is what is. It’s a web-based tool that lets you generate new graphs quickly and interactively. More importantly, it supports collaboration by using well-known web tools such as bookmarking, commenting, and linking. To describe it in another way, is useful for data analysis like version control is useful for programming.

The developer of is Jeffrey Heer, a Ph.D. candidate at UC Berkeley. He’ll be talking about the system in a Yahoo Brain Jam session in Berkeley on June 29. More info here. I don’t know if I’ll have time to drive up to Berkeley for the talk yet, but we’ll see…

June 19, 2007

Upcoming datamining conferences in the Bay Area

Filed under: Datamining — chucklam @ 6:00 pm

Just thought to note two of the upcoming conferences in the Bay Area on datamining. They should be interesting to the large number of web-related companies in the Bay Area in which data analysis is a main competitive advantage.

Knowledge Discovery and Data Mining (KDD’07) will be at the Fairmont Hotel in San Jose on August 12th to 15th. The conference is in its 13th year and has become a major venue for bridging ideas between academia and industry. It will have a number of interesting workshops this year. There’s one workshop on Data Mining and Audience Intelligence for Advertising (ADKDD), which deals with data mining and online advertising. Another workshop is on Web Mining and Social Network Analysis. The KDD Cup data mining competition this year is related to the Netflix prize. There are a lot more interesting workshops and tutorials and papers, so keep your eyes on the conference web site.

The First International Conference on Web Search and Data Mining (WSDM) is a brand new conference, to be held at Stanford on Feb. 11 and 12, 2008. While technically the conference is “new,” it really is an extension of the longstanding World Wide Web (WWW) conference, which has grown a bit too big trying to accommodate all the web-related research areas. The organizers of WSDM are some of the most accomplished researchers in the search engine world, and there are high expectations that the conference will be off to a great start. The submission deadline is July 30, 2007.

June 18, 2007

Culture and language on the web

Filed under: People and Data, Search — chucklam @ 6:29 pm

Recently there have been a couple of blog posts wondering about the effects of culture and language on search behavior:

These questions are certainly fascinating, and there should definitely be more formal studies on them. However, so far it has been content, rather than search behavior, where regional, cultural, and linguistic differences have had the bigger impact. I’ll give a couple examples.

When Google first commercialized a link-based ranking system (PageRank), it took advantage of non-textual information and was thus language independent. Link-based ranking improved English search quite a bit, but the improvement in search for other European languages was much more dramatic. Search engines at that time were so focused on the U.S. market that they didn’t bother to optimize their text-based ranking systems for other languages, or they made only simple, small changes (e.g. a different list of stop words). And since their portal strategy ended up competing with local players, they were also much less effective at using their portals as a distribution channel for search. By using language-independent link-based ranking and partnering with local portals, Google took Europe by storm and has continued its dominance to this day.
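To see why link-based ranking is language independent, here is a minimal sketch of the textbook PageRank power iteration. It is a simplified illustration, not Google’s production system; the damping value of 0.85, the iteration count, and the toy page names are all assumptions. Notice that the function never looks at any page text, only at who links to whom.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Only the link graph is used; page content is never read.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical German-named pages: the algorithm works unchanged.
graph = {
    "seite-a": ["seite-c"],
    "seite-b": ["seite-c"],
    "seite-c": ["seite-a"],
    "seite-d": ["seite-c"],
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

In this toy graph the page with the most incoming links ends up ranked highest, with no stop-word list or stemmer in sight. A text-based ranker, by contrast, would need those tuned per language.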

Yahoo was successful in Korea for a long time, until it was dethroned by a local newcomer called Naver. Did Naver come up with a better search engine? No. What Naver realized was that search really didn’t matter much in Korea. Korea has a relatively small population for a country and its language is not spoken anywhere else. There simply isn’t much Korean web content to search. Instead, Naver created a forum where people can answer questions posted by other people. This solved people’s problem of finding information (rather than just search) and propelled Naver to the number one spot. Yahoo learned its lesson and rolled out its own Yahoo Answers in other countries, including the U.S. Yahoo Answers hasn’t solved all of Yahoo’s problems, but it is considered one of the more successful Yahoo products in recent memory.

Do you know of other examples? Share them in the comments area!

A case study: Analyzing terabytes of click stream data

Filed under: Uncategorized — chucklam @ 12:56 pm

In the process of fixing a broken link on my previous post, I found another interesting webinar hosted by MySQL. This one is a case study on how to analyze terabytes of click stream data to improve advertising effectiveness. Being a database-centric talk, the focus will be on performance rather than statistics.

This webinar will present a case study of how a large online ad network uses MySQL and the Infobright storage engine to analyze web log files with the goal of improving campaign effectiveness.

In less than 1 TB the system houses up to 9 TB of data collected every 6 months. More than 100 million rows of data are collected per day. The data includes details such as viewed ads, ads acted on, geographical location and demographics information. Using this information analytics are used to determine the attributes of a user with the goal of finding out which users should be presented with which ads to maximize clicks and conversions.

The queries executed on this system are extremely complex. Billions of rows of data are compared to determine what the unique attributes of a user are. This talk will examine these queries and discuss how they can be executed with excellent performance.
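To get a feel for the numbers in that abstract, here is a back-of-the-envelope calculation. It is only a rough sketch: it assumes six months is about 182 days, that a TB means 10^12 bytes, and it treats the quoted "up to" and "more than" figures as if they were exact.

```python
ROWS_PER_DAY = 100_000_000   # "more than 100 million rows" per day, taken as exact
DAYS = 182                   # assumption: roughly six months
RAW_BYTES = 9 * 10**12       # "up to 9 TB", assuming decimal terabytes
STORED_BYTES = 1 * 10**12    # "less than 1 TB", taken as an upper bound

rows = ROWS_PER_DAY * DAYS
raw_bytes_per_row = RAW_BYTES / rows
min_compression_ratio = RAW_BYTES / STORED_BYTES

print(f"rows in six months: {rows:,}")
print(f"raw bytes per row:  {raw_bytes_per_row:.0f}")
print(f"compression ratio:  at least {min_compression_ratio:.0f}:1")
```

Under these assumptions that is on the order of 18 billion rows at a few hundred bytes each before compression, which is consistent with the abstract’s "billions of rows," and a compression ratio of roughly 9:1 or better.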

The webinar is on June 21 at 10am PDT. Register for it here.

June 16, 2007

Collective intelligence in Word 2007 spell checker

Filed under: Collective wisdom, Data Collection, Open Data — chucklam @ 5:30 pm

Via Gregor Hochmuth thru O’Reilly Radar, pointing out the use of collective intelligence in improving Word 2007’s spell checker:

I thought you might enjoy this: When I was closing Word 2007 today, I was surprised to see the attached dialog pop-up. Microsoft’s new spell checker asked me whether it could transmit certain unknown words and phrases that I used in the last several weeks.

Among them are choice examples like Wikipedia, Gladwell, shortcode and others — words that were certainly not in the original distributable. I assume Microsoft will re-distribute the most frequently submitted words in an upcoming spell checker update. Brilliant! And it reminds of the way in which Google first introduced its “Did you mean…?” feature– by tracking how users corrected their own spelling mistakes before re-trying a search.

Tim O’Reilly notes that a distinguishing feature of Web 2.0 apps (“Live Software”) is that they improve organically when more people use them more often. That is, the apps are architected to grow on usage data. This is certainly an idea I’ll have more to say about in the future.
