When doing Web analysis, if one examines only Web content, one needs to keep in mind that the results will be biased toward the viewpoints of the content creators (or the “community” of content creators), which don’t necessarily reflect the viewpoints of the content consumers. If you don’t think there’s a gap between author and audience on the Web, simply do a Google search for “Robert”. Rather than results for famous Roberts like Robert De Niro or Robert Kennedy, the top three results point to this one blogger, who’s somehow popular in the blogosphere but is pretty irrelevant otherwise.
To counter this content-creator bias, one should look at usage in addition to content. Recently I’ve started to see research that shows the improvements one can get by using usage data alongside content data. (I’m specifically addressing technologies outside of collaborative filtering, which already routinely uses usage data.)
Matt Richardson, Amit Prakash, and Eric Brill had a paper at WWW2006 called Beyond PageRank: Machine Learning for Static Ranking, where they developed a better Web ranking system than PageRank through a combination of a novel algorithm and additional data sources. An interesting data source they used is the popularity data that the MSN toolbar collects (with permission) from users. These data reveal how often users actually visit particular sites, rather than how often a site gets linked to. The table below shows the top 10 URLs for PageRank versus fRank (Richardson et al.’s system). While many factors are involved, Web authors and Web users clearly have different biases. Web authors are notably more tech-oriented than the average Web user.
| PageRank | fRank |
| --- | --- |
| google.com | google.com |
| apple.com/quicktime/download | yahoo.com |
| amazon.com | americanexpress.com |
| yahoo.com | hp.com |
| microsoft.com/windows/ie | target.com |
| apple.com/quicktime | bestbuy.com |
| mapquest.com | dell.com |
| ebay.com | autotrader.com |
| mozilla.org/products/firefox | dogpile.com |
| ftc.gov | bankofamerica.com |
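One rough way to put a number on how far apart the two lists are is to compare them at the domain level. The sketch below uses only the URLs from the table above; the helper name `domain` and the choice of Jaccard similarity are my own assumptions for illustration and are not something the paper reports.

```python
# Top-10 lists copied from the table above.
pagerank_top10 = [
    "google.com", "apple.com/quicktime/download", "amazon.com", "yahoo.com",
    "microsoft.com/windows/ie", "apple.com/quicktime", "mapquest.com",
    "ebay.com", "mozilla.org/products/firefox", "ftc.gov",
]
frank_top10 = [
    "google.com", "yahoo.com", "americanexpress.com", "hp.com", "target.com",
    "bestbuy.com", "dell.com", "autotrader.com", "dogpile.com",
    "bankofamerica.com",
]


def domain(url):
    """Return the part of the URL before the first slash (hypothetical helper)."""
    return url.split("/", 1)[0]


pagerank_domains = {domain(u) for u in pagerank_top10}
frank_domains = {domain(u) for u in frank_top10}

# Domains that appear in both lists, and all domains seen in either list.
shared = pagerank_domains & frank_domains
union = pagerank_domains | frank_domains

# Jaccard similarity: size of the overlap divided by size of the union.
jaccard = len(shared) / len(union)

print("Shared domains:", sorted(shared))
print("Jaccard similarity: %.2f" % jaccard)
```

Only google.com and yahoo.com show up on both sides, which is a compact way of seeing how differently authors and users "vote."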
Xuanhui Wang and ChengXiang Zhai will be presenting a paper at SIGIR called Learn from Web Search Logs to Organize Search Results, where they use query log information to cluster search results. The idea of clustering search results has been around for a long time. I can immediately think of Marti Hearst’s scatter/gather system, although there is probably older work than that. Historically the clustering has been done based on the content of the search results. Wang and Zhai instead examine the actual terms that people use to search and come up with cluster categories that are more natural to users. An example they cite is the search for “area codes.” The top three clusters based on content analysis are “city, state”, “local, area”, and “international.” By looking at the query log, Wang and Zhai’s algorithm came up with the following three clusters: “telephone, city, international”, “phone, dialing”, and “zip, postal.” The log-based method much more accurately reflects people’s desire to look up either “phone codes” or “zip codes” when they search for “area codes.” In addition to getting better clusters, Wang and Zhai also found that the query log-based approach generates more meaningful labels for the clusters than the content-based approach.
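To get a feel for why query logs surface user-centric terms, here is a minimal sketch that counts which words co-occur with a query in past searches. This is not Wang and Zhai’s algorithm, which is considerably more involved; it only illustrates the intuition. The query log is invented toy data and `cooccurring_terms` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical query log, invented for illustration; not real search data.
query_log = [
    "telephone area codes",
    "area codes by city",
    "international area codes",
    "phone area codes",
    "zip area codes",
    "postal area codes",
    "area codes phone dialing",
    "zip codes",  # lacks "area", so it is not treated as a related query
]


def cooccurring_terms(query, log):
    """Count terms appearing in logged queries that contain every term of `query`."""
    query_terms = set(query.split())
    counts = Counter()
    for logged in log:
        terms = set(logged.split())
        # Only logged queries that include all of the query's terms are related.
        if query_terms <= terms:
            counts.update(terms - query_terms)
    return counts


counts = cooccurring_terms("area codes", query_log)
for term, n in counts.most_common():
    print(term, n)
```

Even in this toy log, the co-occurring words (phone, zip, postal, and so on) come from what searchers typed rather than from what page authors wrote, which is the kind of signal a log-based method can turn into clusters and labels.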
To be fair, researchers usually have a very hard time getting usage data. It’s notable that both papers cited above used Microsoft data, and one of them was even written by researchers outside of Microsoft. Hopefully more companies will let researchers look at their data and publish their analyses.