Data Strategy

October 31, 2008

Stanford engineering classes online

Filed under: Datamining, Information Retrieval — chucklam @ 3:06 am

Just found out that some of my favorite Stanford engineering classes have published their videos online for the general public. Readers of this blog may be especially interested in Chris Manning’s class on Natural Language Processing and Andrew Ng’s class on Machine Learning. In addition, having been an EE person myself, I can highly recommend Steve Boyd’s classes for those who can deal with more advanced math: he teaches Introduction to Linear Dynamical Systems and Convex Optimization I/II.

A few more classes are also available. Go to the course listing page here.

August 1, 2008

“Collaborative filtering” helps drive Digg usage

Filed under: Datamining, Information Retrieval, People and Data, Personalization — chucklam @ 3:25 am

Digg released their “collaborative filtering” system a month ago. Now they’ve blogged about some of the initial results. While it’s an obviously biased point of view, things look really good in general.

  • “Digging activity is up significantly: the total number of Diggs increased 40% after launch.”
  • “Friend activity/friends added is up 24%.”
  • “Commenting is up 11% since launch.”

What I find particularly interesting here is the historical road that “collaborative filtering” has taken. The term “collaborative filtering” was first coined by Xerox PARC more than 15 years ago. Researchers at PARC had a system called Tapestry. It allowed users to “collaborate to help one another perform filtering by recording their reactions to documents they read.” This, in fact, was a precursor to today’s Digg and Delicious.

Soon after PARC created Tapestry, automated collaborative filtering (ACF) was invented. The emphasis was to automate everything and make its usage effortless. Votes were implied by purchasing or other behavior, and recommendations were computed in a “people like you have also bought” style. This style of recommendation was so successful that it has effectively owned the term “collaborative filtering” ever since.
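
To make the ACF idea concrete, here is a minimal sketch of “people like you have also bought” recommendation. The toy purchase data and the cosine-similarity scoring are my own illustrative choices, not a description of any particular production system.

    from collections import defaultdict
    from math import sqrt

    # Toy purchase history: user -> set of items bought (implicit "votes").
    purchases = {
        "alice": {"book_a", "book_b", "book_c"},
        "bob":   {"book_b", "book_c", "book_d"},
        "carol": {"book_a", "book_d"},
    }

    def similarity(u, v):
        """Cosine similarity between two users' purchase sets."""
        overlap = len(purchases[u] & purchases[v])
        return overlap / (sqrt(len(purchases[u])) * sqrt(len(purchases[v])))

    def recommend(user, k=2):
        """Score items owned by similar users that this user hasn't bought yet."""
        scores = defaultdict(float)
        for other in purchases:
            if other == user:
                continue
            sim = similarity(user, other)
            for item in purchases[other] - purchases[user]:
                scores[item] += sim
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(recommend("alice"))  # -> ['book_d']

The point of the contrast with Tapestry is that none of this requires any explicit effort from the user: the votes are implied entirely by behavior.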

In the Web 2.0 wave, companies like Digg and Delicious revived the Tapestry style of collaborative filtering. (Although I’d be surprised if those companies had done so as a conscious effort.) They were, in a sense, stripped-down versions of Tapestry, blown up to web scale and made extremely easy to use. (The original Tapestry required one to write database-like queries.)

Now Digg, which one can think of as Tapestry 2.0, is adding ACF back into its style of recommendation and getting extremely positive results. Everything seems to have moved forward, and at the same time it seems to have come full circle.

November 9, 2007

List of accepted papers for WSDM’08

The first ACM conference on Web Search and Data Mining (WSDM), to be held at Stanford on Feb. 11-12, has released its list of accepted papers. A total of 24 papers will be presented. The following ones already sound interesting based on their titles.

  • An Empirical Analysis of Sponsored Search Performance in Search Engine Advertising – Anindya Ghose and Sha Yang
  • Ranking Web Sites with Real User Traffic – Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini and Alessandro Vespignani
  • Identifying the Influential Bloggers – Nitin Agarwal, Huan Liu, Lei Tang and Philip Yu
  • Can Social Bookmarks Improve Web Search? – Paul Heymann, Georgia Koutrika and Hector Garcia-Molina
  • Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? – Qiaozhu Mei and Kenneth Church

November 7, 2007

MySpace looking for a search architect

Filed under: Information Retrieval, Search — chucklam @ 2:46 pm

Not sure if I should read too much into this, but MySpace is currently looking for a search architect. The candidate will have “years of experience and expertise in high volume storage, indexing, and searching to help architect, scale, and optimize the engines and API’s behind vertical search applications. Knowledge of search technologies such as Lucene, Lucene.Net and Xapian is important…”

October 8, 2007

Jimmy Wales on “Free Culture and the Future of Search”

Filed under: Collective wisdom, Information Retrieval, Open Data, Search — chucklam @ 2:10 am

Just got word that Jimmy Wales, founder of Wikipedia and Wikia, will be speaking at The Stanford Law and Technology Association this Thursday (Oct. 11) on the topic of “Free Culture and the Future of Search.” The talk will be in Room 190, Stanford Law School, at 12:45pm. Worth checking out if you’re in the area.

September 6, 2007

Spelling corrector that learns from query logs

Filed under: Information Retrieval — chucklam @ 1:59 am

I had previously written a post pointing to a Peter Norvig (Director of Research at Google) article on how to write a statistical spelling corrector. My post noted that Peter’s article didn’t explain how search engines such as Google learn spelling suggestions from their query logs. Well, it turns out that Silviu Cucerzan and Eric Brill over at Microsoft had already published a great paper at EMNLP 2004 called Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users (pdf). It explains some of the unique challenges of spelling correction for search queries. For example, new and unique (but correct) terms become query terms all the time, so one can’t just construct a dictionary of correct spellings. Simple frequency counting doesn’t work either, as certain misspelled queries (“britny spears”) occur very often, and a misspelled query may be composed entirely of correctly spelled terms (“golf war”). Fortunately, Silviu and Eric show how clever use of the query log can overcome these and other problems.
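
To see why a plain dictionary-plus-frequency corrector falls short, here is a minimal sketch of the query-log idea: treat the log itself as the vocabulary and iteratively move a query toward a much more frequent neighbor within one edit. The toy log, the frequency-gain threshold, and the single-edit candidate generator are my own simplifications, not the algorithm from the Cucerzan and Brill paper.

    # Toy query log: query string -> how many times it was issued.
    query_log = {
        "britney spears": 5000,
        "britny spears": 800,    # frequent, yet still a misspelling
        "brittany spears": 300,
        "gulf war": 2000,
        "golf war": 40,          # every word is correctly spelled, yet wrong as a query
    }

    def edits1(s):
        """All strings one edit (delete, transpose, replace, insert) away from s."""
        letters = "abcdefghijklmnopqrstuvwxyz "
        splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(query, min_gain=3.0, max_iters=3):
        """Iteratively move toward a neighbor that is much more frequent in the log."""
        current = query
        for _ in range(max_iters):
            freq = query_log.get(current, 0)
            candidates = [q for q in edits1(current)
                          if query_log.get(q, 0) > min_gain * max(freq, 1)]
            if not candidates:
                break
            current = max(candidates, key=lambda q: query_log[q])
        return current

    print(correct("britny spears"))  # -> 'britney spears'
    print(correct("golf war"))       # -> 'gulf war'

The iteration matters: a badly mangled query can hop through an intermediate, moderately frequent spelling before reaching the dominant one.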

August 14, 2007

Testing semantic search

Filed under: Information Retrieval, Search — chucklam @ 2:25 am

The Hakia blog has an interesting post on creating query sets to test semantic (i.e. “natural language”) search. The main point is that the query set must have sufficient coverage to test four aspects of natural language search: query type, query length, content type, and word sense disambiguation.

Testing of a semantic search engine requires at least 330 queries just to scratch the surface. hakia’s internal fitness tests, for example, use a couple thousand queries. Therefore, if you see any report or article evaluating a search engine with only a dozen queries, even if it includes valuable insight, it will tell you nothing about the overall state of that search engine.

The distribution of test queries will ultimately be based on real usage, which of course is limited at this point. In general, the more sophisticated the function a system is supposed to perform, the more test cases one needs to evaluate it. For something as sophisticated as natural language search (or even plain keyword search), testing is definitely non-trivial.
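
To see how quickly the required number of test queries multiplies, here is a toy coverage grid over the four aspects. The specific values per aspect are my own guesses for illustration, not hakia’s actual test dimensions.

    from itertools import product

    # Hypothetical values for each aspect (assumed for illustration only).
    aspects = {
        "query_type":   ["factual", "navigational", "how-to", "opinion"],
        "query_length": ["2-3 words", "4-6 words", "full sentence"],
        "content_type": ["news", "reference", "commerce", "forum"],
        "ambiguity":    ["unambiguous", "needs word-sense disambiguation"],
    }

    # One cell per combination of aspect values: 4 * 3 * 4 * 2 = 96 cells.
    cells = list(product(*aspects.values()))
    print(len(cells))       # 96
    print(len(cells) * 4)   # 384 queries at a modest 4 queries per cell

Even with only a handful of values per aspect and a few queries per cell, the total lands in the same ballpark as the hundreds of queries mentioned above.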

Powerset’s first public demo was limited in all four of those aspects. Of course, keep in mind that the demo was just to give the public a peek at Powerset’s search technology, not an exhaustive evaluation of it. It was also meant to illustrate the difference between Powerset’s approach and Google’s approach, not to compare and contrast Powerset with Hakia. Personally, I’m just glad that both startups are starting to reveal some of their inner workings.

August 8, 2007

Early results on Google’s personalized search

Filed under: Information Retrieval, Personalization, Privacy, Search — chucklam @ 4:29 pm

Read/WriteWeb has a couple of interesting posts on Google’s personalized search. One is a primer written by Greg Linden. In the primer, Greg explains what personalized search is, the motivation for it, Google’s personalization technology from its Kaltix acquisition, and other approaches to personalization. I agree with one of the commenters that Kaltix’s technology probably plays only a small role in Google’s personalized search today, as that technology was invented many years ago.

The other post on Read/WriteWeb discusses the findings of a poll in which R/WW asked their readers how effective Google’s personalized search is. Their conclusion is that it is quite underwhelming: 48% “haven’t noticed any difference.” Although 12% claim their “search results have definitely improved,” 9% think their search results “have gotten worse.”

I would venture to guess that R/WW readers are more Web-savvy and use search engines more often than the average person. Thus Google would have a lot more data about them and could give them more heavily personalized results. (Yes, it’s only a conjecture. They may in fact be so savvy that they use anonymizing proxies, multiple search engines, etc.) If that’s the case, then even more than 48% of the general population wouldn’t “notice any difference,” as Google would not be as aggressive in personalizing their results.

Google has historically justified the storage of personal information as a necessity for providing personalized services. If it can’t demonstrate significant benefits from personalization, then privacy advocates have that much more ammunition for restricting Google’s data collection practices.

As a user I’m not impressed with Google’s personalized search, and I think the use cases for personalized search that Google and others have talked about are usually too contrived anyway. Still, I believe it’s too early to jump to conclusions. Like most R/WW readers, I’m quite savvy with search and pretty good at specifying queries to find what I’m looking for. (Search results that I’m not happy with are almost never due to a lack of personalization.) I know I should search for ‘java indonesia’ if I want to find out about the island and ‘java sdk’ if I’m looking for the Java development kit. If I lived in Texas, I wouldn’t necessarily be impressed if searching for ‘paris’ gave me results for Paris, Texas instead of Paris, France. Of course, that’s just me. Other people may be different… or maybe they’re not.

July 27, 2007

From information extraction to semantic representation

Filed under: Information Retrieval, Search — chucklam @ 6:14 pm

I suppose it’s a huge coincidence that on the day I posted videos of Powerset’s first public demo, I got this workshop announcement in my inbox:

MASTERING THE GAP
From Information Extraction to Semantic Representation

http://tev.itc.it/mtg2007.html

Automating the process of semantic annotation of content objects is a crucial step for bootstrapping the Semantic Web. This process requires a complex flow of activities which combines competences from different areas. The Workshop will focus precisely on the interface between the information extracted from content objects (e.g., using methods from NLP, image processing, text mining, etc.) and the semantic layer in which this information is explicitly represented in the form of ontologies and their instances. The workshop will provide an opportunity for: discussing adequate methods, processes (pipelines) and representation formats for the annotation process; reaching a shared understanding with respect to the terminology in the area; discussing the lessons learned from projects in the area; and putting together a list of the most critical issues to be tackled by the research community to make further progress in the area.

Some of the possible challenges to be discussed at the workshop are:

  • How can ontological/domain knowledge be fed back into the extraction process?
  • How can the semantic layer be extended by the results of information extraction (e.g. ontology learning)?
  • What are the steps of an annotation process, which steps can be standardized for higher flexibility? Which parts are intrinsically application-specific?
  • What are the requirements towards formats for representing semantic annotations?
  • Where is the borderline of automation and how can it be further pushed?
  • How can the linking with the semantic layer be supported on the concept/schema level as well as on the instance level?
  • How can knowledge extracted from different sources with different tools and perhaps different reference ontologies (interoperability) be merged (semi-)automatically?
  • How can extraction technologies for different media (e.g. text and images) be combined and how can the merged extraction results be represented in order to create synergies?

I don’t have much expertise in information extraction or the Semantic Web, but the challenges listed above seem quite appropriate for Powerset to look at.

July 22, 2007

Content creators and consumers have different biases

Filed under: Datamining, Information Retrieval, People and Data, Search — chucklam @ 6:16 pm

When doing Web analysis, if one only examines Web content, one needs to keep in mind that the results will be biased by the viewpoints of the content creators (or the “community” of content creators), which don’t necessarily reflect the viewpoints of the content consumers. If you don’t think there’s a gap between author and audience on the Web, simply do a Google search on “Robert”. Rather than getting results for famous Roberts like Robert DeNiro or Robert Kennedy, the top three results point to this one blogger, who’s somehow popular in the blogosphere but is pretty irrelevant otherwise.

To counter this content-creator bias, one should look at usage in addition to content. Recently I’ve started to see research that reveals the improvements one can get when usage data is used in addition to content data. (I’m specifically referring to technologies outside of collaborative filtering, which already has a habit of using usage data.)

  • Matt Richardson, Amit Prakash, and Eric Brill had a paper in WWW2006 called Beyond PageRank: Machine Learning for Static Ranking, where they developed a Web ranking system better than PageRank through a combination of a novel algorithm and new data sources. An interesting data source they used is the popularity data that the MSN toolbar collects (with permission) from users. These data reveal how frequently users actually visit particular sites, rather than how often a site gets linked to. The table below shows the top 10 URLs for PageRank versus fRank (Richardson et al.’s system). While many factors are involved, there are obviously different biases between Web authors and Web users: authors are notably more tech-oriented than the average user.

    PageRank                        fRank
    google.com                      google.com
    apple.com/quicktime/download    yahoo.com
    amazon.com                      americanexpress.com
    yahoo.com                       hp.com
    microsoft.com/windows/ie        target.com
    apple.com/quicktime             bestbuy.com
    mapquest.com                    dell.com
    ebay.com                        autotrader.com
    mozilla.org/products/firefox    dogpile.com
    ftc.gov                         bankofamerica.com
  • Xuanhui Wang and ChengXiang Zhai will be presenting a paper at SIGIR called Learn from Web Search Logs to Organize Search Results (pdf), where they use query log information to cluster search results. The idea of clustering search results has been around for a long time; I can immediately think of Marti Hearst’s scatter/gather system, although there is probably older work than that. Historically, the clustering has been done based on the content of the search results. Wang and Zhai instead examine the actual terms people use to search and come up with cluster categories that are more natural to users. An example they cite is the search for “area codes.” The top three clusters based on content analysis are “city, state”, “local, area”, and “international.” By looking at the query log, Wang and Zhai’s algorithm came up with the following three clusters: “telephone, city, international”, “phone, dialing”, and “zip, postal.” The log-based method much more accurately reflects people’s desire to look up either phone codes or zip codes when they search for “area codes.” In addition to producing better clusters, Wang and Zhai also found that the query log-based approach generates more meaningful labels for the clusters than the content-based approach. (A toy sketch of the general idea follows below.)
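
Here is a minimal sketch of that general idea: group results by the refinement terms people actually add to the query in the log, rather than by terms from the result content. The toy query log, snippets, and matching heuristic are my own illustrations, not Wang and Zhai’s actual algorithm.

    from collections import defaultdict

    # Toy query log: other queries that extend the original query "area codes".
    related_queries = [
        "telephone area codes", "international area codes", "area codes by city",
        "phone area codes dialing", "zip and postal area codes",
    ]

    # Toy search results: (url, snippet) pairs.
    results = [
        ("phonebook.example/city", "Look up telephone area codes for any US city."),
        ("dialing.example/guide", "How to dial: phone area codes and dialing prefixes."),
        ("postal.example/lookup", "Find zip and postal codes by address."),
    ]

    STOPWORDS = {"area", "codes", "by", "and", "for", "to"}

    # Refinement terms = words users add to the original query in the log.
    refinements = defaultdict(int)
    for q in related_queries:
        for word in q.split():
            if word not in STOPWORDS:
                refinements[word] += 1

    # Assign each result to the refinement terms its snippet mentions.
    clusters = defaultdict(list)
    for url, snippet in results:
        words = set(snippet.lower().replace(".", "").replace(":", "").split())
        for term in refinements:
            if term in words:
                clusters[term].append(url)

    for term, urls in clusters.items():
        print(term, "->", urls)

The cluster labels fall out of the users’ own vocabulary (“phone”, “dialing”, “zip”, “postal”), which is exactly why log-based labels tend to read more naturally than content-derived ones.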

To be fair, researchers usually have a very hard time getting usage data. It’s notable that both papers cited above used Microsoft data, and one of them was even written by researchers outside of Microsoft. Hopefully more companies will let researchers look at their data and publish their analyses.
