Data Strategy

January 15, 2008

MetaWeb receives $42M investment

Filed under: Data Collection, People and Data, Search — chucklam @ 2:45 am

VentureBeat just reported that MetaWeb Technologies received a $42.4M second round investment. I haven’t had time to fully play around with its Freebase database yet, but they certainly seem to be building a war chest for something.

November 19, 2007

IEEE Computer special issue on search

Filed under: Advertising, Collective wisdom, People and Data, Personalization, Search — chucklam @ 6:08 pm

I’m quite behind on a lot of my reading, so I only got around to reading IEEE Computer’s August special issue on search this weekend. (Abstracts are free, but the actual PDFs require an expensive subscription or an expensive purchase.) It includes the following articles:

  • Search Engines that Learn from Implicit Feedback
  • A Community-Based Approach to Personalizing Web Search
  • Sponsored Search: Is Money a Motivator for Providing Relevant Results?
  • Deciphering Trends in Mobile Search
  • Toward a PeopleWeb

The articles were written by a mix of university academics and researchers from Google and Yahoo. They seem targeted at giving the general practitioner a sampling of current research, rather than being comprehensive in any specific domain or deep in a particular research area.

For me, the most interesting article is “Search Engines that Learn from Implicit Feedback” by Thorsten Joachims and Filip Radlinski of Cornell University. It’s a very accessible summary of the research those two have been doing over the last few years. To start off their research, they used eye-tracking experiments to characterize how people react to search engine rankings. They found that the ranking order strongly biases what people view and therefore click on. A result ranked first will often be clicked on more often than a better result ranked second or third, as some users may not even have looked beyond the first result. The straightforward assumption that a click is the equivalent of a positive vote is therefore naive. Instead, they examine results that were not clicked on even though a lower-ranked result was. For example, if the results at rankings 3 and 4 are clicked on, but not the result at ranking 2, then one can reasonably infer that the result at ranking 2 is worse than the ones at rankings 3 and 4, and can use that knowledge to improve the search engine. Note that if the result at ranking 1 was clicked on, nothing new is learned. People are so biased towards clicking the first result that only if it was not clicked on would that be considered informative.
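To make the click-versus-skip reasoning concrete, here is a minimal Python sketch. It is my own illustration of the idea as summarized above, not code from the article; the function name and the document IDs are hypothetical. Every clicked result is treated as preferred over each unclicked result ranked above it.

```python
def skipped_above_preferences(ranking, clicked):
    """Return (preferred, less_preferred) pairs inferred from one result page.

    ranking: document IDs in the order they were shown (hypothetical IDs).
    clicked: document IDs the user clicked on.
    """
    clicked_set = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc not in clicked_set:
            continue
        # Every unclicked result shown above a clicked one was skipped.
        for above in ranking[:position]:
            if above not in clicked_set:
                pairs.append((doc, above))
    return pairs


ranking = ["d1", "d2", "d3", "d4", "d5"]

# Clicks at rankings 3 and 4 but not 2: d2 is judged worse than d3 and d4.
print(skipped_above_preferences(ranking, ["d3", "d4"]))

# A click on the top result alone yields no preference pairs.
print(skipped_above_preferences(ranking, ["d1"]))
```

Pairs like these are relative judgments rather than absolute relevance scores, which is why they suit a learner that is trained on pairwise preferences.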

Under that model, they can interleave the results from two different search engines (or algorithms) and evaluate which one is better based on users’ clickthroughs. This insight led them to develop a ranking SVM model to learn search engine rankings. The new algorithm was shown to create a better meta-search engine as well as a better domain-specific search engine.
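As a rough illustration of the interleaving idea, the Python sketch below merges two rankings by strict alternation and credits each click to the ranker that contributed the clicked result. This is a deliberately simplified scheme of my own, not the specific interleaving method used in the research; the function names and document IDs are hypothetical, and a real experiment would at least randomize which ranker picks first.

```python
def interleave(ranking_a, ranking_b):
    """Alternate picks from rankers A and B, skipping results already shown.

    Returns the merged list and a dict mapping each result to "A" or "B".
    """
    lists = {"A": ranking_a, "B": ranking_b}
    pos = {"A": 0, "B": 0}
    merged, source = [], {}
    turn = "A"  # assumption: A always picks first (real setups randomize this)
    while pos["A"] < len(ranking_a) or pos["B"] < len(ranking_b):
        docs = lists[turn]
        # Skip results the other ranker has already contributed.
        while pos[turn] < len(docs) and docs[pos[turn]] in source:
            pos[turn] += 1
        if pos[turn] < len(docs):
            doc = docs[pos[turn]]
            source[doc] = turn
            merged.append(doc)
            pos[turn] += 1
        turn = "B" if turn == "A" else "A"
    return merged, source


def credit_clicks(source, clicked):
    """Count how many clicked results each ranker contributed."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in source:
            wins[source[doc]] += 1
    return wins


merged, source = interleave(["x", "y", "z"], ["y", "w", "x"])
print(merged)
print(credit_clicks(source, ["y", "w"]))
```

Summed over many queries, the ranker that collects more credited clicks would be judged the better one.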

November 9, 2007

List of accepted papers for WSDM’08

The first ACM conference on Web Search and Data Mining (WSDM), to be held at Stanford on Feb. 11-12, has released its list of accepted papers. A total of 24 papers will be presented. The following ones already sound interesting based on their titles.

  • An Empirical Analysis of Sponsored Search Performance in Search Engine Advertising – Anindya Ghose and Sha Yang
  • Ranking Web Sites with Real User Traffic – Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini and Alessandro Vespignani
  • Identifying the Influential Bloggers – Nitin Agarwal, Huan Liu, Lei Tang and Philip Yu
  • Can Social Bookmarks Improve Web Search? – Paul Heymann, Georgia Koutrika and Hector Garcia-Molina
  • Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? – Qiaozhu Mei and Kenneth Church

November 7, 2007

MySpace looking for a search architect

Filed under: Information Retrieval, Search — chucklam @ 2:46 pm

Not sure if I should read too much into this, but MySpace is currently looking for a search architect. The candidate will have “years of experience and expertise in high volume storage, indexing, and searching to help architect, scale, and optimize the engines and API’s behind vertical search applications. Knowledge of search technologies such as Lucene, Lucene.Net and Xapian is important…”

November 3, 2007

Powerset shuffles management, looks for new CEO

Filed under: Search — chucklam @ 4:14 pm

Barney Pell, CEO of Powerset, a natural language search company, has stated in his blog that he’s transitioning to the role of CTO and the company is looking for a new CEO. Co-founder Steve Newcomb has left the company.

More details are being reported at VentureBeat. Powerset’s product release date has been pushed back to the second quarter of 2008 and possibly later. I vlogged about their first public demo back in July.

October 8, 2007

Jimmy Wales on “Free Culture and the Future of Search”

Filed under: Collective wisdom, Information Retrieval, Open Data, Search — chucklam @ 2:10 am

Just got word that Jimmy Wales, founder of Wikipedia and Wikia, will be speaking at The Stanford Law and Technology Association this Thursday (Oct. 11) on the topic of “Free Culture and the Future of Search.” The talk will be in Room 190, Stanford Law School, at 12:45pm. Worth checking out if you’re in the area.

August 14, 2007

Testing semantic search

Filed under: Information Retrieval, Search — chucklam @ 2:25 am

The Hakia blog has an interesting post on creating query sets to test semantic (i.e. “natural language”) search. The main point is that the query set must have sufficient coverage to test four aspects of natural language search: query type, query length, content type, and word sense disambiguation.

Testing of a semantic search engine requires at least 330 queries just to scratch the surface. hakia’s internal fitness tests, for example, use couple of thousand queries. Therefore, if you see any report or article about the evaluation of a search engine using a dozen of queries, even if it includes valuable insight, it will tell you nothing about the overall state of that search engine.

The distribution of test queries will ultimately be based on real usage, which of course is limited at this point. In general, the more sophisticated the function a system is supposed to perform, the more test cases one needs to evaluate it. For something as sophisticated as natural language search (or even plain keyword search), testing is definitely non-trivial.

Powerset’s first public demo was limited in all four of those aspects. Of course, keep in mind that the demo was just to give the public a peek at Powerset’s search technology, not an exhaustive evaluation of it. It was also meant to illustrate the difference between Powerset’s approach and Google’s approach, not to compare and contrast Powerset with Hakia. Personally, I’m just glad that both startups are starting to reveal some of their inner workings.

August 8, 2007

Early results on Google’s personalized search

Filed under: Information Retrieval, Personalization, Privacy, Search — chucklam @ 4:29 pm

Read/WriteWeb has a couple of interesting posts on Google’s personalized search. One is a primer written by Greg Linden. In the primer, Greg explains what personalized search is, the motivation for it, Google’s personalization technology from its Kaltix acquisition, and other approaches to personalization. I agree with one of the commenters that Kaltix’s technology probably plays only a small role in Google’s personalized search today, as that technology was invented many years ago.

The other post on Read/WriteWeb discusses the findings of a poll in which R/WW asked their readers how effective Google’s personalized search is. Their conclusion is that it is quite underwhelming. 48% “haven’t noticed any difference.” Although 12% claim their “search results have definitely improved,” 9% think their search results “have gotten worse.”

I would venture to guess that R/WW readers are more Web savvy and use search engines more often than the average person. Thus Google would have a lot more data about them and give them the most personalized results. (Yes, it’s only a conjecture. They may in fact be so savvy that they use anonymizing proxies, multiple search engines, etc.) If that’s the case, then even more than 48% of the general population wouldn’t “notice any difference” as Google would not be as aggressive in giving them personalized results.

Google has historically justified the storage of personal information as a necessity for providing personalized services. If they’re unable to demonstrate significant benefits from personalization, then there’s more ammunition for privacy advocates to restrict Google’s data collection practices.

As a user I’m not impressed with Google’s personalized search, and I think the use cases for personalized search that Google and others have talked about are usually too contrived anyway, but I believe it’s still too early to jump to conclusions. Like most R/WW readers, I’m quite savvy with search. I’m pretty good at specifying queries to find what I’m looking for. (And search results that I’m not happy with are almost never due to a lack of personalization.) I know I should search for ‘java indonesia’ if I want to find out about the island and ‘java sdk’ if I’m looking for the Java development kit. If I live in Texas, I won’t necessarily be impressed if searching for ‘paris’ gives me results for Paris, Texas instead of Paris in France. Of course, that’s just me. Other people may be different… or maybe they’re not.

July 29, 2007

Technology Review article on Powerset

Filed under: Search — chucklam @ 2:24 pm

OK, I know this is a bit much – three posts in a row that are related to Powerset, and there will still be a Part 2 to the Powerset demo post. But an article on Powerset just came out from Technology Review, and there are a few interesting nuggets in it.

A key component of the search engine is a deep natural-language processing system that extracts the relationships between words; the system was developed from PARC’s Xerox Linguistic Environment (XLE) platform. The framework that this platform is based on, called Lexical Functional Grammar, enabled the team to write different grammar engines that help the search engine understand text. This includes a robust, broad-coverage grammar engine written by PARC.

The article also mentions a semantic search engine being developed by IBM.

IBM is also in the midst of developing a semantic search engine, code-named Avatar, which is targeted at enterprise and corporate customers; it’s currently in beta testing within IBM. Project manager Shivakumar Vaithyanathan says that the hardest problems to overcome with natural-language search are finding a way to extract higher-level semantics from large documents while at the same time preserving precision and speed.

July 27, 2007

From information extraction to semantic representation

Filed under: Information Retrieval, Search — chucklam @ 6:14 pm

I suppose it’s a huge coincidence that on the day I posted videos of Powerset’s first public demo, I got this workshop announcement in my inbox:

From Information Extraction to Semantic Representation
The Workshop focuses on the interface between the information extracted from content objects and the semantic layer in which this information is explicitly represented in the form of ontologies and their instances. The Workshop will provide an opportunity for discussing adequate methods, processes (pipelines) and representation formats for the annotation process.

Automating the process of semantic annotation of content objects is a crucial step for bootstrapping the Semantic Web. This process requires a complex flow of activities which combines competences from different areas. The Workshop will focus precisely on the interface between the information extracted from content objects (e.g., using methods from NLP, image processing, text mining, etc.) and the semantic layer in which this information is explicitly represented in the form of ontologies and their instances. The workshop will provide an opportunity for: discussing adequate methods, processes (pipelines) and representation formats for the annotation process; reaching a shared understanding with respect to the terminology in the area; discussing the lessons learned from projects in the area, and putting together a list of the most critical issues to be tackled by the research community to make further progress in the area.

Some of the possible challenges to be discussed at the workshop are:

  • How can ontological/domain knowledge be fed back into the extraction process?
  • How can the semantic layer be extended by the results of information extraction (e.g. ontology learning)?
  • What are the steps of an annotation process, which steps can be standardized for higher flexibility? Which parts are intrinsically application-specific?
  • What are the requirements towards formats for representing semantic annotations?
  • Where is the borderline of automation and how can it be further pushed?
  • How can the linking with the semantic layer be supported on the concept/schema level as well as on the instance level?
  • How can knowledge extracted from different sources with different tools and perhaps different reference ontologies (interoperability) be merged (semi-)automatically?
  • How can extraction technologies for different media (e.g. text and images) be combined and how can the merged extraction results be represented in order to create synergies?

I don’t have much expertise in information extraction or the Semantic Web, but it seems like the challenges stated are quite appropriate for Powerset to look at.
