Data Strategy

July 31, 2007

Mashups getting attention from mainstream news media

Filed under: Data Collection, Open Data, Visualization — chucklam @ 3:50 am

I’ve come across two articles in the mainstream news media in less than a week on mashups. The New York Times has an article With Tools on Web, Amateurs Reshape Mapmaking that focuses on mapping mashups. The Wall Street Journal today has ‘Mashups’ Sew Data Together (subscription required).

It certainly seems like there are enough useful mashups already that the mainstream press is taking notice of the concept. However, mashups are really a fuzzy collection of three related applications: data visualization, data integration, and data collection. The unifying concept is that mashups do all three using lightweight Web interfaces.

The majority of existing mashups don’t do much more than data visualization, usually showing location-related information on a map. This doesn’t involve any data integration as it simply takes data from one source and feeds them to a visualization (i.e. mapping) engine.

Some mashups are starting to take data from more than one Web source (or even just scraping from different Web sites) and create new applications by integrating those data. Microformats are targeted at exactly this kind of mashup. As with traditional data integration projects, deep integration is usually thwarted by incompatible data definitions. The main hope is that lightweight integration, together with the large number of data sources, will produce a lot of mashups. Even though these mashups are not too sophisticated individually, collectively they will generate a lot of value. Think of the lightweight Web interfaces as the duct tape of data.
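To make "lightweight integration" a bit more concrete, here is a minimal Python sketch of the duct-tape approach: two sources that describe the same places with slightly different conventions are joined on a crudely normalized key. The records, field names, and the `normalize` and `integrate` helpers are all made up for illustration; a real mashup would fetch these from Web APIs or scraped pages.

```python
# Hypothetical records from two independent Web sources (all values invented).
listings = [
    {"city": "San Francisco, CA", "rent": 2400},
    {"city": "oakland, ca", "rent": 1500},
]
crime_stats = [
    {"location": "SAN FRANCISCO, CA", "incidents": 120},
    {"location": "Oakland,  CA", "incidents": 95},
]


def normalize(place):
    """Reduce a place name to a crude join key: lowercase, single spaces."""
    return " ".join(place.lower().split())


def integrate(listings, crime_stats):
    """Join the two sources on the normalized place name."""
    incidents_by_place = {
        normalize(row["location"]): row["incidents"] for row in crime_stats
    }
    merged = []
    for row in listings:
        key = normalize(row["city"])
        # Records with no counterpart in the other source are simply dropped.
        if key in incidents_by_place:
            merged.append(
                {"place": key, "rent": row["rent"], "incidents": incidents_by_place[key]}
            )
    return merged


merged = integrate(listings, crime_stats)
```

The join only works because both sources happen to spell places the same way once case and spacing are ignored. Anything deeper (different units, different notions of "city") is where the incompatible data definitions start to bite.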

Finally, mashups are also about data collection. That’s never stated as any mashup’s primary goal but is often a byproduct of the useful ones. The mapping APIs, for example, encourage the creation of many geographically-related data collections, as the NYT article points out. People are willing to contribute data, but they do want to see immediate gratification from such contributions, and good mashups can show them that. Furthermore, as operating systems win by building an ecosystem of applications around them, Web APIs depend on an ecosystem of data sources to turn them into platforms. I’d encourage anyone developing mashup APIs to think deeply about how people can potentially create data collections on top of their interfaces. An incoherent set of APIs doesn’t make a platform, just like a collection of device drivers doesn’t make an operating system.


July 30, 2007

Hadoop gaining momentum

Filed under: Infrastructure — chucklam @ 2:06 pm

While it’s not exactly a secret that Yahoo supports the open source project Hadoop, I think it’s only in the last week that I’ve seen an official description of Yahoo’s contribution. For those who don’t know, Hadoop is an open source implementation of Google’s MapReduce system. It provides a programming framework for processing large datasets using clusters of machines. By following the Hadoop framework, programmers are automatically given scalability and reliability on a distributed processing platform.
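For readers who haven't seen the programming model, the sketch below simulates the map, shuffle, and reduce steps of a word count in a single Python process. It is not Hadoop's API (Hadoop jobs are normally written in Java), and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are my own; the point is only to show what the programmer writes (the map and reduce functions) versus what the framework handles (grouping by key, and in a real cluster, distribution and fault tolerance).

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Run the three steps in-process over a list of input records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both can be spread across many machines without the programmer changing either function.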

Besides supporting the basic project, Yahoo (through its Yahoo Research division) has also developed an abstraction interface layer on top of Hadoop called Pig. Pig allows one to express data analysis tasks in relational algebra, giving them some resemblance to SQL.

Another extension to Hadoop is Hbase, a clone of Google’s Bigtable used to store structured data over a distributed system. And in case you don’t have clusters of machines lying around, Hadoop has also been extended with the ability to run on Amazon’s EC2.

Furthermore, some Stanford researchers have recently published a paper called Map-Reduce for Machine Learning on Multicore (pdf) that demonstrates how many popular machine learning algorithms (such as logistic regression, naive Bayes, SVM, neural networks, and others) can easily be adapted to the MapReduce framework for learning from large datasets. The paper is not specific to Hadoop, but it certainly points to how Hadoop can be further extended for large-scale machine learning projects.
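As a rough illustration of why such algorithms fit MapReduce, consider any learner whose fitting step boils down to sums over the training data. The sketch below is my own toy example, not code from the paper: it fits a one-variable least-squares line by having each "mapper" compute partial sums over its shard and a "reducer" add them up. The function names and the sample data are assumptions.

```python
def map_partial_sums(shard):
    """Map: compute the sufficient statistics for one shard of (x, y) pairs."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return (n, sx, sy, sxx, sxy)


def reduce_sums(partials):
    """Reduce: add the per-shard statistics element-wise."""
    return tuple(sum(values) for values in zip(*partials))


def fit_line(shards):
    """Closed-form least-squares fit of y = slope * x + intercept from the totals."""
    n, sx, sy, sxx, sxy = reduce_sums([map_partial_sums(s) for s in shards])
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two shards of made-up points lying on y = 2x + 1.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(shards)
```

The totals come out the same however the points are split into shards, which is what lets the map step run in parallel.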

July 29, 2007

Technology Review article on Powerset

Filed under: Search — chucklam @ 2:24 pm

OK, I know this is a bit much – three posts in a row that are related to Powerset, and there will still be a Part 2 to the Powerset demo post. But an article on Powerset just came out from Technology Review, and there are a few interesting nuggets in it.

A key component of the search engine is a deep natural-language processing system that extracts the relationships between words; the system was developed from PARC’s Xerox Linguistic Environment (XLE) platform. The framework that this platform is based on, called Lexical Functional Grammar, enabled the team to write different grammar engines that help the search engine understand text. This includes a robust, broad-coverage grammar engine written by PARC.

The article also mentions a semantic search engine being developed by IBM.

IBM is also in the midst of developing a semantic search engine, code-named Avatar, which is targeted at enterprise and corporate customers; it’s currently in beta testing within IBM. Project manager Shivakumar Vaithyanathan says that the hardest problems to overcome with natural-language search are finding a way to extract higher-level semantics from large documents while at the same time preserving precision and speed.

July 27, 2007

From information extraction to semantic representation

Filed under: Information Retrieval, Search — chucklam @ 6:14 pm

I suppose it’s a huge coincidence that on the day I posted videos of Powerset’s first public demo, I got this workshop announcement in my inbox:

From Information Extraction to Semantic Representation
The Workshop focuses on the interface between the information extracted from content objects and the semantic layer in which this information is explicitly represented in the form of ontologies and their instances. The Workshop will provide an opportunity for discussing adequate methods, processes (pipelines) and representation formats for the annotation process.

Automating the process of semantic annotation of content objects is a crucial step for bootstrapping the Semantic Web. This process requires a complex flow of activities which combines competences from different areas. The Workshop will focus precisely on the interface between the information extracted from content objects (e.g., using methods from NLP, image processing, text mining, etc.) and the semantic layer in which this information is explicitly represented in the form of ontologies and their instances. The workshop will provide an opportunity for: discussing adequate methods, processes (pipelines) and representation formats for the annotation process; reaching a shared understanding with respect to the terminology in the area; discussing the lessons learned from projects in the area, and putting together a list of the most critical issues to be tackled by the research community to make further progress in the area.

Some of the possible challenges to be discussed at the workshop are:

  • How can ontological/domain knowledge be fed back into the extraction process?
  • How can the semantic layer be extended by the results of information extraction (e.g. ontology learning)?
  • What are the steps of an annotation process, which steps can be standardized for higher flexibility? Which parts are intrinsically application-specific?
  • What are the requirements towards formats for representing semantic annotations?
  • Where is the borderline of automation and how can it be further pushed?
  • How can the linking with the semantic layer be supported on the concept/schema level as well as on the instance level?
  • How can knowledge extracted from different sources with different tools and perhaps different reference ontologies (interoperability) be merged (semi-)automatically?
  • How can extraction technologies for different media (e.g. text and images) be combined and how can the merged extraction results be represented in order to create synergies?

I don’t have much expertise in information extraction or the semantic web, but the challenges stated seem quite appropriate for Powerset to look at.

Powerset demo (Part 1, the videos)

Filed under: Search — chucklam @ 3:53 am

I went to Powerset’s first public demo on Tuesday. In Part 1 here, I’ll just post the videos with brief descriptions. In Part 2, I’ll write what I think of their technology.

I barely made it to the event on time and was stuck in the back of the crowd holding the camcorder above other people’s heads, so the recording of the introductory presentation was not very stable. It was basically a PowerPoint with some live demonstration of a few queries where Powerset got much better results than Google. (No surprises there.) Note that all the demos throughout the evening were only searching over Wikipedia. The side-by-side comparisons always have the Powerset results on the left side and Google results on the right side. Unfortunately, due to compression, you probably won’t be able to read a lot of the search results in the videos in this post. I’ll try to describe them in more detail in Part 2.

[Video: posts_id=322855]

At the end of the presentation, Barney Pell, the Powerset CEO, came through the back of the crowd. When I spotted him, I figured this was the perfect time for my journalistic debut. So I interviewed him about the event and asked for his Powerset elevator pitch.

[Video: posts_id=322842]

OK, the lighting got a bit funky in the second half of that clip, and I won’t be putting Bambi Francisco out of a job any time soon…

The interesting part of the night was the demo station where they allowed people to compare search results between Powerset and Google. The queries were limited to the form of “What did ___ say?” and people were welcome to fill in the blank with famous names. The first two names mentioned in the clip were Sergey Brin and Larry Page. The third name tried was the legendary samurai Ieyasu Tokugawa.

After each query, the user was encouraged to vote using one of the three buttons underneath the search box. He could vote that 1) Powerset results were better, 2) Google results were better, or 3) It’s a tie, and the buttons kept track of the counts. In almost all cases, the results were either a tie or Powerset had better answers. The Powerset person assisting the demo and explaining things was Mark Johnson, a product manager for Powerlabs.

[Video: posts_id=322866]

More demo. This time someone asked “What did Pooh say?”

[Video: posts_id=322872]

Even with the side-by-side comparison between Powerset and Google, the official line is that they “love Google…”

[Video: posts_id=322831]

At the end of all the videotaping, I wish Powerset could answer my query “Who ate all the food??” 😉

[Video: posts_id=322823]

July 26, 2007

Search Engine Land started new column to be written by employees of search engines

Filed under: Search — chucklam @ 3:41 pm

Search Engine Land has started a new column to be written by employees of search engines.

Our newest Search Engine Land column, Search On Search, launches today. Search On Search is a column written by employees of search engines. Columnists are free to discuss technical details of how their search engine works or write opinion pieces that may or may not reflect the official position…

The first post of the column is by Gary Price of He wrote about a subscription-based archiving service called Archive-it that’s provided by The Internet Archive for institutional clients. According to the Archive-it web site, “Subscribers can capture, catalog, and archive their institution’s own web site or build collections from the web, and then search and browse the collection when complete.”

July 24, 2007

Web seminars on data mining

Filed under: Bayesian networks, Datamining — chucklam @ 2:05 am

Via KDnuggets: ACM’s Special Interest Group on Knowledge Discovery and Data Mining has two interesting webinars coming up. One is on exploiting link data in data mining. The other is a tutorial on learning Bayesian networks. You can register for either event here. They’re both free and given by noted experts in those areas. More info below.

Exploring the Power of Links in Data Mining
Thursday, July 26, 2007 11:30 am ET (duration 1 hour)
Jiawei Han
University of Illinois at Urbana-Champaign Register at (free)

Algorithms like PageRank and HITS have been developed in late 1990s to explore links among Web pages to discover authoritative pages and hubs. Links have also been popularly used in citation analysis and social network analysis. We show that the power of links can be explored thoroughly at data mining in classification, clustering, information integration, and other interesting tasks. Some recent results of our research that explore the crucial information hidden in links will be introduced, including (1) multi-relational classification, (2) user-guided clustering, (3) link-based clustering, and (4) object distinction analysis. The power of links in other analysis tasks will also be discussed in the talk.

Jiawei Han, Professor, Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, database systems, data mining from spatiotemporal data, multimedia data, stream and RFID data, Web data, social network data, and biological data, with over 300 journal and conference publications.

He has chaired or served on over 100 program committees of international conferences and workshops, including PC co-chair of 2005 (IEEE) International Conference on Data Mining (ICDM), Americas Coordinator of 2006 International Conference on Very Large Data Bases (VLDB). He is also serving as the founding Editor-In-Chief of ACM Transactions on Knowledge Discovery from Data. He is an ACM Fellow and has received 2004 ACM SIGKDD Innovations Award and 2005 IEEE Computer Society Technical Achievement Award. His book “Data Mining: Concepts and Techniques” (2nd ed., Morgan Kaufmann, 2006) has been popularly used as a textbook worldwide.

Please register at (webcast is free)
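For anyone unfamiliar with PageRank, which the abstract above mentions as a starting point, here is a minimal power-iteration sketch in Python. It is the textbook formulation, not anything from Han's talk; the toy link graph, the damping factor of 0.85, and the function name are my own choices for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outbound links.

    Every link target is assumed to also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a baseline share from the "random jump".
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# A made-up four-page Web: three pages link to "c", nobody links to "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph "c" ends up with the highest score because the most rank flows into it, and "d" with the lowest because nothing links to it.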

The second webinar…

Learning Bayesian Networks
Wed, Aug 1, 2007, 1 pm PT, 4 pm ET (duration 1 hour)
Richard E. Neapolitan
Northeastern Illinois University Register at (free)

Bayesian networks are graphical structures for representing the probabilistic relationships among a large number of variables and doing probabilistic inference with those variables. The 1990’s saw the emergence of excellent algorithms for learning Bayesian networks from passive data. In 2004 I unified this research with my text Learning Bayesian Networks. This tutorial is based on that text and my paper.

Neapolitan, R.E., and X. Jiang, “A Tutorial on Learning Causal Influences,” in Holmes, D. and L. Jain (Eds.): Innovations in Machine Learning, Springer-Verlag, New York, 2005.

I will discuss the constraint-based method for learning Bayesian networks using an intuitive approach that concentrates on causal learning. Then I will show a few real examples.
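To give a feel for the "probabilistic inference" part of that description, here is a tiny hand-built Bayesian network in Python, with inference done by brute-force enumeration. It does not cover learning the network from data, which is the subject of the tutorial. The rain/sprinkler/wet-grass structure is a standard textbook toy, and all the probabilities below are made up.

```python
from itertools import product

# Network: Rain -> WetGrass <- Sprinkler (Rain and Sprinkler are independent).
# All numbers are invented for illustration.
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability, factorized along the network's edges."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def prob_rain_given_wet():
    """P(Rain=True | WetGrass=True), summing out the unobserved Sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


p_rain_given_wet = prob_rain_given_wet()
```

Seeing wet grass raises the probability of rain well above its prior of 0.2. Enumeration like this blows up exponentially with the number of variables, which is one reason the graph structure matters in practice.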

Richard E. Neapolitan is Professor and Chair of Computer Science at Northeastern Illinois University. He has previously written three books including the seminal 1990 Bayesian network text Probabilistic Reasoning in Expert Systems. More recently, he wrote the 2004 text Learning Bayesian networks, and Foundations of Algorithms, which has been translated to three languages and is one of the most widely-used algorithms texts world-wide. His books have the reputation of making difficult concepts easy to understand because of the logical flow of the material, the simplicity of the explanations, and the clear examples.

July 22, 2007

Search engines improve privacy of search data

Filed under: Privacy, Search — chucklam @ 11:17 pm

Via today’s WSJ article (subscription required),

Today, Microsoft Corp. will announce new policies and technologies to protect the privacy of users of its Live Search services, say executives at the software maker. The company, along with IAC/InterActiveCorp.’s, will also announce plans to try to kick-start an industrywide initiative to establish standard practices for retaining users’ search histories.

Meanwhile, Yahoo Inc. this week will begin detailing its plans for a policy to make all of a user’s search data anonymous within 13 months of receiving it, except where users request otherwise or where the company is required to retain the information for law-enforcement or legal processes, according to a spokesman.

It’s a welcome effort for the search engines to start valuing privacy. The WSJ article also points out an interesting strategic implication,

Both [Microsoft and Ask] lag far behind Google and Yahoo in Internet-search market share and thus have far less data about search behaviors than their rivals. By calling for more defined standards on privacy, Microsoft could indirectly limit Google’s ability to use its vast stores of information to improve its services.

Not only does Google have more data due to their higher market share, they’ve also been collecting such data far longer than anyone else. (I mean usable data. Of course Microsoft and Yahoo have been around a lot longer than Google, but they don’t seem to have had a coherent data collection policy for a long time and much of their older data may not be very usable.) Google’s recent strategic emphasis on personalization would also be blunted if they’re limited in how much data they can collect. “Privacy” policies that lessen the usefulness of older data would hurt Google a lot more than it would hurt Microsoft.

Content creators and consumers have different biases

Filed under: Datamining, Information Retrieval, People and Data, Search — chucklam @ 6:16 pm

When doing Web analysis, if one only examines Web content, one needs to keep in mind that the results will be biased by the viewpoints of the content creators (or the “community” of content creators), which don’t necessarily reflect the viewpoints of the content consumers. If you don’t think there’s a gap between author and audience on the Web, simply do a Google search on “Robert”. Rather than getting results for famous Roberts like Robert DeNiro or Robert Kennedy, the top three results point to this one blogger, who’s somehow popular in the blogosphere but is pretty irrelevant otherwise.

To counter this content-creator bias, one should look at usage in addition to content. Recently I’ve started to see research that reveals the improvements one can get when usage data is used in addition to content data. (I’m specifically addressing technologies outside of collaborative filtering, which already has a habit of using usage data.)

  • Matt Richardson, Amit Prakash, and Eric Brill had a paper in WWW2006 called Beyond PageRank: Machine Learning for Static Ranking, where they developed a better Web ranking system than PageRank by combining a novel algorithm with new data sources. An interesting data source they used is the popularity data that MSN toolbar collects (with permission) from users. These data reveal the frequency at which users actually visit particular sites, rather than how often a site gets linked to. The table below shows the top 10 URLs for PageRank versus fRank (Richardson et al’s system). While many factors are involved, there are obviously different biases between Web authors and users. Web authors are notably more tech-oriented than the average Web user.

    [Table: top 10 URLs, PageRank vs. fRank]
  • Xuanhui Wang and ChengXiang Zhai will be presenting a paper in SIGIR called Learn from Web Search Logs to Organize Search Results (pdf), where they use query log information to cluster search results. The idea of clustering search results has been around for a long time. I can immediately think of Marti Hearst’s scatter/gather system, although there is probably older work than that. Historically the clustering has been done based on the content of the search results. Xuanhui Wang and ChengXiang Zhai examine the actual terms that people use to search and come up with cluster categories that are more natural to the users. An example they cite is the search for “area codes.” The top three clusters based on content analysis are “city, state”, “local, area”, and “international.” By looking at the query log, Wang and Zhai’s algorithm came up with the following three clusters: “telephone, city, international”, “phone, dialing”, and “zip, postal.” The log-based method much more accurately reflects people’s desire to look up either “phone codes” or “zip codes” when they search for “area codes.” In addition to getting better clusters, Wang and Zhai also found that the query log-based approach generates more meaningful labels for the clusters than the content-based approach.
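To show the flavor of mining a query log for the terms users actually type, here is a deliberately naive Python sketch. It is not Wang and Zhai's algorithm: it just counts which other terms show up in past queries that share a word with the current query. The log entries, the stopword list, and the `related_terms` helper are all invented for illustration.

```python
from collections import Counter

# A made-up query log; a real one would hold millions of entries.
query_log = [
    "telephone area codes",
    "area codes by city",
    "international area codes",
    "phone area codes",
    "zip codes",
    "postal zip codes",
]


def related_terms(query, log, stopwords=frozenset({"by", "the", "of"})):
    """Count terms that co-occur with any word of `query` in past queries."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for past_query in log:
        terms = past_query.lower().split()
        # Only past queries that overlap with the current query are considered.
        if query_terms & set(terms):
            counts.update(
                t for t in terms if t not in query_terms and t not in stopwords
            )
    return counts


terms = related_terms("area codes", query_log)
```

Even this crude count surfaces user vocabulary such as "zip" and "phone" that never has to appear in the content of the result pages, which is the gap between content-based and usage-based analysis discussed above.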

To be fair, researchers usually have a very hard time getting usage data. It’s notable that both papers cited above have used Microsoft data, and one of them is even done by researchers outside of Microsoft. Hopefully more companies will let researchers look at their data and publish their analyses.

July 20, 2007

EU authorizes $165M for search engine research

Filed under: Search — chucklam @ 6:29 pm

I first read the story on TechCrunch, which has a particularly Silicon Valley-ish interpretation of it. From an AP report,

The European Union on Thursday authorized Germany to give $165 million for research on Internet search-engine technologies that could someday challenge U.S. search giant Google Inc.

The Theseus research project — the German arm of what the French call Quaero — is aiming to develop the world’s most advanced multimedia search engine for the next-generation Internet. It would translate, identify and index images, audio and text.

The TechCrunch post portrays this as a governmental attempt to fight back against Google, which is not totally wrong. After all, Google has more than 90% market share in Germany and some other EU countries. Coupled with growing anti-American sentiment, it is perfectly reasonable for Germany and France to want to seek an alternative. Imagine if Baidu or Mixi or Naver had 90% market share in the US. Americans would probably react quite strongly also.

However, casual chats with some European researchers show that the European sentiment is more nuanced. Yes, they’d appreciate more viable alternatives in the market, but in some sense they’ve actually given up already. Google has won fairly, and European researchers are now so far behind that they think it’s pointless to challenge Google directly. The point of these government projects is less about usurping Google than about preventing a similar kind of dominance in the next generation of the Web. Therefore they don’t do much research on “Web search” per se, that is, on the query log analysis, search result ranking, search advertising, etc. that Microsoft, Yahoo, and many American researchers still look at. Rather, they focus much more on semantic web and multimedia information retrieval, technologies that are more about the next generation of the Web, and where no one has clear dominance yet.
