Data Strategy

November 19, 2007

IEEE Computer special issue on search

Filed under: Advertising, Collective wisdom, People and Data, Personalization, Search — chucklam @ 6:08 pm

I’m quite behind on my reading, so I only got around to IEEE Computer’s (August) special issue on search this weekend. (Abstracts are free, but the actual PDFs require an expensive subscription or an expensive purchase.) It includes the following articles:

  • Search Engines that Learn from Implicit Feedback
  • A Community-Based Approach to Personalizing Web Search
  • Sponsored Search: Is Money a Motivator for Providing Relevant Results?
  • Deciphering Trends in Mobile Search
  • Toward a PeopleWeb

The articles were written by a mix of university academics and researchers from Google and Yahoo. They seem aimed at giving the general practitioner a sampling of current research, rather than being comprehensive in any specific domain or deep in a particular research area.

For me, the most interesting article is “Search Engines that Learn from Implicit Feedback” by Thorsten Joachims and Filip Radlinski of Cornell University. It’s a very accessible summary of the research those two have been doing in the last few years. They started with eye-tracking experiments to characterize how people react to search engine rankings, and found that the ranking order strongly biases what people view and therefore click on. The top-ranked result will often be clicked more than a better result ranked second or third, since some users never look beyond the first result. The straightforward assumption that a click is the equivalent of a positive vote is therefore naive.

Instead, they look at results that users skipped on their way to clicking a lower-ranked one. For example, if the results at ranks 3 and 4 are clicked but the result at rank 2 is not, then one can infer that users probably found the result at rank 2 worse than the ones at ranks 3 and 4, and that knowledge can be used to improve the search engine. Note that a click on the result at rank 1 teaches nothing new. People are so biased towards clicking the first result that only its not being clicked is considered informative.
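To make the skipped-result idea concrete, here is a minimal sketch in Python. It is my own illustration, not code from the article: the function name `skip_above_preferences` and the result labels are made up, and it assumes we know the ranked list shown to the user and the set of results that were clicked.

```python
from typing import List, Set, Tuple


def skip_above_preferences(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs inferred from one result page.

    Hypothetical helper: a clicked result is taken to be preferred over every
    unclicked result that was ranked above it.
    """
    preferences = []
    for position, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Unclicked results ranked above a clicked one were presumably seen and skipped.
        for skipped in ranking[:position]:
            if skipped not in clicked:
                preferences.append((doc, skipped))
    return preferences


# Clicks on ranks 1, 3 and 4, but not on rank 2 (labels are illustrative).
print(skip_above_preferences(["r1", "r2", "r3", "r4"], {"r1", "r3", "r4"}))
# → [('r3', 'r2'), ('r4', 'r2')]
```

The click on the first result produces no pair at all, which matches the point that a click at the top position carries no new information.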

Under that model, they can interleave the results from two different search engines (or algorithms) and evaluate which one is better based on users’ clickthroughs. This insight led them to develop a ranking SVM model to learn search engine rankings. The new algorithm was shown to create a better meta-search engine as well as a better domain-specific search engine.
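As a rough sketch of what interleaving two rankings can look like, the Python below uses a scheme commonly called team-draft interleaving. That is a substitute I picked for illustration and not necessarily the exact method the article describes; the function names and the document labels are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Tuple


def team_draft_interleave(
    ranking_a: List[str], ranking_b: List[str], rng: random.Random
) -> Tuple[List[str], Dict[str, str]]:
    """Merge two rankings and remember which engine ("A" or "B") supplied each result."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved: List[str] = []
    team_of: Dict[str, str] = {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The engine with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidate = next((d for d in rankings[team] if d not in team_of), None)
        if candidate is None:
            # This engine has nothing new left, so the other one fills the slot.
            team = "B" if team == "A" else "A"
            candidate = next(d for d in rankings[team] if d not in team_of)
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(team_of: Dict[str, str], clicked: Iterable[str]) -> Dict[str, int]:
    """Count how many clicked results each engine contributed."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


merged, team_of = team_draft_interleave(["a", "b", "c"], ["b", "d", "a"], random.Random(0))
print(merged, credit_clicks(team_of, ["b"]))
```

Summed over many queries, the engine that collects more credited clicks would be judged the better one. Because both engines' results are mixed into a single list, the position bias affects them roughly equally.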


November 5, 2007

Panel on Web 3.0

Filed under: Collective wisdom, Data Collection, People and Data — chucklam @ 1:04 am

Not sure if I’ll have time to go to this, but this seems like the one event to figure out what “Web 3.0” is about.

The MIT/Stanford Venture Lab presents:
“Web 3.0: New Opportunities on the Semantic Web”

Date: Tuesday, November 20
Time: 6:00 PM
Location: Bishop Auditorium, Stanford University

* Robert Cook, Co-founder and Executive VP of Product Development, Metaweb
* Nova Spivack, CEO and Founder, Radar Networks
* Alex Iskold, CEO and Founder, Adaptive Blue
* Paul Kedrosky, Venture Partner, Ventures West

We are well into the current era of the Web, commonly referred to as Web 2.0. What lies on the horizon? Will Web 3.0 usher in the long awaited vision of the semantic web, as proposed by “Father of the Web” Tim Berners-Lee more than ten years ago?

Join us for a lively panel session where some of the best emerging companies in the semantic web space present their different approaches to realizing the vision. The panel will address questions such as: How can we best implement the vision of the semantic web? What will we do with the web once it is structured with semantic information? What new applications will appear? Where is the consumer value and how should it be marketed? What new businesses can be built on top of the semantic web that are not possible today? Will the semantic web ultimately bring about a new intelligence that surpasses that of humanity, sparking a new era of non-biological evolution?

Join us and bring questions of your own – help us uncover the future of the web!

Cost: $30 pre-registered; $40 at the door

More info:

October 8, 2007

Jimmy Wales on “Free Culture and the Future of Search”

Filed under: Collective wisdom, Information Retrieval, Open Data, Search — chucklam @ 2:10 am

Just got word that Jimmy Wales, founder of Wikipedia and Wikia, will be speaking at The Stanford Law and Technology Association this Thursday (Oct. 11) on the topic of “Free Culture and the Future of Search.” The talk will be in Room 190, Stanford Law School, at 12:45pm. Worth checking out if you’re in the area.

August 15, 2007

Chinese web encyclopedia

Filed under: Collective wisdom, Open Data — chucklam @ 4:26 am

PC Advisor has an article about Wikipedia accusing Baidu Baike, the user-generated Chinese web encyclopedia, of copyright violations. While the article focuses on the accusation that some of Baidu Baike’s content was copied from Wikipedia without attribution, I found the description of Baidu Baike itself quite interesting, as I wasn’t aware of the site’s existence until now.

Baidu Baike [is] the largest online Chinese-language encyclopedia. [It] contains more articles than any Wikipedia except the English-language Wikipedia. Baidu Baike boasted 809,237 entries as of Sunday, edging out the German edition of Wikipedia, which has 619,612 entries, for second place.

And Baidu Baike was able to get to its size in spite of extra hurdles for contributors.

Anyone wishing to publish entries on Baidu Baike must register first, giving the site people’s real names, and site administrators review all entries before posting, a way to ensure compliance with Chinese censorship laws.

The copyright accusation, if true, would create some philosophical dilemmas for Wikipedia. Since the Chinese version of Wikipedia is blocked in China, Baidu Baike can in fact be considered as a proxy for people to access Wikipedia’s content. If their disagreement is not successfully resolved, then hypothetically Wikipedia may have to choose between protecting its copyright license and promoting easy access to its content for the people of China.

However, Wikipedia may in fact be too toothless to take action any which way. It has to rely on its moral authority, as its legal power is quite weak.

“The foundation does not hold a copyright on the articles, the editors or the authors do, so there is very little we can do,” said Nibart-Devouard [, chair of the Board of Trustees at the Wikimedia Foundation.]

July 5, 2007

Crowd’s wisdom in South Korea

Filed under: Collective wisdom, Search — chucklam @ 5:21 pm

The International Herald Tribune has an article on Naver called Crowd’s wisdom helps South Korean search engine beat Google and Yahoo. I had an earlier post, Culture and language on the Web, that talked about Naver. It’s great to see local rivals outwitting and out-innovating international giants. At least Yahoo is learning its lesson and is taking its loss as inspiration for new products. Google is just failing miserably at 1.7% market share in South Korea.

June 16, 2007

Collective intelligence in Word 2007 spell checker

Filed under: Collective wisdom, Data Collection, Open Data — chucklam @ 5:30 pm

Via Gregor Hochmuth thru O’Reilly Radar, pointing out the use of collective intelligence in improving Word 2007’s spell checker:

I thought you might enjoy this: When I was closing Word 2007 today, I was surprised to see the attached dialog pop-up. Microsoft’s new spell checker asked me whether it could transmit certain unknown words and phrases that I used in the last several weeks.

Among them are choice examples like Wikipedia, Gladwell, shortcode and others, words that were certainly not in the original distributable. I assume Microsoft will re-distribute the most frequently submitted words in an upcoming spell checker update. Brilliant! And it reminds me of the way in which Google first introduced its “Did you mean…?” feature, by tracking how users corrected their own spelling mistakes before re-trying a search.

Tim O’Reilly notes that a distinguishing feature of Web 2.0 apps (“Live Software”) is that they improve organically when more people use them more often. That is, the apps are architected to grow on usage data. This is certainly an idea I’ll have more to say about in the future.
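As a toy illustration of software that grows on usage data, here is how one might aggregate submitted unknown words and promote the ones reported by enough users. This is purely a guess at the general idea, not how Microsoft does it; the function name `words_to_add`, the `min_users` threshold and the sample data are all made up.

```python
from collections import Counter
from typing import Iterable, List, Set


def words_to_add(
    submissions: Iterable[Iterable[str]], dictionary: Set[str], min_users: int
) -> List[str]:
    """Return unknown words reported by at least `min_users` different users."""
    counts: Counter = Counter()
    for user_words in submissions:
        # Count each word once per user so one prolific writer cannot dominate.
        counts.update(set(user_words) - dictionary)
    return sorted(word for word, n in counts.items() if n >= min_users)


# Hypothetical submissions from three users.
submissions = [
    ["Wikipedia", "Gladwell", "teh"],
    ["Wikipedia", "shortcode"],
    ["Wikipedia", "Gladwell", "Gladwell"],
]
print(words_to_add(submissions, dictionary={"the"}, min_users=2))
# → ['Gladwell', 'Wikipedia']
```

Requiring several independent users before a word is accepted is one simple way to keep one-off typos like "teh" out of the shared dictionary.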
