January 5, 2008

How ‘free’ should data be?

I just read an article in Wired called “Should Web Giants Let Startups Use the Information They Have About You?” The article is really more about issues surrounding Web scraping and data APIs than what the title suggests. (There’s a lot more data out there than just personal information.)

There are many ways to make data more valuable. One can aggregate them. One can add structure/metadata to them. One can filter them. One can join/link different data types together. It’s unrealistic to think that any one organization has gotten all possible value out of any particular data. Someone can always come along and process the data further to generate additional value.
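
To make those four operations concrete, here is a minimal sketch in Python. The records, field names, and the "example-site" label are all invented for illustration; nothing here comes from the Wired article or any real data set.

```python
from collections import defaultdict

# Hypothetical listings, invented purely for illustration.
listings = [
    {"city": "Palo Alto", "price": 900, "type": "rental"},
    {"city": "Palo Alto", "price": 1100, "type": "rental"},
    {"city": "San Jose", "price": 700, "type": "rental"},
    {"city": "San Jose", "price": 400000, "type": "sale"},
]

# A second, different kind of data (also hypothetical) to join against.
city_info = {
    "Palo Alto": {"county": "Santa Clara"},
    "San Jose": {"county": "Santa Clara"},
}


def filter_rentals(rows):
    """Filter: keep only the rental listings."""
    return [row for row in rows if row["type"] == "rental"]


def average_price_by_city(rows):
    """Aggregate: average price per city."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["city"]].append(row["price"])
    return {city: sum(values) / len(values) for city, values in prices.items()}


def add_metadata(rows, source):
    """Add structure/metadata: tag each record with where it came from."""
    return [{**row, "source": source} for row in rows]


def join_with_city_info(rows, info):
    """Join/link: attach city-level facts to each listing."""
    return [{**row, **info.get(row["city"], {})} for row in rows]


rentals = filter_rentals(listings)
averages = average_price_by_city(rentals)
enriched = join_with_city_info(add_metadata(rentals, "example-site"), city_info)
```

Each function leaves its input untouched and returns something new, which is the point: whoever holds the original data loses nothing when someone else derives more value from it.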

Now, if that’s the case, then it’s economically inefficient (for society as a whole) for an organization to forbid further use of its data. Unfortunately, the debate today tends to be polarized into the free use versus no use camps. One hopes that some kind of market mechanism will emerge as a reasonable middle ground. Today we know so little about the dynamics and the economics of data that it’s not clear what that market will look like and what rules are needed to keep it functioning healthily, but these are issues we need to address rather than throwing around rhetoric about ‘freedom’.

October 8, 2007

Jimmy Wales on “Free Culture and the Future of Search”

Just got word that Jimmy Wales, founder of Wikipedia and Wikia, will be speaking at The Stanford Law and Technology Association this Thursday (Oct. 11) on the topic of “Free Culture and the Future of Search.” The talk will be in Room 190, Stanford Law School, at 12:45pm. Worth checking out if you’re in the area.

August 15, 2007

Chinese web encyclopedia

PC Advisor has an article about Wikipedia accusing Baidu Baike, the user-generated Chinese web encyclopedia, of copyright violations. While the article focuses on the accusation that some of Baidu Baike’s content was copied from Wikipedia without attribution, I found the description of Baidu Baike quite interesting in itself, as I wasn’t aware of the site’s existence until now.

Baidu Baike [is] the largest online Chinese-language encyclopedia. [It] contains more articles than any Wikipedia except the English-language Wikipedia. Baidu Baike boasted 809,237 entries as of Sunday, edging out the German edition of Wikipedia, which has 619,612 entries, for second place.

And Baidu Baike was able to get to its size in spite of extra hurdles for contributors.

Anyone wishing to publish entries on Baidu Baike must register first, giving the site people’s real names, and site administrators review all entries before posting, a way to ensure compliance with Chinese censorship laws.

The copyright accusation, if true, would create some philosophical dilemmas for Wikipedia. Since the Chinese version of Wikipedia is blocked in China, Baidu Baike can in fact be considered as a proxy for people to access Wikipedia’s content. If their disagreement is not successfully resolved, then hypothetically Wikipedia may have to choose between protecting its copyright license and promoting easy access to its content for the people of China.

However, Wikipedia may in fact be too toothless to take action either way. It has to rely on its moral authority, as its legal power is quite weak.

“The foundation does not hold a copyright on the articles, the editors or the authors do, so there is very little we can do,” said Nibart-Devouard [, chair of the Board of Trustees at the Wikimedia Foundation.]

July 31, 2007

Mashups getting attention from mainstream news media

I’ve come across two articles on mashups in the mainstream news media in less than a week. The New York Times has an article, “With Tools on Web, Amateurs Reshape Mapmaking,” that focuses on mapping mashups. The Wall Street Journal today has “‘Mashups’ Sew Data Together” (subscription required).

It certainly seems like there are enough useful mashups already that the mainstream press is taking notice of the concept. However, mashups are really a fuzzy collection of three related applications: data visualization, data integration, and data collection. The unifying concept is that mashups do all three using lightweight Web interfaces.

The majority of existing mashups don’t do much more than data visualization, usually showing location-related information on a map. This doesn’t involve any data integration as it simply takes data from one source and feeds them to a visualization (i.e. mapping) engine.

Some mashups are starting to take data from more than one Web source (or even just scraping from different Web sites) and create new applications by integrating those data. Microformats are targeted at exactly this kind of mashup. As with traditional data integration projects, deep integration is usually thwarted by incompatible data definitions. The main hope is that lightweight integration, together with the large number of data sources, will produce a lot of mashups. Even though these mashups are not too sophisticated individually, collectively they will generate a lot of value. Think of the lightweight Web interfaces as the duct tape of data.
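
As a rough illustration of that duct-tape style of integration, the sketch below joins two made-up sources that describe the same places using different field names and inconsistent spellings. The sources, fields, and matching rule are all assumptions for the sake of the example; real feeds would be messier, and a crude name match like this one misses many true matches.

```python
# Two hypothetical Web sources with incompatible data definitions:
# one calls the place "name", the other "title", and they disagree on casing.
reviews = [
    {"name": "Cafe Roma ", "stars": 4.5},
    {"name": "blue door", "stars": 3.0},
]
locations = [
    {"title": "CAFE ROMA", "lat": 37.44, "lng": -122.16},
    {"title": "Green Table", "lat": 37.33, "lng": -121.89},
]


def normalize_name(name):
    """Build a crude join key: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def integrate(reviews, locations):
    """Join reviews to locations on the normalized name.

    Records without a match are simply skipped, which is the
    lightweight (and lossy) part of the approach.
    """
    by_name = {normalize_name(loc["title"]): loc for loc in locations}
    merged = []
    for review in reviews:
        key = normalize_name(review["name"])
        loc = by_name.get(key)
        if loc is None:
            continue
        merged.append({
            "name": key,
            "stars": review["stars"],
            "lat": loc["lat"],
            "lng": loc["lng"],
        })
    return merged


merged = integrate(reviews, locations)
```

Only the one place that appears in both sources survives the join. No schema was negotiated between the two sources, which is why this kind of integration is cheap to build and shallow in what it can do.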

Finally, mashups are also about data collection. That’s never stated as any mashup’s primary goal but is often a byproduct of the useful ones. The mapping APIs, for example, encourage the creation of many geographically-related data collections, as the NYT article points out. People are willing to contribute data, but they do want to see immediate gratification from such contributions, and good mashups can show them that. Furthermore, as operating systems win by building an ecosystem of applications around them, Web APIs depend on an ecosystem of data sources to turn them into platforms. I’d encourage anyone developing mashup APIs to think deeply about how people can potentially create data collections on top of their interfaces. An incoherent set of APIs doesn’t make a platform, just like a collection of device drivers doesn’t make an operating system.

July 12, 2007

Open data astronomy

I started writing this post on open data astronomy some time ago, and damn… I got scooped by Read/WriteWeb today with their article on Galaxy Zoo and other “distributed brain” projects. Galaxy Zoo, like Stardust@Home and Clickworkers, asks volunteers over the Web to label astronomical features (galaxies, moon craters, etc.) on images. These data help astronomers do better analysis. I had actually referenced Clickworkers in my PhD thesis under the Open Mind Initiative.

Also see my previous post on open data genetics.

June 16, 2007

Collective intelligence in Word 2007 spell checker

Via Gregor Hochmuth thru O’Reilly Radar, pointing out the use of collective intelligence in improving Word 2007’s spell checker:

I thought you might enjoy this: When I was closing Word 2007 today, I was surprised to see the attached dialog pop-up. Microsoft’s new spell checker asked me whether it could transmit certain unknown words and phrases that I used in the last several weeks.

Among them are choice examples like Wikipedia, Gladwell, shortcode and others, words that were certainly not in the original distributable. I assume Microsoft will re-distribute the most frequently submitted words in an upcoming spell checker update. Brilliant! And it reminds me of the way in which Google first introduced its “Did you mean…?” feature, by tracking how users corrected their own spelling mistakes before re-trying a search.

Tim O’Reilly notes that a distinguishing feature of Web 2.0 apps (“Live Software”) is that they improve organically when more people use them more often. That is, the apps are architected to grow on usage data. This is certainly an idea I’ll have more to say about in the future.
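
To sketch what “growing on usage data” might look like for a spell checker, here is a small Python example that pools unknown words submitted by different users and picks the ones worth adding to the dictionary. This is only a guess at the general idea; it is not Microsoft’s actual method, and the function name, thresholds, and sample submissions are all hypothetical.

```python
from collections import Counter


def words_to_add(submissions, min_users=2, top_n=3):
    """Pick candidate dictionary words from per-user submissions.

    submissions: one list of unknown words per user.
    min_users:   how many distinct users must submit a word (assumed threshold).
    top_n:       how many words to keep (assumed cutoff).
    """
    counts = Counter()
    for user_words in submissions:
        # Use a set so each user counts at most once per word.
        counts.update({word.lower() for word in user_words})
    frequent = [(word, n) for word, n in counts.items() if n >= min_users]
    # Most widely submitted first; ties broken alphabetically.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in frequent[:top_n]]


# Hypothetical submissions from three users.
submissions = [
    ["Wikipedia", "Gladwell", "shortcode"],
    ["wikipedia", "shortcode", "xyzzy"],
    ["Wikipedia", "Gladwell"],
]
candidates = words_to_add(submissions)
```

Counting distinct users rather than raw occurrences keeps one person’s pet typo from being promoted, and the list gets better simply because more people use the software.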

June 14, 2007

Open data genetics

I had just read about the Personal Genome Project (PGP) a couple days ago, and it’s a really interesting open data project. According to its Wikipedia entry:

The project will publish the genotype (the full DNA sequence of all 46 chromosomes) of the volunteers, along with extensive information about their phenotype: medical records, various measurements, MRI images, etc. All data will be freely available over the Internet, so that researchers can test various hypotheses about the relationship between genotype and phenotype.

The published data will include identifiable information such as the volunteers’ names. The reason for doing so is that they can’t guarantee anonymity anyway when one’s genotype and phenotype are already open. In an interview in Technology Review, the project’s founder, Harvard University’s George Church, said:

We and others have raised concerns about the difficulty of maintaining anonymity [in medical records]. You promise subjects you will make the information anonymous, but it’s becoming increasingly easy to re-identify an individual. This project will hopefully raise consciousness on what we need to do to encourage insurance companies and government and employers to make this safer. This has already been done in some countries, so it’s just a matter of policy.

The first volunteers will be tenured human geneticists, who best understand the risks and benefits of this project. Harvard Medical School’s Institutional Review Board has given the project permission to start, and it sounds like they will review its progress before the project recruits a broader set of volunteers.

