Data Strategy

August 30, 2007

Analyzing microblogging

Filed under: Datamining, People and Data, Visualization — chucklam @ 2:40 pm

Akshay Java and colleagues at University of Maryland and NEC Laboratories presented a paper at KDD called Why We Twitter: Understanding Microblogging Usage and Communities. In the paper they analyzed the social graph of Twitter and all the public postings for two months from April 1 to May 30 of this year, so the data is very recent. The conclusions are pretty much in line with what we have come to expect from social networks (power law distribution, community clusters, etc.). In the discussion section, they classify Twitter usage into four categories:

  1. Daily chatter (dominant category)
  2. Conversations (about one-eighth of all posts, which start with the @ symbol followed by a username)
  3. Sharing information/URLs (13% of all posts)
  4. Reporting news (often by automated agents)
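Just to make the idea concrete, the surface cues behind the first three categories are simple enough to sketch in a few lines of Python. The function name and regular expressions here are mine, not the paper's, and news posted by automated agents can't be detected from the text alone, so this sketch falls back to daily chatter for everything else:

```python
import re

def classify_post(text):
    """Classify a post using the surface cues from the paper's categories:
    posts starting with @username are conversations, posts containing a
    URL are information/URL sharing, and everything else is treated as
    daily chatter (automated news reporting isn't detectable this way)."""
    if re.match(r'^@\w+', text):
        return 'conversation'
    if re.search(r'https?://\S+', text):
        return 'url_sharing'
    return 'daily_chatter'
```

For example, `classify_post('@alice good point')` comes back as a conversation, while a post containing a link counts as URL sharing.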

The interesting part to me was really how they collected the data. Apparently the whole Twitter network can be obtained from its API. It’s usually pretty difficult for researchers to get access to large-scale social graphs, so it’s nice to know that the Twitter social graph is easy to get. The network is smaller than I had expected though. Its total user base (at the end of May) is a little less than 88,000. Twitter seems to get far more attention than its user numbers would suggest. In addition, the authors of the paper stated that “After an initial period of interest around March 2007, the rate at which new users are joining Twitter has slowed.”
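Collecting a whole social graph this way is essentially a breadth-first crawl over the API. Here's a rough sketch; `fetch_followers` stands in for whatever API call returns a user's followers (the exact endpoint is an assumption on my part, not something from the paper):

```python
from collections import deque

def crawl_graph(seed_users, fetch_followers):
    """Breadth-first crawl of a social graph.

    `fetch_followers` is a stand-in for an API call that returns the
    followers of a given user. Returns the list of directed edges
    (follower, followee) discovered from the seed users outward."""
    seen = set(seed_users)
    queue = deque(seed_users)
    edges = []
    while queue:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((follower, user))
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return edges
```

In practice you'd also need rate limiting and checkpointing, but the core loop really is this small, which is presumably why a complete crawl was feasible for the authors.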

August 2, 2007

MIT’s SIMILE Project: enriching internet information collections

Filed under: Data Collection, Visualization — chucklam @ 11:18 am

MIT’s SIMILE project is a set of tools for creating, managing, and reusing information collections. A recent interview with David Karger, SIMILE Project’s principal investigator, in Dr. Dobb’s Journal helps explain the project’s significance.

The Web has been incredibly successful at making huge amounts of new information available to many people. But it still has a long way to go in depth and breadth. Regarding depth, there’s plenty of awareness of the “deep web” — stuff that doesn’t show up on the web search engines because it is buried in special-purpose databases. We think some of our tools can help bring that information to light. As for breadth, while the Web has made it much easier for people to contribute textual information through tools like blogs and wikis, it’s still not really possible for the lay person to contribute rich structured information collections. We think our tools can dramatically lower the barriers for a broader group of contributors to share the rich structured content they know.

David nailed two of the biggest data gaps on the Web today, and I couldn’t explain them any better. It’s a bit unfortunate that SIMILE’s web site doesn’t explain things as clearly and tends toward academic CS vernacular (about the Semantic Web and RDF and such). After all, it’s the weekend website developer who can really use these tools. Consider an example David mentions:

One of the tools we’re currently working on is called “Exhibit.” This is a tool that lets anyone take a collection of anything they care about and put it on the web as a rich, interactive, web-2.0 style site without doing any programming. All you do is put up a file containing your collection and a web page describing how you want it to look. The result may be pretty much what you’d expect of a web 2.0 site these days — until you realize that it avoids the whole team of database engineers and 3-tier web application developers, and lets you do it all yourself!
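The "file containing your collection" David mentions is, as I understand it, just a flat JSON list of items with free-form properties. A sketch of generating one with Python (the item fields here are invented for illustration; check the Exhibit documentation for the exact format it expects):

```python
import json

# A minimal Exhibit-style data file: a flat list of items, each with
# whatever properties you care about. No schema, no database.
collection = {
    "items": [
        {"label": "Why We Twitter", "type": "Paper", "year": "2007"},
        {"label": "Exhibit", "type": "Tool", "year": "2007"},
    ]
}

# Write the collection out; the accompanying HTML page then points
# Exhibit at this file to render the faceted, interactive view.
with open("collection.json", "w") as f:
    json.dump(collection, f, indent=2)
```

The point is how low the bar is: if you can write a list like that, you have the "database" half of the site done.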

July 31, 2007

Mashups getting attention from mainstream news media

Filed under: Data Collection, Open Data, Visualization — chucklam @ 3:50 am

I’ve come across two articles in the mainstream news media in less than a week on mashups. The New York Times has an article With Tools on Web, Amateurs Reshape Mapmaking that focuses on mapping mashups. The Wall Street Journal today has ‘Mashups’ Sew Data Together (subscription required).

It certainly seems like there are enough useful mashups already that the mainstream press is taking notice of the concept. However, mashups are really a fuzzy collection of three related applications: data visualization, data integration, and data collection. The unifying concept is that mashups do all three through lightweight Web interfaces.

The majority of existing mashups don’t do much more than data visualization, usually showing location-related information on a map. This doesn’t involve any data integration, as it simply takes data from one source and feeds it to a visualization (i.e. mapping) engine.

Some mashups are starting to take data from more than one Web source (or even just scraping from different Web sites) and create new applications by integrating those data. Microformats are targeted at exactly this kind of mashup. As with traditional data integration projects, deep integration is usually thwarted by incompatible data definitions. The main hope is that lightweight integration, together with the large number of data sources, will produce a lot of mashups. Even though these mashups are not too sophisticated individually, collectively they will generate a lot of value. Think of the lightweight Web interfaces as the duct tape of data.
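To show what I mean by lightweight integration, here's a sketch in Python. The two "sources" and their field names are made up, but the shape of the problem is typical: the same kind of record, described with different field names, mapped onto a shared vocabulary:

```python
def integrate(records, field_map):
    """Normalize records from one source into a shared schema.

    `field_map` maps the source's field names onto common ones. This
    renaming is often all the "integration" a mashup really does;
    deeper mismatches (units, identifiers) still need manual work."""
    return [
        {common: rec[src] for src, common in field_map.items() if src in rec}
        for rec in records
    ]

# Two hypothetical sources describing places with different field names.
site_a = [{"lat": 37.77, "lng": -122.42, "title": "Cafe"}]
site_b = [{"latitude": 37.80, "longitude": -122.27, "name": "Park"}]

merged = (
    integrate(site_a, {"lat": "lat", "lng": "lon", "title": "name"})
    + integrate(site_b, {"latitude": "lat", "longitude": "lon", "name": "name"})
)
```

Once both sources are in the shared schema, the merged list can be fed straight to a mapping engine, which is the duct-tape quality I'm getting at.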

Finally, mashups are also about data collection. That’s never stated as any mashup’s primary goal but is often a byproduct of the useful ones. The mapping APIs, for example, encourage the creation of many geographically-related data collections, as the NYT article points out. People are willing to contribute data, but they do want to see immediate gratification from such contributions, and good mashups can show them that. Furthermore, as operating systems win by building an ecosystem of applications around them, Web APIs depend on an ecosystem of data sources to turn them into platforms. I’d encourage anyone developing mashup APIs to think deeply about how people can potentially create data collections on top of their interfaces. An incoherent set of APIs doesn’t make a platform, just like a collection of device drivers doesn’t make an operating system.

June 21, 2007

Tools for visualizing and analyzing data collaboratively

Filed under: Datamining, Visualization — chucklam @ 12:40 am

Most data analysis, especially in the business world, is done through graphs: People load up data into Excel and graph them, looking for interesting and unusual patterns. Swivel is a web site that focuses on doing just that kind of data analysis online. Unlike Excel, it leverages some of the power of the web. For example, people can leave comments on each graph. People can digg the ones they like. Data with a geographic component can be plotted onto online maps. Etc.

However, what really got me excited today about this topic was seeing something else: a screencast of a new tool (screencast here and homepage here). It’s always hard to describe visualization tools, so I suggest you just check out the screencast to see what it is.

If you have ever done collaborative data analysis, you’ll know the following drill. You have some data, see some patterns in it after a little exploration, and you create some summary graphs that highlight those patterns. You show those graphs around to either convince your colleagues of your hypothesis or to solicit some ideas. As always, other people will offer alternate hypotheses, give you contextual information that you didn’t know before (“A bug was on that sensor…”), raise new questions, and some may even want to look at the data themselves. So you go back to the data, re-do the graphs/analysis based on the feedback, and iterate the process. This can be time consuming. It’s kind of like collaborating by emailing a document around.

What’s needed is a data graphing tool that’s designed from the ground up to support the collaborative analysis of data, and that’s exactly what this tool is. It’s a web-based tool that lets you generate new graphs quickly and interactively. More importantly, it supports collaboration by using well-known web tools such as bookmarking, commenting, and linking. To put it another way, it is useful for data analysis the way version control is useful for programming.

The tool’s developer is Jeffrey Heer, a Ph.D. candidate at UC Berkeley. He’ll be talking about the system in a Yahoo Brain Jam session in Berkeley on June 29. More info here. I don’t know if I’ll have time to drive up to Berkeley for the talk yet, but we’ll see…
