Data Strategy

November 30, 2008

Have a big, open dataset? Make it available on Amazon!

Filed under: Data Collection — chucklam @ 6:18 pm

I was using the Amazon cloud services and came across a new feature that they haven’t publicized much yet: AWS Hosted Public Data Sets. Amazon hosts public datasets for free, and any AWS EC2 instance can access them. Right now they have some datasets provided by the US Census Bureau and the Bureau of Labor Statistics. An exciting one they’ll be adding soon is the annotated human genome. If you have a large dataset in the public domain, scroll to the bottom of this page to let Amazon know that you want to contribute. In fact, you can already publish your dataset to Amazon’s EBS storage yourself and just make it public. The downside is that you’ll have to pay the hosting cost, which depends strictly on the size of your data but is generally cheap. You’ll also have to promote the dataset’s availability yourself, as it won’t be listed on Amazon’s Web site.
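For the do-it-yourself route, here is a rough sketch of what "publish to EBS and make it public" could look like in Python. It assumes your dataset already sits on an EBS volume and that you pass in an EC2 client object such as the one returned by boto3's `boto3.client("ec2")`. The function name `publish_volume_snapshot` and the volume ID in the usage note are made up for illustration. Treat this as one possible approach, not Amazon's official procedure.

```python
def publish_volume_snapshot(ec2_client, volume_id, description):
    """Snapshot an EBS volume and mark the snapshot as public.

    ec2_client is assumed to behave like a boto3 EC2 client.
    Returns the ID of the new snapshot.
    """
    # Take a point-in-time snapshot of the volume that holds the dataset.
    snapshot = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=description,
    )
    snapshot_id = snapshot["SnapshotId"]

    # Grant "create volume" permission to the special group "all",
    # which lets any AWS account create its own volume from the snapshot.
    ec2_client.modify_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
        OperationType="add",
        GroupNames=["all"],
    )
    return snapshot_id
```

You would call it as something like `publish_volume_snapshot(boto3.client("ec2"), "vol-0123example", "My open dataset")`, where the volume ID is a placeholder. It may be worth waiting for the snapshot to finish before announcing it, and you will still be the one paying for the snapshot storage.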

What I’d like to see in the short term is various government agencies making their data available on Amazon. For example, the USPTO should release all its patent data. These data are already paid for by taxpayers and should be freely available to the public. If you know anyone working in a government agency, tell them to look into this Amazon service.

Of course, such datasets should also just be available openly on the Web, so users don’t have to pay Amazon just to get access to the data. Storage is cheap and downloading over the Web is easy. I’m still amazed by some organizations that insist on sending their data on DVDs (yes… I’m thinking of the USPTO again…).

On a more philosophical level, I’m thinking about how Amazon’s business has expanded from selling books to publishing books, and now to publishing data too.
