Data Strategy

November 30, 2008

Have a big, open dataset? Make it available on Amazon!

Filed under: Data Collection — chucklam @ 6:18 pm

I was using the Amazon cloud services and came across a new feature that they haven’t publicized much yet: AWS Hosted Public Data Sets. Amazon hosts public datasets for free, and any AWS EC2 instance can access them. Right now they have some datasets provided by the US Census Bureau and the Bureau of Labor Statistics. An exciting one they’ll be adding soon is the annotated human genome. If you have a large dataset in the public domain, scroll to the bottom of this page to let Amazon know that you want to contribute. In fact, you can already publish your dataset to Amazon’s EBS storage yourself and just make it public. The downside is that you’ll have to pay for the hosting cost, which is strictly dependent on the size of your data but is generally cheap. You’ll also have to promote the dataset’s availability yourself, as it won’t be listed on Amazon’s Web site.

What I wish to see in the short term is various government agencies making their data available on Amazon. For example, the USPTO should release all its patent data. These data are already paid for by the taxpayers and should be freely available to the public. If you know anyone working in government agencies, tell them to look into this Amazon service.

Of course, such datasets should also just be available openly on the Web, so users don’t have to pay Amazon just to get access to the data. Storage is cheap and downloading over the Web is easy. I’m still amazed by some organizations that insist on sending their data on DVDs (yes… I’m thinking of the USPTO again…)

On a more philosophical level, I’m thinking about how Amazon’s business has expanded from selling books to publishing books and now they’re publishing data too.


January 20, 2008

Datamining voters

Filed under: Data Collection, Datamining, People and Data, Privacy — chucklam @ 11:33 pm

Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine such data to improve sales and profitability. However, it’s much less publicized how politicians also collect and mine voter information to help win elections.

A recent article in Vanity Fair called Big Brother Inc. digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”

“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”

The article also makes an interesting point about hardball tactics in politics: although lists of registered voters are technically public information, they are sometimes not easy to obtain. Local party bosses can make it difficult or expensive for opposing candidates to get hold of those lists, which puts the opponents at a disadvantage in getting their message out to the right audience.

January 15, 2008

For people who love large data sets

Filed under: Data Collection, Datamining, Infrastructure — chucklam @ 4:50 pm

Aaron Swartz just announced a new Web site he created for people who love large data sets, “the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.”

The site is very spartan right now, but it can certainly become very interesting if it attracts the right contributors. I hope he succeeds in building a community around it.

MetaWeb receives $42M investment

Filed under: Data Collection, People and Data, Search — chucklam @ 2:45 am

VentureBeat just reported that MetaWeb Technologies received a $42.4M second round investment. I haven’t had time to fully play around with its Freebase database yet, but they certainly seem to be building a war chest for something.

January 5, 2008

How ‘free’ should data be?

Filed under: Data Collection, Open Data, People and Data, Privacy — chucklam @ 2:25 pm

I just read an article in Wired called “Should Web Giants Let Startups Use the Information They Have About You?” The article is really more about issues surrounding Web scraping and data APIs than what the title suggests. (There’s a lot more data out there than just personal information.)

There are many ways to make data more valuable. One can aggregate them. One can add structure/metadata to them. One can filter them. One can join/link different data types together. It’s unrealistic to think that any one organization has gotten all possible value out of any particular data. Someone can always come along and process the data further to generate additional value.

Now, if that’s the case, then it’s economically inefficient (for society in general) if an organization forbids further use of its data. Unfortunately, the debate today tends to be polarized into the free use versus no use camps. One hopes that some kind of market mechanism will emerge as a reasonable middle ground. Today we know so little about the dynamics and the economics of data that it’s not clear what that market will look like and what rules are needed to keep it functioning healthily, but these are issues we need to address rather than throwing around rhetoric about ‘freedom’.
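To make the four operations mentioned above (aggregating, adding structure, filtering, joining) a bit more concrete, here is a minimal Python sketch. The records, company names, and field names are all invented for illustration; they do not come from any real dataset.

```python
from collections import Counter

# Hypothetical toy records, invented purely for illustration.
patents = [
    {"id": 1, "assignee": "Acme", "year": 2006},
    {"id": 2, "assignee": "Acme", "year": 2007},
    {"id": 3, "assignee": "Globex", "year": 2007},
]
# A second, separately collected (and equally hypothetical) dataset.
company_state = {"Acme": "Ohio", "Globex": "Texas"}

# Filter: keep only the records from 2007.
recent = [p for p in patents if p["year"] == 2007]

# Aggregate: count records per assignee.
counts = Counter(p["assignee"] for p in patents)

# Join / add metadata: attach the state from the second dataset to each record.
enriched = [{**p, "state": company_state.get(p["assignee"])} for p in patents]
```

Each step produces something the original holder of either dataset may never have bothered to compute, which is the sense in which further processing can add value.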

November 14, 2007

reCAPTCHA gone awry…

Filed under: Data Collection, Pattern recognition, People and Data — chucklam @ 12:33 am

I think reCAPTCHA is a very clever idea to layer data collection on top of an authentication system. However, sometimes the security check is just a bit too puzzling. I came across this today on Facebook. How am I supposed to type the answer in?? 😉


November 5, 2007

Panel on Web 3.0

Filed under: Collective wisdom, Data Collection, People and Data — chucklam @ 1:04 am

Not sure if I’ll have time to go to this, but this seems like the one event to figure out what “Web 3.0” is about.

The MIT/Stanford Venture Lab presents:
“Web 3.0: New Opportunities on the Semantic Web”

Date: Tuesday, November 20
Time: 6:00 PM
Location: Bishop Auditorium, Stanford University

* Robert Cook, Co-founder and Executive VP of Product Development, Metaweb
* Nova Spivack, CEO and Founder, Radar Networks
* Alex Iskold, CEO and Founder, Adaptive Blue
* Paul Kedrosky, Venture Partner, Ventures West

We are well into the current era of the Web, commonly referred to as Web 2.0. What lies on the horizon? Will Web 3.0 usher in the long awaited vision of the semantic web, as proposed by “Father of the Web” Tim Berners-Lee more than ten years ago?

Join us for a lively panel session where some of the best emerging companies in the semantic web space present their different approaches to realizing the vision. The panel will address questions such as: How can we best implement the vision of the semantic web? What will we do with the web once it is structured with semantic information? What new applications will appear? Where is the consumer value and how should it be marketed? What new businesses can be built on top of the semantic web that are not possible today? Will the semantic web ultimately bring about a new intelligence that surpasses that of humanity, sparking a new era of non-biological evolution?

Join us and bring questions of your own – help us uncover the future of the web!

Cost: $30 pre-registered; $40 at the door


October 18, 2007

How Acxiom uses offline data for behavioral targeting

Filed under: Advertising, Data Collection, Datamining, People and Data, Personalization — chucklam @ 12:37 am

A really fascinating piece at the WSJ yesterday: Firm Mines Offline Data To Target Online Ads (subscription req.). There’s a particular side-bar that reveals how Acxiom uses offline data for behavioral targeting:

How Acxiom delivers personalized online ads:

  1. Acxiom has accumulated a database of about 133 million households and divided it into 70 demographic and lifestyle clusters based on information available from… public sources.
  2. A person gives one of Acxiom’s Web partners his address by buying something, filling out a survey or completing a contest form on one of the sites.
  3. In an eyeblink, Acxiom checks the address against its database and places a “cookie,” or small piece of tracking software, embedded with a code for that person’s demographic and behavioral cluster on his computer hard drive.
  4. When the person visits an Acxiom partner site in the future, Acxiom can use that code to determine which ads to show…
  5. Through another cookie, Acxiom tracks what consumers do on partner Web sites…

It’s an interesting approach. While Acxiom has offline information on “133 million households,” it’s not clear how many households actually have gotten the offline-tracking cookie.
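The sidebar’s steps can be sketched roughly in Python. This is only a toy illustration of the idea as the WSJ describes it, not Acxiom’s actual system: the lookup table, cluster codes, ad names, and function names are all hypothetical.

```python
from typing import Optional

# Hypothetical address -> cluster table (step 1). The real database is not public.
HOUSEHOLD_CLUSTERS = {
    "123 main st, springfield": 17,
    "456 oak ave, riverside": 42,
}

# Hypothetical mapping from cluster code to the ad shown (step 4).
ADS_BY_CLUSTER = {17: "minivan ad", 42: "retirement fund ad"}
DEFAULT_AD = "generic ad"


def normalize(address: str) -> str:
    """Lowercase and collapse whitespace so lookups tolerate minor formatting differences."""
    return " ".join(address.lower().split())


def cookie_for_address(address: str) -> Optional[dict]:
    """Steps 2-3: match a submitted address and build a cookie carrying only the cluster code."""
    cluster = HOUSEHOLD_CLUSTERS.get(normalize(address))
    if cluster is None:
        return None  # unknown household: no cookie is set
    return {"cluster": cluster}


def choose_ad(cookie: Optional[dict]) -> str:
    """Step 4: on a later visit, pick an ad based on the cluster code in the cookie."""
    if not cookie:
        return DEFAULT_AD
    return ADS_BY_CLUSTER.get(cookie["cluster"], DEFAULT_AD)


cookie = cookie_for_address("123  Main St,  Springfield")
ad = choose_ad(cookie)
```

Note that in this sketch the cookie stores only a cluster code, not the address itself, which matches the sidebar’s description of a cookie “embedded with a code.”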

One can imagine Google really taking advantage of an approach like this. The widespread use of Google Analytics across Web sites already gives Google the potential to track your surfing habits. Specifying your home address when you use Google Maps or Checkout allows Google to match you with offline marketing databases. And we haven’t even talked about your query data and your emails yet…

October 2, 2007

Map data company Navteq bought for $8.1B

Filed under: Data Collection — chucklam @ 12:55 am

Nokia has bought map data company Navteq for $8.1 billion, a valuation of about 14 times its 2006 revenue of $582 million. A NYT article says the deal signals a “strategic bet” for Nokia. Both Google and Yahoo buy map data from Navteq to power their mapping services. The acquisition’s high price-to-revenue ratio is another demonstration that strategic collection of data is highly valuable. I wrote earlier on how speech recognition company Nuance’s high valuation was also due to its collection of (speech) data.

Speaking of map data, I found an interesting bit of trivia when I started reading Stuart Skorman’s Confessions of a Serial Entrepreneur over the weekend. Apparently the makers of printed maps (back in the day…) worried about other printers simply copying the maps that they had spent so much effort creating. Their response was to put fictitious towns on their maps. Seeing those fictitious towns in a competitor’s map was a sure sign of copyright infringement.
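The same trick carries over to digital datasets. A minimal Python sketch, with made-up town names and a hypothetical helper function, might look like this:

```python
def copied_trap_entries(our_traps, competitor_towns):
    """Return our fictitious towns that also appear in a competitor's list, sorted by name."""
    return sorted(set(our_traps) & set(competitor_towns))


# Invented names for illustration; the fictitious towns exist only in our own map.
our_traps = {"Fakeville", "Nowhere Springs"}
competitor_map = ["Palo Alto", "Fakeville", "Menlo Park"]

suspicious = copied_trap_entries(our_traps, competitor_map)
```

Any non-empty result is hard to explain except by copying, since the competitor could not have surveyed a town that does not exist.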

Update: The WSJ (sub. required) noted that the acquisition price is 50 times Navteq’s estimated earnings for this year.
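For anyone who wants to check the multiples, here is the arithmetic in Python. The price and revenue figures are the ones quoted in this post; the implied earnings number is only what the WSJ’s 50x multiple would suggest, not a reported figure.

```python
price = 8.1e9          # acquisition price in dollars
revenue_2006 = 582e6   # Navteq's 2006 revenue in dollars

# Price-to-revenue multiple: roughly 13.9, i.e. "about 14 times".
price_to_revenue = price / revenue_2006

# If the price is 50 times this year's estimated earnings,
# the implied earnings estimate is roughly $162 million.
implied_earnings = price / 50
```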

September 21, 2007

Google to enable access to its social graph data

According to a TechCrunch post:

Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.

Not much info yet, but can’t wait…
