January 20, 2008

Datamining voters

Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine that data to improve sales and profitability. It is much less publicized, however, that politicians also collect and mine voter information to help win elections.

A recent article in Vanity Fair called “Big Brother Inc.” digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”

“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”

The article also makes an interesting point about hardball tactics in politics: although lists of registered voters are technically public information, they are sometimes not easy to obtain. Local party bosses can make it difficult or expensive for opposing candidates to get hold of those lists, which puts those opponents at a disadvantage in getting their message out to the right audience.

January 17, 2008

The Google Online Marketing Challenge

Google is encouraging universities to teach their students “online” marketing (i.e. how to use AdWords) by hosting an Online Marketing Challenge:

Here’s how the Challenge works: Your students will receive US$ 200 in Google ads to drive traffic to a business website of their choosing. Students will compete with groups from their institution along with student teams from all over the world.

This is not a simulation; students gain real-world experience with a real client. You have a great student project; clients gain free advertising and Internet consulting. Google provides US $200 in vouchers, teaching materials and other resources.

Encryption versus the Fifth Amendment

As usual, Nick Carr has spotted some interesting news:

As the Washington Post reports today, the encryption conflict is now coming to a head. [Sebastien Boucher], accused of storing child pornography on his computer, has refused to provide police with the password required to unlock the encrypted files on his hard drive. He claims that disclosing the password would violate his Fifth Amendment right to avoid self-incrimination. A judge backed his claim, and the government is now appealing that ruling in the federal courts.

The judge’s reasoning, as cited in the original Washington Post article, is quite fascinating:

In his ruling, [Judge] Niedermeier said forcing Boucher to enter his password would be like asking him to reveal the combination to a safe. The government can force a person to give up the key to a safe because a key is physical, not in a person’s mind. But a person cannot be compelled to give up a safe combination because that would “convey the contents of one’s mind,” which is a “testimonial” act protected by the Fifth Amendment, Niedermeier said.

The judge also said that “If Boucher does know the password, he would be faced with the forbidden trilemma: incriminate himself, lie under oath, or find himself in contempt of court.”

I’ve heard of many dilemmas, but trilemma… that’s a new one for me.

January 15, 2008

Jon Kleinberg to speak at Stanford tomorrow

Date & time: Wednesday, January 16 4:35-5:45 pm
Location: 380-380C
Speaker: Jon Kleinberg, Cornell University

Title: Computational Perspectives on Large-Scale Social Network Data


The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.

We consider two lines of research within this general theme. The first is concerned with modeling the flow of information through a large network: the spread of new ideas, technologies, opinions, fads, and rumors can be viewed as unfolding with the dynamics of an epidemic, cascading from one individual to another through the network. This suggests a basis for computational models of such phenomena, with the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving.

The second line of research we consider is concerned with the privacy implications of large network datasets. An increasing amount of social network research focuses on datasets obtained by measuring the interactions among individuals who have strong expectations of privacy. To preserve privacy in such instances, the datasets are typically anonymized — the names are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed. Unfortunately, there are fundamental limitations on the power of network anonymization to preserve privacy; we will discuss some of these limitations (formulated in joint work with Lars Backstrom and Cynthia Dwork) and some of their broader implications.
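To make the anonymization step in the abstract concrete, here is a minimal Python sketch of my own; it is not taken from the talk or from the joint work it mentions. The function names (`anonymize_edges`, `degree_sequence`) and the tiny friendship network are hypothetical, made up for illustration. Names are swapped for meaningless integer identifiers while every edge is kept.

```python
import random


def anonymize_edges(edges, seed=None):
    """Replace each name with a meaningless integer ID, keeping every edge.

    Hypothetical helper for illustration only.
    """
    names = sorted({name for edge in edges for name in edge})
    rng = random.Random(seed)
    rng.shuffle(names)  # shuffle so the IDs do not reveal alphabetical order
    ids = {name: i for i, name in enumerate(names)}
    return [(ids[a], ids[b]) for a, b in edges], ids


def degree_sequence(edges):
    """Return the sorted list of node degrees for an undirected edge list."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree.values())


# Made-up example network.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]
anon_edges, mapping = anonymize_edges(edges, seed=0)

# The names are gone, but the structure is untouched.
assert degree_sequence(edges) == degree_sequence(anon_edges)
```

The final assertion shows the flip side of "the network structure is maintained": structural features such as each node's number of connections survive the renaming unchanged.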

Speaker bio:

Jon Kleinberg is a Professor in the Department of Computer Science at Cornell University. His research interests are centered around issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a Fellow of the American Academy of Arts and Sciences, and the recipient of MacArthur, Packard, and Sloan Foundation Fellowships, the Nevanlinna Prize from the International Mathematical Union, and the National Academy of Sciences Award for Initiatives in Research.

A new site for people who love large data sets

Aaron Swartz just announced a new Web site he created for people who love large data sets, “the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.”

The site is very spartan right now, but it can certainly become very interesting if it attracts the right contributors. I hope he succeeds in building a community around it.

MetaWeb receives $42M investment

VentureBeat just reported that MetaWeb Technologies received a $42.4M second-round investment. I haven’t had time to fully play around with its Freebase database yet, but the company certainly seems to be building a war chest for something.

January 12, 2008

The piracy root of Hollywood

Via a post by Matt Mason on TorrentFreak:

[Thomas] Edison… went on to invent filmmaking, and demanded a licensing fee from those making movies with his technology. This caused a band of filmmaking pirates, including a man named William, to flee New York for the then still wild West, where they thrived, unlicensed, until Edison’s patents expired. These pirates continue to operate there, albeit legally now, in the town they founded: Hollywood. William’s last name? Fox.

January 5, 2008

How ‘free’ should data be?

I just read an article in Wired called “Should Web Giants Let Startups Use the Information They Have About You?” The article is really more about the issues surrounding Web scraping and data APIs than about what the title suggests. (There’s a lot more data out there than just personal information.)

There are many ways to make data more valuable. One can aggregate it. One can add structure or metadata to it. One can filter it. One can join or link different data types together. It’s unrealistic to think that any one organization has extracted all possible value from any particular data set. Someone can always come along and process the data further to generate additional value.
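As a rough illustration of the operations listed above, here is a small Python sketch. The sales records and the store-to-city lookup are invented for this example and do not come from the Wired article; the point is only that each step derives something the raw records did not state directly.

```python
from collections import defaultdict

# Hypothetical records, invented purely for illustration.
sales = [
    {"store": "A", "item": "book", "amount": 12.0},
    {"store": "A", "item": "pen", "amount": 3.0},
    {"store": "B", "item": "book", "amount": 15.0},
]
stores = {"A": "Palo Alto", "B": "San Jose"}  # a second, separate data set

# Aggregate: total sales per store.
totals = defaultdict(float)
for row in sales:
    totals[row["store"]] += row["amount"]

# Filter: keep only the book sales.
books = [row for row in sales if row["item"] == "book"]

# Join/link: attach each store's city to every sale.
joined = [{**row, "city": stores[row["store"]]} for row in sales]
```

Each result (`totals`, `books`, `joined`) could itself be the input to someone else's further processing, which is the sense in which value can keep being added downstream.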

Now, if that’s the case, then it’s economically inefficient (for society in general) for an organization to forbid further use of its data. Unfortunately, the debate today tends to be polarized into the free-use and no-use camps. One hopes that some kind of market mechanism will emerge as a reasonable middle ground. Today we know so little about the dynamics and economics of data that it’s not clear what that market will look like or what rules are needed to keep it functioning healthily, but these are issues we need to address rather than throwing around rhetoric about ‘freedom’.

