Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine such data to improve sales and profitability. However, it’s much less publicized that politicians also collect and mine voter information to help win elections.
A recent article in Vanity Fair called Big Brother Inc. digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”
“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”
Also, the article makes an interesting point about hardball tactics in politics: although lists of registered voters are technically public information, they are sometimes not easy to obtain. Local party bosses can make it difficult or expensive for opposing candidates to get hold of those lists, putting their political opponents at a disadvantage in getting their message out to the right audience.
Date & time: Wednesday, January 16, 4:35–5:45 pm
Speaker: Jon Kleinberg, Cornell University
Title: Computational Perspectives on Large-Scale Social Network Data
The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.
We consider two lines of research within this general theme. The first is concerned with modeling the flow of information through a large network: the spread of new ideas, technologies, opinions, fads, and rumors can be viewed as unfolding with the dynamics of an epidemic, cascading from one individual to another through the network. This suggests a basis for computational models of such phenomena, with the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving.
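The epidemic-style cascade dynamics described in the abstract can be sketched as a simple independent-cascade simulation, a standard model in this literature. The graph, seed set, and activation probability below are made-up toy values, not anything from the talk itself:

```python
import random

def simulate_cascade(graph, seeds, p=0.1, rng=None):
    """Independent-cascade sketch: each newly activated node gets one
    chance to activate each of its neighbors with probability p."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in active and rng.random() < p:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active

# Toy network: an idea seeded at node 0 spreads along directed edges.
graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
reached = simulate_cascade(graph, seeds=[0], p=0.9)
```

With a high activation probability the idea tends to sweep the whole toy network; lowering `p` makes cascades die out early, which is exactly the kind of threshold behavior these models are used to study.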
The second line of research we consider is concerned with the privacy implications of large network datasets. An increasing amount of social network research focuses on datasets obtained by measuring the interactions among individuals who have strong expectations of privacy. To preserve privacy in such instances, the datasets are typically anonymized — the names are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed. Unfortunately, there are fundamental limitations on the power of network anonymization to preserve privacy; we will discuss some of these limitations (formulated in joint work with Lars Backstrom and Cynthia Dwork) and some of their broader implications.
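As a toy illustration of why structure alone can defeat this kind of anonymization, consider a simplified degree-based attack. This is only a sketch of the general idea, not the actual construction from the Backstrom–Dwork–Kleinberg work, and the names and graph are invented:

```python
def anonymize(graph):
    """Replace names with meaningless integer IDs, keeping structure."""
    ids = {name: i for i, name in enumerate(sorted(graph))}
    return {ids[u]: {ids[v] for v in nbrs} for u, nbrs in graph.items()}

def candidates_by_degree(anon_graph, known_degree):
    """An attacker who knows a target's degree can narrow the
    anonymized IDs down to the nodes with that degree."""
    return {u for u, nbrs in anon_graph.items() if len(nbrs) == known_degree}

social = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice"},
    "carol": {"alice"},
    "dave": {"alice"},
}
anon = anonymize(social)
# alice is the only degree-3 node, so despite the anonymization
# her node is re-identifiable from structure alone.
suspects = candidates_by_degree(anon, known_degree=3)
```

Real attacks exploit much richer structural fingerprints than a single degree, but even this tiny example shows how the preserved network structure can leak identity.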
Jon Kleinberg is a Professor in the Department of Computer Science at Cornell University. His research interests are centered around issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a Fellow of the American Academy of Arts and Sciences, and the recipient of MacArthur, Packard, and Sloan Foundation Fellowships, the Nevanlinna Prize from the International Mathematical Union, and the National Academy of Sciences Award for Initiatives in Research.
I just read an article in Wired called “Should Web Giants Let Startups Use the Information They Have About You?” The article is really more about issues surrounding Web scraping and data APIs than what the title suggests. (There’s a lot more data out there than just personal information.)
There are many ways to make data more valuable. One can aggregate them. One can add structure/metadata to them. One can filter them. One can join/link different data types together. It’s unrealistic to think that any one organization has gotten all possible value out of any particular data. Someone can always come along and process the data further to generate additional value.
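As a rough sketch, the value-adding operations above (aggregate, filter, join/link) might look like this over a toy purchase dataset. All the records and the `profiles` table are invented for illustration:

```python
# Toy purchase records; their "first use" was completing transactions.
purchases = [
    {"user": "u1", "item": "book", "amount": 12.0},
    {"user": "u1", "item": "lamp", "amount": 30.0},
    {"user": "u2", "item": "book", "amount": 12.0},
]
# A hypothetical second dataset to link against.
profiles = {"u1": "TX", "u2": "NY"}

# Aggregate: total spend per user.
spend = {}
for p in purchases:
    spend[p["user"]] = spend.get(p["user"], 0.0) + p["amount"]

# Filter: keep only high-value users.
big_spenders = {u for u, total in spend.items() if total > 20}

# Join/link: enrich the aggregate with a second data source.
spend_by_region = {u: (profiles[u], total) for u, total in spend.items()}
```

Each step produces something the raw transaction log didn’t contain, which is the point: further processing keeps generating additional value.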
Now, if that’s the case, then it’s economically inefficient (for society in general) for an organization to forbid further use of its data. Unfortunately, the debate today tends to be polarized into free-use versus no-use camps. One hopes that some kind of market mechanism will emerge as a reasonable middle ground. Today we know so little about the dynamics and economics of data that it’s not clear what that market will look like or what rules are needed to keep it functioning healthily, but these are issues we need to address rather than obscure with rhetoric about ‘freedom’.
Data mining, when done correctly, can improve understanding and provide insight, but data mining just doesn’t work under stupid assumptions. Check out the following paragraph in a Wall Street Journal blog. Apparently some FBI agents assume hummus sales to be predictive of terrorist activity.
The FBI obtained and mined sales data that San Francisco-area grocery stores collected in 2005 and 2006, according to CQ Politics. The agents were looking for a sudden spike in hummus sales that might indicate an Iranian sleeper cell in the Bay Area. An FBI higher-up killed the program, CQ Politics reports, on the grounds that it might be illegal and that it was, well, just ridiculous.
According to a TechCrunch post:
Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.
Not much info yet, but can’t wait…
John Langford (at Yahoo) has a good post on The Privacy Problem in datamining at his Machine Learning (Theory) blog. The privacy issue is getting a lot of attention in the datamining community lately. In fact, a whole research area on privacy-preserving datamining is emerging, although most results to date have tended to demonstrate how hard it is to guarantee privacy. The negative publicity surrounding datamining has prompted KDnuggets (a newsletter for dataminers) to poll its readers on whether “datamining” has become an inaccurate or misunderstood term for what they do, especially given that a lot of datamining doesn’t deal with data about individuals.
The main issue is really the trade-off between the benefits and the privacy problems of large-scale data collection and retention. Unfortunately, there are so many unintended consequences, both good and bad, from data collection that it’s hard to even fully discuss the pros and cons. People who’ve worked with data know the positive potential of finding new uses for data originally collected for other purposes. On the other hand, even well-intentioned efforts, such as AOL’s release of query data, are problematic if they are not handled carefully. Of course, it doesn’t help that some government datamining efforts are just outright bad ideas to begin with.
Read/WriteWeb has a couple of interesting posts on Google’s personalized search. One is a primer written by Greg Linden. In the primer, Greg explains what personalized search is, the motivation for it, Google’s personalization technology from its Kaltix acquisition, and other approaches to personalization. I agree with one of the commenters that Kaltix’s technology probably plays only a small role in Google’s personalized search today, as that technology was invented many years ago.
The other post on Read/WriteWeb discusses the findings of a poll in which R/WW asked readers how effective Google’s personalized search is. Their conclusion is that it is quite underwhelming: 48% “haven’t noticed any difference,” and although 12% claim their “search results have definitely improved,” 9% think their search results “have gotten worse.”
I would venture to guess that R/WW readers are more Web savvy and use search engines more often than the average person, so Google would have a lot more data about them and would give them the most heavily personalized results. (Yes, it’s only a conjecture. They may in fact be so savvy that they use anonymizing proxies, multiple search engines, etc.) If that’s the case, then even more than 48% of the general population wouldn’t “notice any difference,” as Google would be less aggressive in personalizing their results.
Google has historically justified the storage of personal information as necessary for providing personalized services. If the company is unable to demonstrate significant benefits from personalization, then privacy advocates have more ammunition for restricting Google’s data collection practices.
As a user I’m not impressed with Google’s personalized search, and I think the use cases for personalized search that Google and others have talked about are usually too contrived anyway, but I believe it’s still too early to jump to conclusions. Like most R/WW readers, I’m quite savvy with search. I’m pretty good at specifying queries to find what I’m looking for. (And search results that I’m not happy with are almost never due to a lack of personalization.) I know I should search for ‘java indonesia’ if I want to find out about the island and ‘java sdk’ if I’m looking for the Java development kit. If I lived in Texas, I wouldn’t necessarily be impressed if searching for ‘paris’ gave me results for Paris, Texas instead of Paris, France. Of course, that’s just me. Other people may be different… or maybe they’re not.
Via today’s WSJ article (subscription required),
Today, Microsoft Corp. will announce new policies and technologies to protect the privacy of users of its Live Search services, say executives at the software maker. The company, along with IAC/InterActiveCorp.’s Ask.com, will also announce plans to try to kick-start an industrywide initiative to establish standard practices for retaining users’ search histories.
Meanwhile, Yahoo Inc. this week will begin detailing its plans for a policy to make all of a user’s search data anonymous within 13 months of receiving it, except where users request otherwise or where the company is required to retain the information for law-enforcement or legal processes, according to a spokesman.
It’s a welcome effort for the search engines to start valuing privacy. The WSJ article also points out an interesting strategic implication,
Both [Microsoft and Ask] lag far behind Google and Yahoo in Internet-search market share and thus have far less data about search behaviors than their rivals. By calling for more defined standards on privacy, Microsoft could indirectly limit Google’s ability to use its vast stores of information to improve its services.
Not only does Google have more data due to their higher market share, they’ve also been collecting such data far longer than anyone else. (I mean usable data. Of course Microsoft and Yahoo have been around a lot longer than Google, but they don’t seem to have had a coherent data collection policy for a long time and much of their older data may not be very usable.) Google’s recent strategic emphasis on personalization would also be blunted if they’re limited in how much data they can collect. “Privacy” policies that lessen the usefulness of older data would hurt Google a lot more than it would hurt Microsoft.
Bruce Schneier, a noted security expert, has written a Wired column titled Strong Laws, Smart Tech Can Stop Abusive ‘Data Reuse’. In the article he notes that most privacy violations are the result of data reuse.
When we think about our personal data, what bothers us most is generally not the initial collection and use, but the secondary uses. I personally appreciate it when Amazon.com suggests books that might interest me, based on books I have already bought. I like it that my airline knows what type of seat and meal I prefer… What I don’t want, though, is any of these companies selling that data to brokers, or for law enforcement to be allowed to paw through those records without a warrant.
From a data strategist’s point of view, ‘data reuse’ is in fact a broad tool that’s usually innocuous. Almost all analytic applications look at data that was originally collected just for transaction purposes. PageRank reuses link data for ranking web pages, even though link data was originally designed only for navigation. Even Bruce’s Amazon book-suggestion example is a reuse of data: your purchase data’s ‘first use’ is for purchasing, and using it for recommendations is secondary.
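To make the PageRank example concrete, here is a minimal power-iteration sketch over a toy link graph. This illustrates the reuse of navigational link data for ranking; it is not Google’s production algorithm, and the graph is invented:

```python
def pagerank(links, damping=0.85, iters=50):
    """Minimal power-iteration PageRank over a dict of outgoing links."""
    nodes = set(links) | {v for nbrs in links.values() for v in nbrs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in nodes}
        for u, nbrs in links.items():
            if nbrs:
                share = damping * rank[u] / len(nbrs)
                for v in nbrs:
                    new[v] += share
            else:
                # Dangling page: spread its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Links exist so readers can navigate; PageRank reuses them as votes.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
```

Page “a” ends up ranked highest because both “b” and “c” link to it, a signal that was never part of the links’ original navigational purpose.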
However, when it comes to personal information, Bruce does have a point in that people have a certain expectation of control. European law respects such control by forbidding the sale of personal information and the cross-referencing of different databases on people. (At least that’s my limited understanding.) Besides law, technology can also play a role. The Stanford database group has published a Vision Paper: Enabling Privacy for the Paranoids that examines the use of agent and security technologies that let individuals retain control of their information. Specifically, their P4P framework “seeks to contain illegitimate use of personal information that has already been released to an external (possibly adversarial) entity.” That is, to contain the illegitimate reuse of personal info. They start off with simple examples such as (automatically) generating a unique email address for each merchant you come into contact with. You can audit and turn off any email address that’s found to be used for inappropriate purposes. The paper goes on to suggest other techniques for other forms of data and purposes. It’s only a vision paper and by no means are all the issues dealt with, but it certainly is food for thought.
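The unique-address idea can be sketched by deriving an alias from an HMAC over the merchant’s name. This is an assumed scheme for illustration only, not the P4P paper’s actual protocol, and the key, merchant names, and domain are all made up:

```python
import hashlib
import hmac

def merchant_alias(secret_key, merchant, domain="example.com"):
    """Derive a distinct, reproducible email alias per merchant.
    If spam arrives at an alias, you know which merchant leaked it,
    and you can disable just that address."""
    tag = hmac.new(secret_key, merchant.encode(), hashlib.sha256).hexdigest()[:12]
    return f"me-{tag}@{domain}"

key = b"my-private-key"
addr_bookstore = merchant_alias(key, "bookstore")
addr_airline = merchant_alias(key, "airline")
# Each merchant sees a different address, so reuse is auditable.
```

The HMAC makes aliases reproducible from the key alone (no database of aliases to keep), while outsiders can’t forge or correlate them without the key.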