John Langford (at Yahoo) has a good post on The Privacy Problem in datamining at his Machine Learning (Theory) blog. The privacy issue has been getting a lot of attention in the datamining community lately. In fact, a whole research area on privacy-preserving datamining is emerging, although most results to date have tended to demonstrate how hard it is to guarantee privacy. The negative publicity surrounding datamining has prompted KDnuggets (a newsletter for dataminers) to poll its readers on whether “datamining” has become an inaccurate or misunderstood term for what they do, especially given that a lot of datamining doesn’t deal with data about individuals.
The main issue is really the trade-off between the benefits and the privacy problems of large-scale data collection and retention. Unfortunately, data collection has so many unintended consequences, both good and bad, that it’s hard to even fully discuss the pros and cons. People who’ve worked with data know the positive potential of finding new uses for data originally collected for other purposes. On the other hand, even well-intentioned efforts, such as AOL’s release of query data, are problematic if they are not handled carefully. Of course, it doesn’t help that some government datamining efforts are just outright bad ideas to begin with.