Data Strategy

August 30, 2007

Data mining and privacy

Filed under: Datamining, People and Data, Privacy — chucklam @ 4:09 pm

John Langford (at Yahoo) has a good post on The Privacy Problem in datamining at his Machine Learning (Theory) blog. The privacy issue is getting a lot of attention in the datamining community lately. In fact, there’s a whole research area on privacy-preserving datamining emerging, although most results to date have tended to demonstrate how hard it is to guarantee privacy. The negative publicity surrounding datamining has prompted KDnuggets (a newsletter for dataminers) to poll its readers on whether “datamining” has become an inaccurate or misunderstood term for what they do, especially given that a lot of datamining doesn’t deal with data about individuals.

The main issue is really the trade-off between the benefits and the privacy problems of large-scale data collection and retention. Unfortunately, there are so many unintended consequences, both good and bad, from data collection that it’s hard to even fully discuss the pros and cons. People who’ve worked with data know the positive potential of finding new uses for data originally collected for other purposes. On the other hand, even well-intentioned efforts, such as AOL’s release of query data, are problematic if they are not handled carefully. Of course, it doesn’t help that some government datamining efforts are just outright bad ideas to begin with.


Analyzing microblogging

Filed under: Datamining, People and Data, Visualization — chucklam @ 2:40 pm

Akshay Java and colleagues at the University of Maryland and NEC Laboratories presented a paper at KDD called Why We Twitter: Understanding Microblogging Usage and Communities. In the paper they analyzed the social graph of Twitter and all the public postings for two months, from April 1 to May 30 of this year, so the data is very recent. The conclusions are pretty much in line with what we have come to expect from social networks (power law distribution, community clusters, etc.). In the discussion section, they classify Twitter usage into four categories:

  1. Daily chatter (dominant category)
  2. Conversations (1/8 of all posts; these start with the @ symbol followed by a username)
  3. Sharing information/URLs (13% of all posts)
  4. Reporting news (often by automated agents)
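Two of these categories have surface markers that are easy to check mechanically: a leading @username for conversations and a URL for information sharing. Here is a minimal sketch of such a check. It is only an illustration; the regular expressions and the function names `classify_post` and `category_shares` are my own assumptions, not the method used in the paper. Daily chatter and news reporting can't be told apart this way, so they fall into a catch-all label.

```python
import re

# Heuristic patterns; these are assumptions for illustration,
# not the classification method from the paper.
REPLY_PATTERN = re.compile(r"^@\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def classify_post(text):
    """Return a rough label for a single post based on surface markers."""
    stripped = text.strip()
    if REPLY_PATTERN.match(stripped):
        return "conversation"  # starts with @ followed by a username
    if URL_PATTERN.search(stripped):
        return "sharing"  # contains a URL anywhere in the text
    return "other"  # daily chatter or news; needs real content analysis


def category_shares(posts):
    """Return the fraction of posts that fall under each label."""
    counts = {}
    for post in posts:
        label = classify_post(post)
        counts[label] = counts.get(label, 0) + 1
    total = len(posts)
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

Note that a post that starts with @username and also contains a URL is counted as a conversation here, since the reply check runs first.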

The interesting part to me was really how they collected the data. Apparently the whole Twitter network can be retrieved through its API. It’s usually pretty difficult for researchers to get access to large-scale social graphs, so it’s nice to know that the Twitter social graph is easy to get. The network is smaller than I had expected though. Its total user base (at the end of May) is a little less than 88,000. The amount of attention Twitter gets seems out of proportion to its number of users. In addition, the authors of the paper stated that “After an initial period of interest around March 2007, the rate at which new users are joining Twitter has slowed.”

August 29, 2007

My blog is censored in China :(

Filed under: Uncategorized — chucklam @ 5:10 pm

I thought posting would be light while I was traveling in China. I didn’t know that I wouldn’t be able to access my blog at all. At first I thought my blog was so important that the Chinese government went out of their way to block it, but it turned out that I just couldn’t access anything on I didn’t get a 404 but the connection just timed out. It was a similar experience trying to get to

I know the locals have special browsers and software to tunnel around and gain access to these “forbidden” sites. I wasn’t motivated enough (and I didn’t speak enough Mandarin) to figure it all out. Besides, I was already able to read all the blogs that I regularly follow, as I normally use Yahoo’s RSS Reader, which functions as a proxy. I had no problem accessing any of the American news sites (e.g. NYT, WSJ) either, as they were not blocked.

At any rate, I’m back and will start posting regularly again.

August 18, 2007

Posting will be light…

Filed under: Uncategorized — chucklam @ 5:39 pm

I’m traveling to China for the Int’l Conference on Intelligent Computing, so posting will be light for the next week or so.

August 17, 2007

Fight crime with datamining

Filed under: Datamining, Pattern recognition, People and Data — chucklam @ 2:03 am

Some police departments are using datamining to predict where and when crimes are more likely to occur.

Using some sophisticated software and hardware [the Richmond, Va. police service] started overlaying crime reports with other data, such as weather, traffic, sports events and paydays for large employers. The data was analyzed three times a day and something interesting emerged: Robberies spiked on paydays near cheque cashing storefronts in specific neighbourhoods. Other clusters also became apparent, and pretty soon police were deploying resources in advance and predicting where crime was most likely to occur.

Coupled with some other technological advancements, such as surveillance videos wirelessly transmitted to patrol cars, major crime rates dropped 21 per cent from 2005 to 2006. In 2007, major crime is down another 19 per cent.
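The overlay idea in the quoted passage, joining crime reports with a calendar of paydays and comparing rates, can be sketched in a few lines. This is only a toy illustration of the general technique, not the software the Richmond police used; the function name `robbery_rate_by_payday` and the input layout are my own assumptions.

```python
def robbery_rate_by_payday(reports, paydays, observed_days):
    """Average robberies per day, split into payday and non-payday days.

    reports: iterable of (day, offense) pairs
    paydays: set of days that are paydays
    observed_days: iterable of every day in the study window
    A "day" can be any hashable value, such as a datetime.date.
    """
    # Count robberies on each day, ignoring other offense types.
    robberies_per_day = {}
    for day, offense in reports:
        if offense == "robbery":
            robberies_per_day[day] = robberies_per_day.get(day, 0) + 1

    # Accumulate robbery totals and day counts for each group.
    totals = {"payday": 0, "other": 0}
    day_counts = {"payday": 0, "other": 0}
    for day in observed_days:
        group = "payday" if day in paydays else "other"
        totals[group] += robberies_per_day.get(day, 0)
        day_counts[group] += 1

    # Days with no reports still count toward the denominator.
    return {
        group: (totals[group] / day_counts[group] if day_counts[group] else 0.0)
        for group in totals
    }
```

A payday rate well above the non-payday rate would be the kind of spike the article describes. A real analysis would also break the counts down by location and time of day.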

August 15, 2007

Chinese web encyclopedia

Filed under: Collective wisdom, Open Data — chucklam @ 4:26 am

PC Advisor has an article about Wikipedia accusing Baidu Baike, the user-generated Chinese web encyclopedia, of copyright violations. While the article focuses on the accusation that some of Baidu Baike’s content was copied from Wikipedia without attribution, I found the description of Baidu Baike quite interesting in itself, as I wasn’t aware of the site’s existence until now.

Baidu Baike [is] the largest online Chinese-language encyclopedia. [It] contains more articles than any Wikipedia except the English-language Wikipedia. Baidu Baike boasted 809,237 entries as of Sunday, edging out the German edition of Wikipedia, which has 619,612 entries, for second place.

And Baidu Baike was able to get to its size in spite of extra hurdles for contributors.

Anyone wishing to publish entries on Baidu Baike must register first, giving the site people’s real names, and site administrators review all entries before posting, a way to ensure compliance with Chinese censorship laws.

The copyright accusation, if true, would create some philosophical dilemmas for Wikipedia. Since the Chinese version of Wikipedia is blocked in China, Baidu Baike can in fact be considered as a proxy for people to access Wikipedia’s content. If their disagreement is not successfully resolved, then hypothetically Wikipedia may have to choose between protecting its copyright license and promoting easy access to its content for the people of China.

However, Wikipedia may in fact be too toothless to take action any which way. It has to rely on its moral authority, as its legal power is quite weak.

“The foundation does not hold a copyright on the articles, the editors or the authors do, so there is very little we can do,” said Nibart-Devouard [, chair of the Board of Trustees at the Wikimedia Foundation.]

August 14, 2007

Testing semantic search

Filed under: Information Retrieval, Search — chucklam @ 2:25 am

The Hakia blog has an interesting post on creating query sets to test semantic (i.e. “natural language”) search. The main point is that the query set must have sufficient coverage to test four aspects of natural language search: query type, query length, content type, and word sense disambiguation.

Testing of a semantic search engine requires at least 330 queries just to scratch the surface. hakia’s internal fitness tests, for example, use couple of thousand queries. Therefore, if you see any report or article about the evaluation of a search engine using a dozen of queries, even if it includes valuable insight, it will tell you nothing about the overall state of that search engine.

The distribution of test queries will ultimately be based on real usage, which of course is limited at this point. In general, the more sophisticated the function a system is supposed to perform, the more test cases one needs to evaluate it. For something as sophisticated as natural language search (or even plain keyword search), testing is definitely non-trivial.

Powerset’s first public demo was limited in all four of those aspects. Of course, keep in mind that the demo was just to give the public a peek at Powerset’s search technology, not an exhaustive evaluation of it. It was also meant to illustrate the difference between Powerset’s approach and Google’s approach, not to compare and contrast Powerset with Hakia. Personally, I’m just glad that both startups are starting to reveal some of their inner workings.

August 11, 2007

David Heckerman interview

Filed under: Bayesian networks, Pattern recognition — chucklam @ 11:31 pm

A little over a month ago CNet published an interview with David Heckerman, lead researcher of Microsoft’s Machine Learning and Applied Statistics Group. I haven’t heard much about David lately, and apparently he’s been busy developing open source analytical tools for HIV research.

Back in 1990, David won the ACM Doctoral Dissertation Award for his thesis “Probabilistic Similarity Networks”. He was a major influence in the ’90s in establishing Bayesian networks and Bayesian methodologies as practical tools in AI and machine learning. I remember my adviser lending me a copy of David’s thesis when I first started my PhD study. He told me to read it and to strive for a thesis of similar caliber. (Yes… the first couple of years of graduate study tend to be filled with optimism and ambition.) From David’s thesis I saw the potential of quality research.

In the last five years or so, I haven’t come across any publication by David Heckerman. It’s great to learn that he’s still doing great work, just now in a slightly different field.

August 8, 2007

Early results on Google’s personalized search

Filed under: Information Retrieval, Personalization, Privacy, Search — chucklam @ 4:29 pm

Read/WriteWeb has a couple of interesting posts on Google’s personalized search. One is a primer written by Greg Linden. In the primer, Greg explains what personalized search is, the motivation for it, Google’s personalization technology from its Kaltix acquisition, and other approaches to personalization. I agree with one of the commenters that Kaltix’s technology probably plays only a small role in Google’s personalized search today, as that technology was invented many years ago.

The other post on Read/WriteWeb discusses the findings of a poll in which R/WW asked its readers how effective Google’s personalized search is. Their conclusion is that it is quite underwhelming. 48% “haven’t noticed any difference.” Although 12% claim their “search results have definitely improved,” 9% think their search results “have gotten worse.”

I would venture to guess that R/WW readers are more Web savvy and use search engines more often than the average person. Thus Google would have a lot more data about them and give them the most personalized results. (Yes, it’s only a conjecture. They may in fact be so savvy that they use anonymizing proxies, multiple search engines, etc.) If that’s the case, then even more than 48% of the general population wouldn’t “notice any difference” as Google would not be as aggressive in giving them personalized results.

Google has historically justified the storage of personal information as a necessity for providing personalized services. If they’re unable to demonstrate significant benefits from personalization, then privacy advocates have more ammunition for restricting Google’s data collection practices.

As a user I’m not impressed with Google’s personalized search, and I think the use cases for personalized search that Google and others have talked about are usually too contrived anyway, but I believe it’s still too early to jump to conclusions. Like most R/WW readers, I’m quite savvy with search. I’m pretty good at specifying queries to find what I’m looking for. (And search results that I’m not happy with are almost never due to a lack of personalization.) I know I should search for ‘java indonesia’ if I want to find out about the island and ‘java sdk’ if I’m looking for the Java development kit. If I live in Texas, I won’t necessarily be impressed if searching for ‘paris’ gives me results for Paris, Texas instead of Paris in France. Of course, that’s just me. Other people may be different… or maybe they’re not.

August 7, 2007

Advertising’s digital future

Filed under: Advertising, Datamining, Personalization, Statistical experimentation — chucklam @ 12:23 pm

The New York Times yesterday had an article on advertising’s digital future. It mostly discussed the view of David W. Kenny, chairman and chief executive of Digitas, the advertising agency in Boston that was acquired by the Publicis Groupe for $1.3 billion six months ago.

The plan is to build a global digital ad network that uses offshore labor to create thousands of versions of ads. Then, using data about consumers and computer algorithms, the network will decide which advertising message to show at which moment to every person who turns on a computer, cellphone or — eventually — a television.

“Our intention with Digitas and Publicis is to build the global platform that everybody uses to match data with advertising messages,” Mr. Kenny said.

That is, advertising in the future will be much more data driven. Now, if we take that vision for granted, then the interesting question will be: who will end up controlling what data? No doubt Mr. Kenny would love to see advertising agencies being the central gateway to, if not the outright owner of, all such data. However, privacy advocates, media companies, new “intermediaries”, and search engines like Google all have different ideas about their ownership of data and their place in this advertising future. It’s too early to tell how things will turn out, and everyone is making educated guesses.

“How do we see Google, Yahoo and Microsoft? It’s important to see that our industry is changing and the borders are blurring, so it’s clear the three of those companies will have a huge share of revenues which will come from advertising,” said Maurice Lévy, chairman and chief executive of the Publicis Groupe.

“But they will have to make a choice between being a medium or being an ad agency, and I believe that their interest will be to be a medium,” he added. “We will partner with them as we do partner with CBS, ABC, Time Warner or any other media group.”

I wonder if Mr. Lévy has considered the possibility that in this digital future, Google may in fact be CBS, ABC, and Time Warner combined.
