Data Strategy

September 29, 2007

“A Short Course in Thinking About Thinking”

Filed under: People and Data — chucklam @ 8:34 pm

Danny Kahneman, Nobel Prize winner and co-creator of behavioral economics, led a two-day class on “thinking about thinking.” A sampling of videos and corresponding transcripts of the class sessions is available online. His work tends to explore biases in people’s thinking. Kahneman is widely influential in psychology, decision theory (the descriptive kind), and economics. However, his work deserves an even broader audience.

On two views of a problem: In business, people sometimes do both a “top-down” and a “bottom-up” forecast to ensure reasonable prediction. This can be abstracted to problem solving in general.

[T]here are two ways of looking at a problem; the inside view and the outside view. The inside view is looking at your problem and trying to estimate what will happen in your problem. The outside view involves making that an instance of something else—of a class. When you then look at the statistics of the class, it is a very different way of thinking about problems. And what’s interesting is that it is a very unnatural way to think about problems, because you have to forget things that you know—and you know everything about what you’re trying to do, your plan and so on—and to look at yourself as a point in the distribution is a very un-natural exercise; people actually hate doing this and resist it.

On Bernoulli’s utility theory:

Lots of very very good people went on with the missing parameter for three hundred years; theory has the blinding effect that you don’t even see the problem, because you are so used to thinking in its terms. There is a way it’s always done, and it takes somebody who is naïve, as I was, to see that there is something very odd, and it’s because I didn’t know this theory that I was in fact able to see that.

On people’s inability to predict their own happiness: In one experiment, his team paid volunteers to eat the same kind of ice cream for eight days. The volunteers were asked to predict their rating of the ice cream on the last day.

Most people get tired of the ice cream, but some of them get kind of addicted to the ice cream, and people do not know in advance which category they will belong to. The correlation between the change that actually happened in their tastes and the change that they predicted was absolutely zero.

It turns out—this I think is now generally accepted—that people are not good at affective forecasting. We have no problem predicting whether we’ll enjoy the soup we’re going to have now if it’s a familiar soup, but we are not good if it’s an unfamiliar experience, or a frequently repeated familiar experience.

Furthermore, from other research…

It turns out that people’s beliefs about what will make them happier are mostly wrong, and they are wrong in a directional way, and they are wrong very predictably.

(Here an interesting question pops into my head. While most people cannot predict at all what will make them happy, are there people who are actually good at it? If so, what are they like?)

On people’s ability to maintain logical consistency: There are many examples of experiments where people are asked related questions but give logically incompatible answers. It’s easy for us to look at the questions/answers side by side and see the inconsistency, but…

The point is that life serves us problems one at a time; we’re not served with problems where the logic of the comparison is immediately evident so that we’ll be spared the mistake. We’re served with problems one at a time, and then as a result we answer in ways that do not correspond to logic.


September 25, 2007

Social networks turning to targeted advertising

Filed under: Advertising, Personalization — chucklam @ 5:04 am

I saw a number of articles in the last couple of months on how MySpace and Facebook are turning to personalized ad targeting. Just thought to note them here.

September 21, 2007

Google to enable access to its social graph data

According to a TechCrunch post:

Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.

Not much info yet, but can’t wait…

September 19, 2007

Examining MySpace usage by high school

Filed under: People and Data — chucklam @ 4:55 pm

My last blog post examined current Facebook usage among high school students and found that Facebook penetration is noticeably higher for private high schools and highly rated public schools. The analysis was motivated by Danah Boyd’s personal observation that Facebook and MySpace usage reflects a social/economic/aesthetic class difference in America. Today I will look at some MySpace data to see if it provides further evidence for her observation.

For the Facebook analysis, I was able to get the percentage of students at each high school with Facebook profiles. Getting the same data for MySpace would be ideal. Unfortunately, such information in MySpace is extremely unreliable. As a substitute, I’ve defined a metric called “MySpace intensity” for each high school. It’s the total number of MySpace profiles that claim a certain high school divided by the current population of that school. The intensity can be (and often is) higher than one since both current students and alumni are included. For those interested, I’ll give more details on the data collection at the end of this post.

As in my Facebook analysis, I examined a number of public high schools in San Francisco and noted their “GreatSchools Rating” from the GreatSchools site (which is also where I found the population data for the high schools).

School               MySpace intensity   GreatSchools Rating
Lowell               0.95                10
Abraham Lincoln      1.02                8
School of the Arts   0.64                7
George Washington    1.09                7
Balboa               1.53                6
Wallenberg           1.01                6
Phillip Burton       1.16                6
Thurgood Marshall    1.19                5
Mission              1.33                4
ISA                  1.2                 4
Independence High    0.81                4

Here the trend is clearly the opposite of Facebook’s. MySpace intensity is higher for the low-rated schools. The social networking site is just not as popular in the “better” schools. Independence High School may seem like an anomaly, but remember that Independence did not even have a Facebook network at all. The students there may simply not be very networked online.

To put the Facebook and MySpace analysis together, I’ve defined a “Facebook/MySpace index” (F/M index). For each high school, its F/M index is the number of current students with Facebook profiles divided by the number of current and past students with MySpace profiles. It’s equivalent to the Facebook penetration rate divided by MySpace intensity. The result:

School               F/M index   GreatSchools Rating
Lowell               0.68        10
Abraham Lincoln      0.36        8
School of the Arts   0.88        7
George Washington    0.31        7
Balboa               0.23        6
Wallenberg           0.22        6
Phillip Burton       0.14        6
Thurgood Marshall    0.18        5
Mission              0.09        4
ISA                  0.10        4
Independence High    0.00        4

The table above clearly shows that school quality is indicative of its students’ taste preference for Facebook versus MySpace. The individual analysis of Facebook usage and MySpace usage shows that polarization goes both ways; neither Facebook nor MySpace is universally liked.
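The arithmetic behind the index is simple enough to sanity-check in a couple of lines. The Python sketch below, using numbers taken from the two tables, just divides penetration by intensity and reproduces the first two rows:

```python
def fm_index(fb_penetration, myspace_intensity):
    """Facebook/MySpace index as defined in the post: Facebook
    penetration rate divided by MySpace intensity."""
    return fb_penetration / myspace_intensity

# Reproduce two rows of the F/M table from the earlier numbers:
print(round(fm_index(0.65, 0.95), 2))  # Lowell -> 0.68
print(round(fm_index(0.37, 1.02), 2))  # Abraham Lincoln -> 0.36
```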

To get a sense of whether socio-economic class is a factor, I’ve gotten the measurements for some of the private high schools in San Francisco. (I’ve repeated the Facebook penetration rate here for comparison.)

School         Facebook penetration rate   MySpace intensity   F/M index
Sacred Heart   98%                         0.89                1.10
St. Ignatius   64%                         1.07                0.60
Mercy          47%                         1.61                0.29
Riordan        35%                         1.23                0.28

A surprising result here is that many private high school students are quite active on MySpace as well, more so than students from highly rated public high schools. However, they’re even more active on Facebook, so their F/M index is in the same range as that of the good public high schools. My personal knowledge also says not to lump the four private high schools together completely. While Sacred Heart and St. Ignatius are very much classic preppy private schools, Riordan tends to attract students who would otherwise be assigned to poor public schools (e.g. Mission High) and whose families essentially have to pay to get a decent education. Mercy is an all-girls school, and one can argue that MySpace has more appeal to teenage girls.

And advertisers may know this class difference between Facebook and MySpace already. On my Facebook page today, the ad I see is for Embassy Suites Hotel. When I go to MySpace, the first ad I see is for the new TV series GossipGirl, and yesterday the first ad I saw was for Trojan condoms…


To find the number of MySpace profiles for a high school, I use the MySpace classmate finder. Unlike on Facebook, I couldn’t get a list of high schools in San Francisco. Instead, I have to specify each high school’s name.


MySpace does seem to have an internal database of high schools in the U.S., just like Facebook’s. It’s just not as well exposed. One can see from the search results that it in fact knows International Studies Academy is in San Francisco.


Clicking on that link shows the search results for all MySpace profiles associated with International Studies Academy. The number of results returned is what I use to calculate a high school’s MySpace intensity. Although there’s a “Refine Your School Search” function that supposedly allows you to narrow down the results, it works very poorly. For example, if I filter by specific graduation or attendance years, a random sample of the search results shows users who in fact graduated from or attended the school in very different years. After several tries, I decided the search refinement feature simply doesn’t work and stopped using it.


September 14, 2007

Analyzing Facebook usage by high school demographic

Filed under: People and Data — chucklam @ 3:05 pm

Danah Boyd blogged a few months ago on Viewing American class divisions through Facebook and MySpace. She had observed that

Hegemonic American teens (i.e. middle/upper class, college bound teens from upwards mobile or well off families) are all on or switching to Facebook. Marginalized teens, teens from poorer or less educated backgrounds, subculturally-identified teens, and other non-hegemonic teens continue to be drawn to MySpace.

My anecdotal observation is very consistent with hers. The class division also explains why Silicon Valley (a very “hegemonic” class) never understood MySpace. MySpace’s rise had at first taken them completely by surprise, as they were only paying attention to Tribe, Orkut, Xanga, etc. (Friendster had already had its fame and was imploding at the time.) They grudgingly paid MySpace some respect after its $600M+ acquisition. Now that Facebook is the new hot social network, the Silicon Valley crowd is quick to dismiss MySpace again.

Unfortunately, Danah didn’t have any quantitative data to back up her observations. So I started to gather some data for Facebook and want to share them here. Since it’s only Facebook data, it doesn’t help in comparing Facebook with MySpace. However, the trend on Facebook alone is still interesting, and it’s not inconsistent with Danah’s argument either.

The tool of my data gathering effort is Facebook’s network browsing function:


Let me use International Studies Academy (ISA) in San Francisco, where I went for high school, as the first example. Going to this network’s homepage promptly shows a message that “The content on this Network page is restricted to members of this network only. Only current high school students can join high school networks,” which is fine since the only info I care about is already shown on the homepage.


That is, 52 current students at ISA are on Facebook. It’s a pretty low number. Ok… yeah… my high school is pretty ghetto. It’s not really the Facebook crowd. But let’s be more rigorous in correlating its ghetto-ness with Facebook usage. Looking at ISA’s profile page on the GreatSchools site, we see that the school has 421 students, giving it a Facebook penetration rate of 52/421 = 12%. The site also gives my high school a “GreatSchools Rating” of 4… out of a possible 10 😦

I next do the same analysis for the nerdiest high school in San Francisco – Lowell High School. Its FB penetration rate is 65% and its GreatSchools Rating is a perfect 10. Let’s look at a few more public high schools in San Francisco since I’m familiar with the school district.

School               FB penetration rate   GreatSchools Rating
Lowell               65%                   10
Abraham Lincoln      37%                   8
School of the Arts   56%                   7
George Washington    34%                   7
Balboa               35%                   6
Wallenberg           22%                   6
Phillip Burton       16%                   6
Thurgood Marshall    22%                   5
Mission              12%                   4
ISA                  12%                   4
Independence High    0% (no FB entry)      4

The correlation between FB usage and high school quality is quite good. Now, it would be nice to correlate Facebook usage with socio-economic class rather than just high school quality. Unfortunately, I couldn’t find any statistics on average household income for different high schools. Instead, let’s assume going to a private high school represents a higher socio-economic class. The FB penetration rates for some of the private high schools in SF are:

School         FB penetration rate
Sacred Heart   98%
St. Ignatius   64%
Mercy          47%
Riordan        35%

So being in a private high school or in a highly rated public one is very predictive of one being on Facebook as well. I’ve only presented data for San Francisco, but a similar analysis can be done for high schools across the U.S. (It only takes more time.) I believe the conclusion will be the same.

Again, I haven’t gathered any data on MySpace, so it’s still possible that Danah’s hypothesis is wrong. MySpace usage may in fact increase for students in private/”good” high schools too. My intuition is that Danah is right though, and hopefully I’ll get around to collecting some MySpace data to prove it one way or the other.

Update: I found some MySpace data and have done further analysis. See my post here.

Update 2: Teresa Klein did some further statistical analysis and found a correlation coefficient of 0.87 between Facebook penetration rate and GreatSchools Rating for the San Francisco public schools listed above. Better yet, she repeated the analysis for Seattle high schools and found a correlation coefficient of 0.79 there. The pattern seems to be holding. See her post here.
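Her correlation is easy to reproduce from the penetration table in this post. The quick Python pass below uses the eleven SF public school rows listed earlier; it lands at about 0.89, close to her 0.87 (the small gap presumably comes from rounding or from exactly which rows were included):

```python
from math import sqrt

# FB penetration rate (%) and GreatSchools Rating for the eleven
# SF public high schools in the table above, in the same order.
penetration = [65, 37, 56, 34, 35, 22, 16, 22, 12, 12, 0]
rating      = [10,  8,  7,  7,  6,  6,  6,  5,  4,  4,  4]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov  = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(varx * vary)

print(round(pearson(penetration, rating), 2))  # -> 0.89
```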

September 10, 2007

How AdSense can be improved with NLP

Filed under: Advertising — chucklam @ 3:25 pm

I have a web site that I’ve been using for the last 10 months to study web advertising and monetization. I was surprised to find that begging users for money is about as effective as AdSense. Partly that’s because users were more willing to send money than I had expected; partly it’s because AdSense hasn’t worked well for my site.

To give a little background, RoundedCornr is a site for amateur web designers who want to put rounded corners in their web page design. The AdSense ads I see now are

  1. Corner Protectors
  2. Nissan Lights on Sale
  3. Corner Board
  4. Help Elect Barack Obama (banner ad)

Of course, you may see different ads than I do. (And I’ll see different ones if I refresh.) The point is that they’re generally pretty irrelevant.

When I first put up RoundedCornr and saw the useless ads, my first reaction was to ping my friends working at AdSense to see if they would suggest anything. Well, I can safely say that having friends at Google doesn’t help you much. (And their “we’re not clueless, we’re just secretive” stance has never really worked on me…) I was told that AdSense is not optimized for “this kind of web site,” by which I assume they mean it works better for blogs and news sites. I was also told to change some of the wording to avoid triggering some of the bad ads. Well… it’s hard to avoid using the word “corner” in describing my site, and some ads just seem totally unrelated to anything I’ve said on my site anyway. I was also told to wait until the system learns from user click-throughs. Well… it’s been 10 months now…

For things like banner ads, it’s pretty easy to figure out why the targeting is so bad: there just isn’t enough inventory. Unfortunately, the content of these banner ads is also what I have the most issues with. I haven’t decided which presidential candidate to support yet, so it’s a bit misleading to have a Barack Obama ad. At one point, the banner ad was soliciting support for more border patrols. (Maybe it was triggered by my describing rounded corners with the word “border.”) I tried to remove that banner ad since it doesn’t reflect my political beliefs, but AdSense doesn’t give its users much control.

For text ads, where inventory is not a problem, what can AdSense do to be more targeted? Personalization and behavioral targeting have been suggested before, but I think a little natural language processing would help a lot more. And unlike search, NLP for contextual advertising can help without requiring much change in user behavior. The technology needed is also more achievable and should be within the grasp of companies like Powerset (see their first public demo here). The idea is to increase the semantic understanding of page content so that advertising is more semantically relevant.

Specifically, there are two technologies that I’m thinking of. One is to use a language parser that picks out the main subject words in the sentences of a page. Instead of indexing keywords based on simple statistics, one indexes only nouns and noun phrases. (Verbs, adverbs, etc. are quite secondary for semantic understanding, especially for advertising.)
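As a rough illustration of what noun-phrase indexing could look like, here is a toy extractor. It is only a sketch, not anyone's real technology: a real system would get tags from a statistical parser or tagger, whereas here the part-of-speech tags are supplied by hand just to show the indexing step:

```python
def noun_phrases(tagged):
    """Collect maximal runs of adjectives/nouns, keeping only runs
    that actually end in a noun (a stray adjective is discarded)."""
    phrases, run = [], []

    def flush():
        if run and run[-1][1].startswith("NN"):
            phrases.append(" ".join(w for w, _ in run))
        run.clear()

    for word, tag in tagged:
        if tag == "JJ" or tag.startswith("NN"):
            run.append((word, tag))
        else:
            flush()
    flush()
    return phrases

# Hand-tagged example sentence (Penn Treebank-style tags):
sentence = [("Generate", "VB"), ("rounded", "JJ"), ("corners", "NNS"),
            ("for", "IN"), ("your", "PRP$"), ("web", "NN"),
            ("page", "NN"), ("design", "NN")]
print(noun_phrases(sentence))  # -> ['rounded corners', 'web page design']
```

Only "rounded corners" and "web page design" get indexed; the verb and function words that trip up naive keyword matching never enter the index.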

The other NLP technology to use is word sense disambiguation. A word like “jaguar” has many senses: one sense is a type of animal, another a brand of cars. Automated techniques exist to figure out which sense is being used in a sentence. An AdSense advertiser should then be able to specify that she wants to advertise on pages that talk about “jaguar” in the car sense of the word, and not just any page that mentions “jaguar.”
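One classic technique for this is the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context around the ambiguous term. The sketch below uses two hand-written stand-in glosses, not a real sense inventory:

```python
# Hand-written stand-in glosses for the two senses of "jaguar";
# a real system would pull these from a sense inventory like WordNet.
SENSES = {
    "animal": "large spotted wild cat native to the americas",
    "car":    "british brand of luxury sports car and sedan vehicles",
}

def disambiguate(word, context):
    """Simplified Lesk: choose the sense whose gloss overlaps
    most with the context words."""
    context_words = set(context.lower().split())

    def overlap(sense):
        return len(set(SENSES[sense].split()) & context_words)

    return max(SENSES, key=overlap)

print(disambiguate("jaguar", "the new jaguar sedan is a luxury car"))
# -> 'car'
```

With that sense label attached at indexing time, the ad system can match a "jaguar (car)" advertiser only to pages using the car sense.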

Granted, it’s a lot easier said than done. Word usage on the web requires algorithms that are more dynamic and scalable than most academic research has considered. However, “semantic contextual advertising” is still simpler than “semantic search.” If I had the Powerset technologies, I’d seriously look at contextual advertising as another business model to go after.

September 8, 2007

AdSense versus just begging your users for money

Filed under: Advertising, Personalization — chucklam @ 4:46 pm

Last December I built a little web site for fun called RoundedCornr. The site is a simple way of automatically generating HTML/CSS code for creating rounded corners in a web page design. In addition to being a fun project, I also used the site as an opportunity to learn about several advertising/monetization schemes.

The site gets between 400 and 900 pageviews per day, with an average of 740 and a grand total of 205,000 to date. (As far as monetization is concerned, the site is really just a one-page site.) I have various monetization schemes on the page, including Google AdSense, Amazon Omakase, some affiliate marketing run by Commission Junction, and just plain begging for a $5 “fee.” (I ran into trouble with the Google Checkout police when I called it a “contribution,” but that’s another story.) The affiliate marketing is chosen by me, so it’s highly relevant, but it’s gotten no revenue at all in the last 10 months 😦 The Amazon Omakase program is Amazon’s contextual advertising/referral program. You can think of it as AdSense with “behavioral targeting,” except it only promotes products on Amazon and you get paid for referrals instead of clicks. Personally I’m impressed with how well targeted the Amazon ads are, but they’re quite poor in terms of monetization. People don’t seem to be clicking on them much, and so far only one user has ended up purchasing something. I don’t know why such well-targeted ads do so poorly. My hope is that it’s just specific to my site’s audience. Maybe they already got their web design book or whatever they needed from Amazon and have no need to purchase more…

The surprising thing is that AdSense has so far made only a few more dollars than begging. Begging has gotten me $120 in the last 10 months, and users have to scroll all the way down the web page and read the details to even know that I’m begging for $5. A straightforward calculation says that I’d lose half my revenue if I took out all the ads, but my sense is that many more users would feel comfortable paying the $5 “fee” if they didn’t see any ads on the page. Would that completely compensate for the loss of AdSense revenue? I don’t know. Maybe, maybe not, but I do feel that users’ contributions are a better-“quality” income than advertising revenue. In fact, I think the AdSense advertising is so bad on my site that I’m surprised anyone has clicked on the ads at all, even though somehow there have been 1,500 clicks. Of course, in the grand scheme of things, the financial impact is so small that it’s not worth strategizing over.

Anyways, that’s one data point from me. Anyone else want to share their experience?

September 6, 2007

Spelling corrector that learns from query logs

Filed under: Information Retrieval — chucklam @ 1:59 am

I had previously written a post pointing to an article by Peter Norvig (Director of Research at Google) on how to write a statistical spelling corrector. My post noted that Peter’s article didn’t explain how the spelling suggestions of search engines (such as Google) learn from query logs. Well, it turns out that Silviu Cucerzan and Eric Brill over at Microsoft had already published a great paper at EMNLP 2004 called Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users (pdf). It explains some of the unique challenges of spelling correction for search queries. For example, new and unique (but correct) terms become query terms all the time, so one can’t just construct a dictionary of correct spellings. Simple frequency counting also doesn’t work, as certain misspelled queries (“britny spears”) occur very often. And a misspelled query may be composed of correctly spelled terms (“golf war”). Fortunately, Silviu and Eric show how clever use of the query log can overcome these and other problems.
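To make the frequency-counting pitfall concrete, here is a toy corrector in the spirit of Norvig's article, scored against a made-up query log. Note that it fixes "britny spears" only because the correct spelling happens to be even more frequent in this toy log; flip the two counts and it would confidently return the misspelling, which is exactly the failure mode Cucerzan and Brill's iterative approach addresses:

```python
from collections import Counter

# Made-up toy query log; real logs have millions of distinct queries.
query_log = Counter({
    "britney spears": 50, "britny spears": 40, "brittany spears": 10,
})

def edits1(word):
    """All strings one insert/delete/replace/transpose away."""
    letters = "abcdefghijklmnopqrstuvwxyz "
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts    = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(query):
    """Return the most frequent logged query within one edit."""
    candidates = {c for c in edits1(query) if c in query_log} | {query}
    return max(candidates, key=lambda c: query_log[c])

print(correct("britny spears"))  # -> 'britney spears'
```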

September 3, 2007

Google AdSense enters affiliate marketing

Filed under: Advertising — chucklam @ 10:43 pm

I don’t remember hearing this anywhere else, so I was surprised when I checked the AdSense site to find the AdSense Referrals program had expanded from promoting just Google products (e.g. Google Apps, Google Pack, Firefox) to third party products.

Most people think of AdSense as “AdSense for Content,” where you let Google put ads on your site and you earn money on a pay-per-click model. Referrals (a.k.a. affiliate marketing) is very similar, except you earn money through a pay-per-action model. The “action” is determined by the advertiser, and it can mean an actual purchase, filling out a form, a software download, etc. The main strategic factor differentiating pay-per-click and pay-per-action is “conversion risk,” the risk of whether someone who’s clicked on an ad will actually convert into a buyer (or take some other action). Right now advertisers are taking that risk, and judging from the number of advertisers signing up with Google, it’s a worthy risk to take. The bet with pay-per-action is that Google is in an even better position to take on such risks, as Google can aggregate and smooth out the uncertainty and leverage its informational advantage to actually reduce such risk.
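The "aggregate and smooth out the uncertainty" point is essentially the law of large numbers. The toy simulation below (CPA and conversion-rate numbers are made up) shows how lumpy per-click PPA revenue is over a handful of clicks versus how tightly it settles around its mean once millions of clicks are pooled:

```python
import random

random.seed(1)
CPA, CONV_RATE = 50.0, 0.02  # made-up: $50 per action, 2% conversion

def ppa_revenue_per_click(n_clicks):
    """Realized pay-per-action revenue per click over n_clicks."""
    conversions = sum(random.random() < CONV_RATE for _ in range(n_clicks))
    return conversions * CPA / n_clicks

# 100 clicks: lumpy -- a single conversion swings revenue by $0.50/click.
print(ppa_revenue_per_click(100))
# A million pooled clicks: settles near the $1.00 expected value.
print(ppa_revenue_per_click(1_000_000))
```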

Of course, other factors will be involved in the success of the program as well. It’s not clear that publishers (i.e. site owners) care much for the PPA model, although Google may just be testing PPA on AdSense before moving it to AdWords on the main search engine site. (The truth is, any publisher who cares much about advertising income would’ve gone beyond AdSense a long time ago, but that’s for another post.) Advertisers may be hesitant to share so much information with Google, especially if Google is also their main source of traffic. The amount of work needed to integrate all the tracking/accounting is also non-trivial, which may turn off many advertisers, although here Google Checkout may eventually play a role.

The best known affiliate marketing networks are Amazon, eBay, and Commission Junction. I don’t think any of them will be too happy about this development from Google.

AdSense Referrals description

AdSense Referrals categories

September 1, 2007

Debunking the “small world” myth

Filed under: Network analysis, People and Data — chucklam @ 3:22 pm

It’s pretty well ingrained in popular educated culture (at least in the U.S.) that “everyone” is separated by no more than six degrees of separation, that it is a “small world.” The promoters of the idea often point to Stanley Milgram’s experiment of having “random” people in Kansas forward a letter to acquaintances until it reaches a specific person in Boston and that no more than five intermediaries were needed. The experiment has supposedly been repeated enough times to become solid “science.”

The fact that so many people believe such a ridiculous idea is itself a pretty interesting phenomenon, especially since it’s the more educated people who tend to believe it. I finally came across an article today, “Could It Be a Big World After All? The ‘Six Degrees of Separation’ Myth” by Judith S. Kleinfeld, that dug into Milgram’s archive at Yale and points out the paucity of evidence for the “small world” interpretation and the lack of experimental replication across subjects at any significant distance (i.e. from two different cities). She gives several possible explanations for the persistence of this “small world” myth.

As I listened to these descriptions of cherished small world experiences, I realized that these experiences had a different mathematical structure from the classic small world problem that Milgram and the mathematicians were investigating. The classic “small world problem” is expressed in such forms as: What are the chances that two people chosen at random from the population will have a friend in common? But the small world experiences I was hearing about would be expressed mathematically in a very different form: What is the probability that you will meet a friend from your past or a stranger who knows a friend from your past over the course of your lifetime?

How likely would it be, particularly for educated people who travel in similar social networks, never to meet anyone anywhere anytime who knew someone from their past? We have a poor mathematical, as well as a poor intuitive, understanding of the nature of coincidence.

A poor intuitive understanding of probability partly explains people’s willingness to believe the “small world” myth. I think a deeper explanation is that people simply refuse to believe how predictable their social lives are. Rather than doing the hard work of meeting interesting people and living an interesting life, educated people do what they do best: rationalize things. When your “random” friends happen to know each other, the best explanation is not that it’s a “small world.” The best explanation is that your friends simply aren’t random!
