The first Predictive Analytics World is coming to San Francisco on Feb. 18 and 19. It has a great line-up of accomplished speakers. Usama Fayyad (of Yahoo) and Andreas Weigend (formerly of Amazon and a good friend of mine) are keynote speakers. The Netflix Progress Prize winner “BellKor in BigChaos,” which achieved a 9.63% improvement over Netflix’s baseline system, will speak about their system in a case study.
Readers of this blog can get a 15% discount on the conference by using the registration code: datastrategypaw09. The conference promises to be a good overview of the state of the industry in data analytics.
The conference organizers are also running an online survey to examine trends in data analytics tools and the industry. You can take the survey at https://www.surveymonkey.com/s.aspx?sm=8dHx_2bFz7yxw3FPKlbi3OVg_3d_3d.
Just found out that some of my favorite Stanford engineering classes have published their videos online for the general public. Readers of this blog may be especially interested in Chris Manning’s class on Natural Language Processing and Andrew Ng’s class on Machine Learning. In addition, having been an EE person myself, I can highly recommend Steve Boyd’s classes on Linear Systems for those who can deal with more advanced math. He teaches Introduction to Linear Dynamical Systems and Convex Optimization I/II.
A few more classes are also available. Go to the course listing page here.
Google has released a series of YouTube interviews with their lead engineers. Embedded below is one about MapReduce. The four engineers interviewed include the inventors of MapReduce. Some quotes:
6:17 – If we haven’t had to deal with [machine] failures… we would have probably never implemented MapReduce. Because without having to support failures, the rest of the machine code is just not that complicated.
7:20 – (Interviewer) What do you feel the technology [MapReduce] isn’t applicable for?… (Sanjay Ghemawat, Google Fellow) you can always squint at [a problem] in the right way… you can usually find a way to express it as a MapReduce…, but sometimes you have to squint at things in a pretty strange way to do this… For example, suppose you want to compute the cross correlation of every single pair of web pages in terms of saying what is the similarity… I can run a pass where I just sort of magnify the input into the cross product of the inputs and then I can apply a function on each pair in there saying how similar it is. Your intermediate data will be quadratic in the size of the input, so you probably don’t want to do it that way. So you’ll have to think a bit more carefully what your intermediate data is in that case… There’s a lot of thinking at the application level if you want to use MapReduce in that scenario. [Emphasis mine]
18:14 – (Matt Austern, SW engr) One of the core implementation issues in MapReduce is how you get the intermediate data from the Mappers to the Reducers. Every Mapper writes to every Reducer, and so it ends up making very heavy use of the network… (Interviewer) If you really want to provide a lot of computing, it’s very easy, one would think, to just buy lots more microprocessor… but the issue is communication between them… (Jerry Zhao, SW engr) Communication is not only the limit. How to coordinate the communication channel itself is also an interesting problem.
20:17 – MapReduce was originally designed as a batch processing system for large quantity of data. But we see our users are using MapReduce for relatively small set of data but have very strict latency requirement.
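To make the quadratic blow-up in the 7:20 quote concrete, here is a minimal in-memory sketch of the naive approach: one pass magnifies the input into every pair of pages, and a second pass scores each pair. Everything in it is my own assumption for illustration: the `run_mapreduce` helper is a toy stand-in (not Google's implementation), the page contents are made up, and Jaccard similarity is just one convenient choice of similarity function.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Toy, single-process stand-in for a MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    emitted = 0
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
            emitted += 1
    results = {key: reducer(key, values) for key, values in groups.items()}
    return results, emitted


def jaccard(words_a, words_b):
    """Similarity of two word lists: shared words / all distinct words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_product(pages):
    """Pass 1: magnify n pages into n*(n-1)/2 pair records."""
    items = sorted(pages.items())
    return [((id_a, id_b), (words_a, words_b))
            for (id_a, words_a), (id_b, words_b) in combinations(items, 2)]


def similarity_mapper(record):
    """Pass 2 mapper: score one pair of pages."""
    (id_a, id_b), (words_a, words_b) = record
    yield (id_a, id_b), jaccard(words_a, words_b)


def first_value(key, values):
    """Each pair key is emitted once, so the reducer just passes it through."""
    return values[0]


def pair_count(n):
    """Number of intermediate records the naive approach creates for n pages."""
    return n * (n - 1) // 2


# Made-up example data.
pages = {
    "a": ["map", "reduce", "google"],
    "b": ["map", "reduce", "cluster"],
    "c": ["recipe"],
}
pairs = cross_product(pages)
scores, emitted = run_mapreduce(pairs, similarity_mapper, first_value)
print(scores[("a", "b")], emitted, pair_count(1000))
```

With three pages the intermediate data is only three records, but `pair_count` grows with the square of the input, which is why the naive formulation stops being practical long before web scale.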
This is probably beside the point, but everyone in the video except maybe Sanjay sounds really scripted and robotic…
Digg released their “collaborative filtering” system a month ago. Now they’ve blogged about some of the initial results. While it’s an obviously biased point of view, things look really good in general.
- “Digging activity is up significantly: the total number of Diggs increased 40% after launch.”
- “Friend activity/friends added is up 24%.”
- “Commenting is up 11% since launch.”
What I find particularly interesting here is the historical road that “collaborative filtering” has taken. The term “collaborative filtering” was first coined by Xerox PARC more than 15 years ago. Researchers at PARC had a system called Tapestry. It allowed users to “collaborate to help one another perform filtering by recording their reactions to documents they read.” This, in fact, was a precursor to today’s Digg and Delicious.
Soon after PARC created Tapestry, automated collaborative filtering (ACF) was invented. The emphasis was to automate everything and make its usage effortless. Votes were implied by purchasing or other behavior, and recommendations were computed in a “people like you have also bought” style. This style of recommendation was so successful that it has completely taken over the term “collaborative filtering” ever since.
In the Web 2.0 wave, companies like Digg and Delicious revived the Tapestry style of collaborative filtering. (Although I’d be surprised if those companies had done so as a conscious effort.) They were in a sense stripped-down versions of Tapestry, blown up to web scale, and made extremely easy to use. (The original Tapestry required one to write database-like queries.)
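For readers who have not seen ACF in code, here is a minimal sketch of the “people like you have also bought” idea, assuming purchases are the implied votes. The `also_bought` function, the overlap-count weighting, and the sample data are my own illustration, not any particular company's algorithm.

```python
from collections import Counter


def also_bought(purchases, user, top_n=3):
    """Recommend items bought by users whose purchases overlap with `user`'s.

    `purchases` maps each user to the set of items they bought. Each
    candidate item is weighted by how many purchases its buyer shares
    with `user` (a simple, assumed notion of "people like you").
    """
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(mine & items)
        if overlap == 0:
            continue  # this user is not "like you" at all
        for item in items - mine:
            scores[item] += overlap
    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Made-up example data.
purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book", "pen"},
    "dan": {"sofa"},
}
print(also_bought(purchases, "ann"))
```

Note that nobody is asked to rate anything here: the recommendation falls out of behavior the users were already engaged in, which is what made this style effortless to use.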
Now Digg, which one can think of as Tapestry 2.0, is adding ACF back into its style of recommendation and getting extremely positive results. Everything seems to have moved forward, and at the same time it seems to have come full circle.
It’s a truism among dataminers that analyzing certain data can help us understand people. However, dataminers rarely see that psychology, the discipline of understanding people, can help get more value out of such data. It was recently reported that one of the top ten contestants in the Netflix Prize approached the challenge from a psychologist’s point of view rather than from a computer scientist’s.
For example, people don’t often give their “true” rating on movies. Instead, they can be biased by anchoring. That is, their rating of a movie is influenced by the ratings they have just given to other movies. Adjusting for biases such as this is how Gavin Potter, aka “Just a guy in a garage,” got to be number 9 on the Netflix Prize leaderboard.
Many of us are well aware that retailers, credit card companies, and other consumer-oriented businesses amass huge amounts of information about people and mine such data to improve sales and profitability. However, it’s much less publicized how politicians also collect and mine voter information to help win elections.
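As a rough illustration of what adjusting for anchoring could look like, here is a toy sketch. It is emphatically not Potter's actual method, which I have not seen described in detail. It assumes a made-up linear model in which each observed rating is pulled toward the previous rating by a fraction `alpha` of that previous rating's distance from the user's own mean; both the model and the value of `alpha` are hypothetical.

```python
def adjust_for_anchoring(ratings, alpha=0.2):
    """Remove an assumed anchoring effect from one user's ratings.

    `ratings` is the user's ratings in the order they were given.
    Assumed toy model: observed = true + alpha * (previous - user_mean).
    The first rating has no predecessor and is left unchanged.
    """
    if not ratings:
        return []
    mean = sum(ratings) / len(ratings)
    adjusted = [float(ratings[0])]
    for previous, current in zip(ratings, ratings[1:]):
        adjusted.append(current - alpha * (previous - mean))
    return adjusted


# Made-up rating sequence on a 1-5 scale.
ratings = [5, 3, 1, 3]
adjusted = adjust_for_anchoring(ratings)
print(adjusted)
```

In this sequence the 3 given right after a 5 is nudged down and the 3 given right after a 1 is nudged up, on the theory that the neighboring rating pushed each one in that direction.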
A recent article in Vanity Fair called Big Brother Inc. digs into a company that focuses on collecting and mining data on voters. The company, Aristotle Inc., “contains detailed information about roughly 175 million American voters… [It] has served as a consultant for every president since Ronald Reagan… In the 2006 elections, Aristotle sold information to more than 200 candidates for the House of Representatives…, a good portion of those running for Senate, and candidates for governor from California to Florida, to New York.”
“Aristotle can tell its clients more than just the predictable stuff—where you live, your phone number, who lives with you, your birthday, how many children you have. It may also know how much you make, how much your house is worth, what kind of car you drive, what Web sites you visit, and whether you went to college, attend church, own guns, have had a sex change, or have been convicted of a felony or sex crime.”
Also, an interesting point in the article about hardball tactics in politics: Although lists of registered voters are technically public information, it is sometimes not easy to obtain them. Local party bosses can make it difficult or expensive for opposing candidates to get a hold of those lists, and thus the political opponents will be at a disadvantage in getting their message out to the right audience.
Date & time: Wednesday, January 16 4:35-5:45 pm
Speaker: Jon Kleinberg, Cornell University
Title: Computational Perspectives on Large-Scale Social Network Data
The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.
We consider two lines of research within this general theme. The first is concerned with modeling the flow of information through a large network: the spread of new ideas, technologies, opinions, fads, and rumors can be viewed as unfolding with the dynamics of an epidemic, cascading from one individual to another through the network. This suggests a basis for computational models of such phenomena, with the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving.
The second line of research we consider is concerned with the privacy implications of large network datasets. An increasing amount of social network research focuses on datasets obtained by measuring the interactions among individuals who have strong expectations of privacy. To preserve privacy in such instances, the datasets are typically anonymized — the names are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed. Unfortunately, there are fundamental limitations on the power of network anonymization to preserve privacy; we will discuss some of these limitations (formulated in joint work with Lars Backstrom and Cynthia Dwork) and some of their broader implications.
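(A note from me, not part of the abstract: the kind of anonymization described above can be sketched in a few lines. The code below replaces names with meaningless integer identifiers and then checks that the degree sequence is unchanged. The function names and the sample edge list are my own, and the degree check is only meant to give an intuition for why preserved structure might leak information; it is not the analysis from the work mentioned in the abstract.)

```python
import random
from collections import Counter


def anonymize(edges, seed=0):
    """Replace each name in an edge list with a meaningless unique integer."""
    names = sorted({name for edge in edges for name in edge})
    ids = list(range(len(names)))
    random.Random(seed).shuffle(ids)  # break any link between name order and id
    mapping = dict(zip(names, ids))
    return [(mapping[a], mapping[b]) for a, b in edges], mapping


def degree_sequence(edges):
    """Sorted list of how many edges touch each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return sorted(degrees.values())


# Made-up social network.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]
anon_edges, mapping = anonymize(edges)
print(anon_edges)
print(degree_sequence(edges), degree_sequence(anon_edges))
```

The names are gone, but the one node with three connections is still the one node with three connections, so anyone who knows that fact about Alice from another source has a starting point.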
Jon Kleinberg is a Professor in the Department of Computer Science at Cornell University. His research interests are centered around issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a Fellow of the American Academy of Arts and Sciences, and the recipient of MacArthur, Packard, and Sloan Foundation Fellowships, the Nevanlinna Prize from the International Mathematical Union, and the National Academy of Sciences Award for Initiatives in Research.
Aaron Swartz just announced a new Web site (theinfo.org) he created for people who love large data sets, “the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.”
The site is very spartan right now, but it can certainly become very interesting if it attracts the right contributors. I hope he succeeds in building a community around it.
Looks like web analytics is growing to be a big enough field for universities to start offering certificate programs in it. A brief description of the UC Irvine program is below. The full press release is here.
The program provides a foundational knowledge of practical Web analysis to experienced marketing and business professionals in light of the changing face of e-commerce. The impetus for the program is a growing need for Web site evaluation spurred by companies challenged in their search for new customers due primarily to an explosive growth in Web-based sales and marketing.
Data mining, when done correctly, can improve understanding and provide insight, but data mining just doesn’t work under stupid assumptions. Check out the following paragraph in a Wall Street Journal blog. Apparently some FBI agents assume hummus sales to be predictive of terrorist activity.
The FBI obtained and mined sales data that San Francisco-area grocery stores collected in 2005 and 2006, according to CQ Politics. The agents were looking for a sudden spike in hummus sales that might indicate an Iranian sleeper cell in the Bay Area. An FBI higher-up killed the program, CQ Politics reports, on the grounds that it might be illegal and that it was, well, just ridiculous.