Somehow in the last couple of days I was talking to different people about automated ways of finding similar content. One was thinking about finding similar documents in a certain text domain (that I can’t reveal yet). Another was just thinking out loud about finding similar images (in a domain that… um… let’s just say makes a lot of money and is consistently an early adopter of video technologies…).
The interesting thing is that both people have some computer science background but no specialty in pattern recognition or statistics. They’ve heard of cases where finding similar documents or images has worked, so they imagine that the general problem has been solved, and that there shouldn’t be any difficulty applying it to their specific application.
The fact of the matter is much more nuanced than this. It’s quite easy to invent some heuristic measure that calculates the similarity of two documents. You can apply it to your document set, and in fact it does return documents that are similar in some way…
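To show how little it takes, here’s a minimal sketch of one such heuristic: the Jaccard overlap between the sets of words in two documents. The function name and the toy documents are made up for illustration, and this is just one of many heuristics you could invent, not a recommended measure.

```python
import re


def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Fraction of distinct words shared by the two documents (0.0 to 1.0)."""
    words_a = set(re.findall(r"[a-z']+", doc_a.lower()))
    words_b = set(re.findall(r"[a-z']+", doc_b.lower()))
    if not words_a and not words_b:
        return 0.0  # two empty documents: define similarity as zero
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical toy documents, purely for illustration.
docs = {
    "recipe_1": "Whisk the eggs and fold in the flour",
    "recipe_2": "Fold the flour into the eggs and whisk",
    "news_1": "The central bank raised interest rates again",
}

for name, text in docs.items():
    score = jaccard_similarity(docs["recipe_1"], text)
    print(f"recipe_1 vs {name}: {score:.2f}")
```

It runs, and it returns scores that look plausible. Whether those scores mean what you want them to mean is a separate question.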
Now, if the heuristic turns out to measure similarity in exactly the way you’ve intended it, then life is great. You give the heuristic some fancy name to impress your boss or the media, and you’re done.
Unfortunately, all too often there’s more than one way to consider similarity, and your measure ends up being influenced by all these other ways. Take text documents for example. People tend to think of similarity as topical similarity, but documents can in fact be similar for other reasons. They may have similar length, and they may have similar style if they’re from the same author, publisher, or genre (news, blogs, academic papers, etc.). Take facial images as another example. Two images can be considered similar if they’re taken under similar lighting conditions, even when they show completely different subjects. Humans have a natural bias to ignore those kinds of similarity, but objectively they’re as legitimate as any other way of considering similarity.
Most of the development work will actually be in tweaking the similarity measure so that it’s invariant to all these other kinds of similarity. You may need to normalize the input (against document length or lighting conditions), and you may have to consider a totally different set of input features that are more robust. Successful cases in restricted domains (e.g., finding similar articles within a news site) are certainly good starting points for your heuristic, but don’t underestimate the potential amount of work needed to adapt them to a related problem domain.
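As a rough illustration of the length problem, here’s a small sketch comparing a raw dot product of word counts with cosine similarity, which divides out each document’s vector length. Cosine normalization is only one common choice, picked here as an assumption, and the helper names and example texts are hypothetical.

```python
import math
from collections import Counter


def term_counts(text: str) -> Counter:
    """Bag-of-words counts using naive whitespace tokenization."""
    return Counter(text.lower().split())


def dot_product(a: Counter, b: Counter) -> int:
    """Unnormalized overlap: grows with document length."""
    return sum(count * b[term] for term, count in a.items())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product divided by both vector lengths, so size cancels out."""
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical example texts.
query = term_counts("cheap flights to paris")
short_on_topic = term_counts("cheap paris flights")
# A long, off-topic document that happens to repeat one common word ("to").
long_off_topic = term_counts("to be or not to be that is the question " * 20)

for name, doc in [("short_on_topic", short_on_topic),
                  ("long_off_topic", long_off_topic)]:
    print(name,
          "dot:", dot_product(query, doc),
          "cosine:", round(cosine_similarity(query, doc), 3))
```

With the raw dot product, the long off-topic document scores higher simply because it’s long. With cosine similarity, the short on-topic one comes out ahead. Note that this only removes the length effect; style, genre, and the other confounds are still there.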