Zawodny linked to this post making fun of MSN Search for bragging about stemming. The article says:
Understanding “driving” vs. “drive”? That’s a pretty basic problem called stemming. How basic? Put it this way, there’s been open source code out there to do this since before there was a Microsoft Search. Fuck, even the Wikipedia page has existed since 2007. All they had to do was go there. I know Microsoft is big about “eating their own dogfood”, but damn, use Google just this once to find it.
Now let’s assume for a second, they’ve had stemming and it’s a problem that’s actually harder, something like knowing “car” and “automobile” are the same thing. Now, granted that’s harder, but here’s the thing: when I transferred down to Yahoo Search Marketing, some 4-5 years ago, they already had this technology. The process was called canonicalization, and by the time I got there, it was an old, well-established piece of technology. So well-established that despite being fairly technical, everyone in the company (business, product, support, etc.) knew what it was, at least at a high level, its purpose and so on.
This is ironic for many reasons. First of all, if I worked for Yahoo Search Marketing a few years ago, I would not be bragging about its relevance algorithms. Second, when did Yahoo Search start doing non-trivial stemming? Trust me, it wasn’t that long ago. And it’s not because people didn’t know what stemming was.
I find myself making two points about stemming and relevance over and over, and I want to make them here once and for all:
(1) Just because someone is stemming/canonicalizing/spell-correcting in their presentation of search results does not mean they are stemming in their retrieval algorithm.
People always try to check “does Google do stemming?” by searching for “cars” and checking if a result has the word “car” highlighted. Many of my coworkers at Powerset and Yahoo who actually work on search engines have made this mistake.
Think about it like this: you are searching over 10 billion documents and often matching and ranking millions of them. Then you are highlighting 10 documents. Don’t you think that you might have different algorithms for displaying the results?!
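To make point (1) concrete, here is a minimal sketch of a toy engine whose retrieval is exact-match while its highlighter stems. Everything in it is an assumption for illustration: `toy_stem` is a hypothetical strip-one-trailing-s rule, not a real stemmer, and `retrieve` and `highlight` are made-up names, not anyone's production code.

```python
import re

def toy_stem(word):
    """Hypothetical stemmer: lowercase and strip a single trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def retrieve(query, docs):
    """Retrieval step: exact token match only, no stemming at all."""
    q = query.lower()
    return [d for d in docs if q in re.findall(r"\w+", d.lower())]

def highlight(query, doc):
    """Display step: mark every token that shares the query's toy stem."""
    target = toy_stem(query)

    def mark(match):
        token = match.group(0)
        return f"*{token}*" if toy_stem(token) == target else token

    return re.sub(r"\w+", mark, doc)

docs = ["Used cars and one car for sale", "A car review"]
hits = retrieve("cars", docs)            # only the first document matches
shown = [highlight("cars", d) for d in hits]
```

The second document never comes back for [cars], yet "car" is highlighted in the first one. Seeing a highlighted "car" tells you about the display code and nothing about retrieval.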
(2) Stemming keyword search is really dangerous. Getting an overall relevance improvement is really hard.
One reasonable-sounding engineering strategy is to index the stemmed form of every word you see on the web and then search for the stemmed form of the query. This would be a complete disaster.
If you get a search for [cars] and you change the search to [cars OR car] it’s no big deal, but what about when someone searches [aids] and all they can get back is documents about [aid]? How do they undo it? You can always change your search for cars to [car] with stemming off, but how can you search for just [aids] with stemming on? There is a + operator that works for the major search engines but how many users know about it?
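The difference between the two strategies can be sketched with a toy inverted index. This is an illustration under stated assumptions, not how any real engine works: `toy_stem` is a hypothetical strip-one-trailing-s rule, the two documents are invented, and the `+` handling only imitates the operator mentioned above.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stemmer: strip a single trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

docs = ["foreign aid budget", "aids research funding"]

# Strategy A: only stemmed forms are indexed, so the surface form is lost.
stemmed_index = build_index(docs, toy_stem)

def search_stemmed(term):
    return stemmed_index.get(toy_stem(term), set())

# Strategy B: index surface forms and expand the query instead,
# i.e. [aids] becomes [aids OR aid]; a leading '+' turns expansion off.
exact_index = build_index(docs, lambda token: token)

def search_expanded(term):
    if term.startswith("+"):
        return exact_index.get(term[1:], set())
    variants = {term, toy_stem(term)}
    return set().union(*(exact_index.get(v, set()) for v in variants))
```

With strategy A, `search_stemmed("aids")` returns both documents and there is nothing the user can type to get only the second one. With strategy B, `search_expanded("+aids")` still can.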
People who don’t work on relevance for web-scale search engines often don’t realize just how amazingly useful anchor text is. Anchor text for a given page is the text in the HTML links from other pages linking to it. For a popular page (and these are the most important pages for a web search engine to return for keyword queries) it’s like tens or hundreds of different people sat down and annotated the page with titles. The best ten pages about cars most definitely have hundreds of links with the words car, cars, automobile, autmobile, etc. That is why stemming and canonicalization will never help you on a one or two word web query: the variants are already attached to the page.
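As a rough sketch of why anchor text covers the variants for free, imagine collecting the words of every inbound link per target page. The link data and URL below are invented for illustration, and `anchor_terms` is a hypothetical helper, not part of any real crawler.

```python
from collections import Counter, defaultdict

# Hypothetical (target page, anchor text) pairs harvested from other pages.
links = [
    ("example.com/cars", "cars"),
    ("example.com/cars", "car reviews"),
    ("example.com/cars", "automobile guide"),
    ("example.com/cars", "cars"),
]

def anchor_terms(links):
    """Count the anchor-text words pointing at each target page."""
    terms = defaultdict(Counter)
    for target, text in links:
        terms[target].update(text.lower().split())
    return terms

terms = anchor_terms(links)
page = terms["example.com/cars"]
# The page is reachable via "car", "cars", and "automobile" with no stemming.
matches = {word: page[word] > 0 for word in ("car", "cars", "automobile")}
```

Even with four links, the page is already annotated with the singular, the plural, and a synonym; a popular page has hundreds of such annotations.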
It will help you on an ad search query, which is why Yahoo Search Marketing had stemming long before Yahoo Web Search. I am sure that Google/Yahoo/MSN all continue to apply far more stemming to the query that gets sent to the advertising servers than to the one sent to the web servers.