reCAPTCHA
Last week I saw Luis von Ahn give a talk about reCAPTCHA, his latest project to stop spammers and do OCR on scanned books at the same time. Ordinary CAPTCHAs, used by sites like Ticketmaster and Yahoo, show you an image of a random string of characters distorted so that humans can read it but computer programs can’t.
reCAPTCHA instead takes words from old books that have been scanned in, specifically words where the OCR program had low confidence in its transcription. It shows one such word to a user, together with a different word that the OCR transcribed correctly.
If the user enters the known word correctly, they are assumed to be human. The user’s transcription of the unknown word is then used as a gold-standard transcription. I believe Luis said that if they require two people to transcribe a given word in this way, the accuracy is above 99 percent.
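To make the flow concrete, here is a minimal sketch of the check described above. It is my own illustration, not reCAPTCHA's actual code: the class name, the word IDs, and the case-insensitive matching are all assumptions, and the agreement threshold of two comes from the figure Luis mentioned.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Compare answers case-insensitively and ignore stray spaces (an assumption)."""
    return text.strip().lower()


class TranscriptionPool:
    """Hypothetical bookkeeping for words the OCR could not read confidently."""

    def __init__(self, required_agreement=2):
        self.required_agreement = required_agreement
        self.votes = defaultdict(Counter)  # unknown word id -> guess -> count
        self.accepted = {}                 # unknown word id -> agreed transcription

    def submit(self, known_answer, unknown_id, typed_known, typed_unknown):
        """Return True if the user passes as human; record their guess if so."""
        # The known (control) word decides whether the user is treated as human.
        if normalize(typed_known) != normalize(known_answer):
            return False

        # Only users who passed the control word get to vote on the unknown word.
        if unknown_id not in self.accepted:
            guess = normalize(typed_unknown)
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.required_agreement:
                self.accepted[unknown_id] = guess
        return True


pool = TranscriptionPool()
pool.submit("morning", "word-17", "morning", "upon")   # first human vote
pool.submit("river", "word-17", "rivre", "open")       # fails control, ignored
pool.submit("river", "word-17", "river", "Upon")       # second matching vote
print(pool.accepted)
```

In this sketch a word is only accepted once two users who both passed their control word agree on it, which is how I understood the two-person rule.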
Apparently users on the internet solve something like 60,000,000 CAPTCHAs a day, and having a word transcribed costs around 0.5 cents, so this project is producing a large number of book transcriptions that wouldn’t otherwise be possible. The resulting text helps people with bad eyesight and makes a great training corpus for OCR research.
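As a rough back-of-the-envelope check on those numbers, the snippet below multiplies them out. It assumes, purely for illustration, that every one of those CAPTCHAs were a reCAPTCHA carrying one unknown word, which is an upper bound rather than anything Luis claimed; the variable names are mine.

```python
captchas_per_day = 60_000_000   # figure quoted in the talk
cost_per_word_cents = 0.5       # approximate cost to have one word transcribed
people_per_word = 2             # two users must agree on each word

# Upper bound: value of the typing if each CAPTCHA counted as one paid transcription.
daily_value_dollars = captchas_per_day * cost_per_word_cents / 100

# With two people needed per word, the number of finished words is halved.
words_per_day = captchas_per_day // people_per_word

print(f"${daily_value_dollars:,.0f} per day")     # prints "$300,000 per day"
print(f"{words_per_day:,} words per day")         # prints "30,000,000 words per day"
```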
The beauty of this project is that it will always be one step ahead of OCR programs. CAPTCHAs have had to get tougher and tougher over the years as OCR systems have gotten better. But as long as Luis is using an OCR program that is near the state of the art, any word his program can’t transcribe correctly is an impossible task for other OCR programs too.
I’ve set it up so that if you try to enter a comment on this blog, you can see reCAPTCHA at work.
February 27th, 2008 at 3:57 pm
You have to actually enter a comment to try the reCAPTCHA.
June 4th, 2008 at 8:29 pm
Thanks for the insight into what recaptcha was doing. I ran into their test on another site earlier today and couldn’t figure out why they were showing an impossible image - on one of the words the first two (three??) letters were *clearly* not in the alphabet! I guess that explains why the original ocr wasn’t confident.