Monday, June 2, 2014

CAPTCHA

Did you know…
I was thinking of starting a new topic line that would be filled with interesting and, in my opinion, not widely known facts. Just things I’ve heard that have caused me to scratch my head and wonder how many other people out there didn’t know that, or made me wonder how I could have NOT known whatever that was.

My first “Did you know…” topic is about the CAPTCHA. Has anybody ever heard this word? I’m sure anybody who has tried to make an online purchase has come across a CAPTCHA, though, if they are like me, they never gave a thought to its name and only vaguely understood its purpose.
Luis von Ahn is one of the computer scientists who created the CAPTCHA. I heard his TED talk, Why We Collaborate, this morning and was amazed at what I didn’t know. I thought this thing was just a way for websites to determine that the entity filling out an online form was actually a human being and not a computer program on autopilot submitting the form millions of times. And that is true, and that was the purpose behind the CAPTCHA’s creation. Luis von Ahn says, “The reason it works is because humans, at least non-visually-impaired humans, have no trouble reading these distorted squiggly characters, whereas computer programs simply can't do it as well yet. So for example, in the case of Ticketmaster, the reason you have to type these distorted characters is to prevent scalpers from writing a program that can buy millions of tickets, two at a time.” It was just a security step. I accepted it, deciphered the scribbled words, and moved on. Not a second thought whatsoever.

Well, did you know… that every time a person fills in the box, they are helping to transcribe physical books into digital ones?

What? Huh? Are you kidding me?

So, turns out, approximately 200 million CAPTCHAs are typed every day by people around the world. It takes a person about 10 seconds to type a CAPTCHA, which adds up to roughly 500,000 man-hours a day, wasted. Von Ahn thought those hours could be put to better use. Amazingly, he created another program, called reCAPTCHA, that turns those decoded squiggly letters into digitized books.
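For anyone who wants to check that math, here is a quick back-of-the-envelope sketch in Python. The two input numbers are the ones quoted above; the variable names are just mine.

```python
# Figures quoted from the talk; the names are my own.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600

# Total human time spent typing CAPTCHAs in one day, in hours.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR

print(f"{hours_per_day:,.0f} hours per day")  # prints "555,556 hours per day"
```

So the "500,000 hours" figure is a rounded-down version of the exact result.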

Here is an excerpt from his TED talk:
“So what you may not know is that nowadays while you're typing a CAPTCHA, not only are you authenticating yourself as a human, but in addition you're actually helping us to digitize books. So let me explain how this works. So there's a lot of projects out there trying to digitize books. Google has one. The Internet Archive has one. Amazon, now with the Kindle, is trying to digitize books. Basically the way this works is you start with an old book. You've seen those things, right? Like a book? (Laughter) So you start with a book, and then you scan it. Now scanning a book is like taking a digital photograph of every page of the book. It gives you an image for every page of the book. This is an image with text for every page of the book. The next step in the process is that the computer needs to be able to decipher all of the words in this image. That's using a technology called OCR, for optical character recognition, which takes a picture of text and tries to figure out what text is in there. Now the problem is that OCR is not perfect. Especially for older books where the ink has faded and the pages have turned yellow, OCR cannot recognize a lot of the words. For example, for things that were written more than 50 years ago, the computer cannot recognize about 30 percent of the words. So what we're doing now is we're taking all of the words that the computer cannot recognize and we're getting people to read them for us while they're typing a CAPTCHA on the Internet.”

Crazy, right? Did you know that?

“So the next time you type a CAPTCHA, these words that you're typing are actually words that are coming from books that are being digitized that the computer could not recognize. And now the reason we have two words nowadays instead of one is because, you see, one of the words is a word that the system just got out of a book, it didn't know what it was, and it's going to present it to you. But since it doesn't know the answer for it, it cannot grade it for you. So what we do is we give you another word, one for which the system does know the answer. We don't tell you which one's which, and we say, please type both. And if you type the correct word for the one for which the system already knows the answer, it assumes you are human, and it also gets some confidence that you typed the other word correctly. And if we repeat this process to like 10 different people and all of them agree on what the new word is, then we get one more word digitized accurately.”
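The two-word trick he describes can be sketched in a few lines of Python. To be clear, this is only my own toy illustration of the idea, not reCAPTCHA's actual code: the names (`UnknownWord`, `submit`, `accepted_text`) are made up, the threshold of 10 comes from his "like 10 different people" remark, and accepting a word once the same answer has been typed 10 times is my simplification of "all of them agree."

```python
from collections import Counter
from dataclasses import dataclass, field

AGREEMENT_THRESHOLD = 10  # assumption, based on "like 10 different people"


def normalize(text):
    """Ignore case and surrounding spaces when comparing typed words."""
    return text.strip().lower()


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read (hypothetical structure)."""
    image_id: str
    guesses: list = field(default_factory=list)

    def accepted_text(self, threshold=AGREEMENT_THRESHOLD):
        """Return the agreed reading once one answer has been typed
        `threshold` times; otherwise return None."""
        if not self.guesses:
            return None
        answer, count = Counter(self.guesses).most_common(1)[0]
        return answer if count >= threshold else None


def submit(control_answer, typed_control, unknown, typed_unknown):
    """Grade one attempt. The user passes as human only if the control
    word (the one the system already knows) is typed correctly; only then
    is the guess for the unknown word recorded."""
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the known word, so the other guess is discarded
    unknown.guesses.append(normalize(typed_unknown))
    return True


# Ten people see the same unknown word next to a known control word.
word = UnknownWord(image_id="scan-001")
for _ in range(10):
    submit("morning", "morning", word, "overlooks")
print(word.accepted_text())
```

The key point is that the system never tells you which word is the control word, so you have to make an honest attempt at both.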

Von Ahn says that so many websites have incorporated his reCAPTCHA program that 100 million words a day are being captured, which adds up to 2.5 million books a year.

Duolingo is another one of von Ahn’s crazy program ideas. It’s a language learning program. Ever hear of it? If not, check it out. It’s free. I’m using it to learn Spanish and it’s pretty cool. I actually enjoy it, almost like a game (give up Angry Birds and install Duolingo!). The crazy part… the program uses the students’ responses to translate the internet. Yep! Translate the entire internet! Crazy!

