Did you know…
I was thinking of starting a new topic line that would be
filled with interesting and, in my opinion, not widely known facts. Just things
I’ve heard that have caused me to scratch my head and wonder how many other
people out there didn’t know that, or made me wonder how I could have NOT known
whatever that was.
My first “Did you know…” topic is about the CAPTCHA. Has
anybody ever heard this word? I’m sure anybody who has tried to make an online
purchase has come across a CAPTCHA, though, if they are like me, they never gave a
thought to its name and only vaguely understood its purpose.
Luis von Ahn is one of the computer programmers who created the
CAPTCHA. I heard his TED talk, Why We Collaborate, this morning and was amazed at
what I didn’t know. I thought this thing was just a way for websites to
determine that the entity filling out an online form was actually a human being
and not a computer program that was on autopilot submitting the form millions
of times. And that is true, and that was the purpose behind the CAPTCHA’s
creation. Luis von Ahn says, “The reason it works is because humans, at least
non-visually-impaired humans, have no trouble reading these distorted squiggly
characters, whereas computer programs simply can't do it as well yet. So for
example, in the case of Ticketmaster, the reason you have to type these
distorted characters is to prevent scalpers from writing a program that can buy
millions of tickets, two at a time.” It was just a security step. I accepted
it, deciphered the scribbled words, and moved on. Not a second thought
whatsoever.
Well, did you know… that every time a person fills in the
box, they are helping to transcribe physical books into digital ones?
What? Huh? Are you kidding me?
So, it turns out, approximately 200 million CAPTCHAs are typed
every day by people around the world. It takes a person about 10 seconds to type
a CAPTCHA, which adds up to roughly 500,000 hours a day, wasted. Von Ahn thought
those hours could be put to better use. Amazingly, he created another program,
called reCAPTCHA, that turns those decoded squiggly letters into digitized
books.
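If you want to check that math yourself, here is a quick back-of-the-envelope sketch in Python. The two input numbers are the ones quoted above; the function name is just something I made up for illustration. The exact result comes out a bit above 550,000 hours, so the 500,000 figure is a rounded-down estimate.

```python
# Rough check of the "hours per day" figure, using the numbers quoted in the post.
CAPTCHAS_PER_DAY = 200_000_000   # approximate figure from the talk
SECONDS_PER_CAPTCHA = 10         # approximate figure from the talk


def hours_spent_per_day(captchas=CAPTCHAS_PER_DAY, seconds_each=SECONDS_PER_CAPTCHA):
    """Total human time spent typing CAPTCHAs in one day, in hours."""
    return captchas * seconds_each / 3600  # 3600 seconds in an hour


hours = hours_spent_per_day()
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```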
Here is an excerpt from his TED talk:
“So what you may not know is that
nowadays while you're typing a CAPTCHA, not only are you authenticating
yourself as a human, but in addition you're actually helping us to digitize
books. So let me explain how this works. So there's a lot of projects out there
trying to digitize books. Google has one. The Internet Archive has one. Amazon,
now with the Kindle, is trying to digitize books. Basically the way this works is
you start with an old book. You've seen those things, right? Like a book? (Laughter)
So you start with a book, and then you scan it. Now scanning a book is like
taking a digital photograph of every page of the book. It gives you an image
for every page of the book. This is an image with text for every page of the
book. The next step in the process is that the computer needs to be able to
decipher all of the words in this image. That's using a technology called OCR, for
optical character recognition, which takes a picture of text and tries to
figure out what text is in there. Now the problem is that OCR is not perfect. Especially
for older books where the ink has faded and the pages have turned yellow, OCR
cannot recognize a lot of the words. For example, for things that were written
more than 50 years ago, the computer cannot recognize about 30 percent of the
words. So what we're doing now is we're taking all of the words that the
computer cannot recognize and we're getting people to read them for us while
they're typing a CAPTCHA on the Internet.”
Crazy, right? Did you
know that?
“So the next time you type a
CAPTCHA, these words that you're typing are actually words that are coming from
books that are being digitized that the computer could not recognize. And now
the reason we have two words nowadays instead of one is because, you see, one
of the words is a word that the system just got out of a book, it didn't know what
it was, and it's going to present it to you. But since it doesn't know the
answer for it, it cannot grade it for you. So what we do is we give you another
word, one for which the system does know the answer. We don't tell you which
one's which, and we say, please type both. And if you type the correct word for
the one for which the system already knows the answer, it assumes you are
human, and it also gets some confidence that you typed the other word
correctly. And if we repeat this process to like 10 different people and all of
them agree on what the new word is, then we get one more word digitized
accurately.”
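For the programmers out there, here is my own rough sketch in Python of the two-word trick described in that excerpt. To be clear, this is not von Ahn's actual code: the function names, the sample words, and the exact threshold of 10 are all my assumptions, based only on what he says in the talk.

```python
from collections import Counter

# "like 10 different people" in the talk; the exact number is an assumption.
AGREEMENT_NEEDED = 10


def grade_response(control_answer, typed_control, typed_unknown, votes):
    """Grade one person's answer using the word the system already knows.

    If the control word is typed correctly, the person is treated as human
    and their reading of the unknown word is recorded as one vote.
    """
    if typed_control.strip().lower() != control_answer.lower():
        return False  # failed the known word, so the other word is ignored
    votes[typed_unknown.strip().lower()] += 1
    return True


def digitized_word(votes, needed=AGREEMENT_NEEDED):
    """Return the agreed-upon word once enough people all typed the same thing."""
    if len(votes) == 1:  # everybody so far agrees on a single reading
        word, count = next(iter(votes.items()))
        if count >= needed:
            return word
    return None  # not enough votes yet, or people disagree


# Hypothetical example: "morning" is the known word, "upon" is the unknown one.
votes = Counter()
for _ in range(10):
    grade_response("morning", "morning", "upon", votes)
print(digitized_word(votes))  # prints "upon"
```

The point of the sketch is that the system never needs to know the answer to the unknown word up front. It only trusts a reading after people who passed the known word keep typing the same thing.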
Von Ahn says that so
many websites have incorporated his reCAPTCHA program that 100 million words a
day are being captured, which adds up to 2.5 million books a year.
Duolingo is another one of von Ahn’s crazy program ideas. It’s
a language-learning program. Ever hear of it? If not, check it out. It’s free.
I’m using it to learn Spanish and it’s pretty cool. I actually enjoy it, almost
like a game (give up Angry Birds and install Duolingo!). The crazy part… the
program uses the student’s responses to translate the internet. Yep! Translate
the entire internet! Crazy!
You can find the whole TED talk at http://www.ted.com/talks/luis_von_ahn_massive_scale_online_collaboration