ELI5: How were things like old books digitized for the internet? Did someone scan each page? Did they just retype it word for word?


I was thinking about public domain books and it got me wondering.


5 Answers

Anonymous

You can get whole-book scanners that flip each page automatically and scan it. They use different technologies at different price points, trading off scanning speed against how much damage they do to the book. Some require a curator to operate them so the book is handled as gently as possible, while others can flip through a book faster than your eye can follow and might tear some pages while doing so.

We have been scanning books this way for quite some time. At first, people actually retyped books by hand, at least some of them. This is still done for a lot of older handwritten material. If you do any research into historical works, you will find a lot of census records, church records, log books, etc. published as raw images, where you are expected to interpret the writing yourself and then help others by retyping the content in order to digitize it.

But for printed books and neatly written books, we have algorithms that can interpret the text automatically: so-called Optical Character Recognition (OCR). These have gotten better as more training data has become available to train and test them on. A big leap forward came from reCAPTCHA. It combined the concept of a CAPTCHA, which verifies whether a user is a human or a bot by showing a picture of some mangled characters and having the user type them, with the problem of generating training data for OCR. By presenting users with difficult scans from books and comparing each answer to the answers others had given, they were able to generate a huge set of training data for their OCR algorithm. The company was then bought by Google for this technology and the training data.
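To make the "compare answers from different people" idea concrete, here is a rough Python sketch. It is not how reCAPTCHA was actually implemented; the function name `consensus_label`, the thresholds `min_votes` and `min_agreement`, and the sample answers are all made up for illustration. The idea is just that a scanned word only becomes training data once enough people have typed the same thing for it.

```python
from collections import Counter


def consensus_label(answers, min_votes=3, min_agreement=0.6):
    """Return the agreed transcription, or None if there is no consensus yet.

    min_votes and min_agreement are assumed thresholds, chosen for illustration.
    """
    # Normalize so trivial differences in case or spacing do not split the vote.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough people have seen this scan yet
    text, votes = Counter(cleaned).most_common(1)[0]
    if votes / len(cleaned) >= min_agreement:
        return text
    return None  # people disagree too much; keep showing the scan


# Hypothetical answers typed by different users for three hard-to-read scans.
submissions = {
    "scan_001": ["morning", "Morning ", "morning", "mourning"],
    "scan_002": ["upon", "upor"],
    "scan_003": ["their", "thier", "then", "there"],
}

# Only scans with a clear consensus become labeled training examples.
training_data = {}
for scan_id, answers in submissions.items():
    label = consensus_label(answers)
    if label is not None:
        training_data[scan_id] = label

print(training_data)
```

Here only `scan_001` ends up labeled: three of its four answers agree. `scan_002` has too few answers and `scan_003` has no agreement, so both would keep being shown to more users.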
