Eli5: how were things like old books digitized for the internet? Did someone scan each page? Did they just re-type it word for word?


I was thinking about public domain books and it got me wondering.

In: Technology

5 Answers

Anonymous 0 Comments

My guess would be first scanning, then converting the results to text using optical character recognition. Maybe have a human proofreader check over everything for formatting or glitches.

Anonymous 0 Comments

Digital scans, CAPTCHA codes, and OCR software.

Remember about a decade ago when all of CAPTCHA was "What word is this???" Those were words the digitizer wasn't sure about, so they were crowdsourced to everyone on the internet to figure out.

Anonymous 0 Comments

If you do it at a large scale (like in a library), there are fully automated book scanners that can flip pages and scan them.
So basically you lay a book on one, and come back some time later when the scan is finished. Then you have a bunch of images which you can easily convert into a text file.

Anonymous 0 Comments

Yep, scanning. There are specialized machines [like this](https://www.youtube.com/watch?v=cmhIJOqepVU) that can scan books pretty fast. Errors were corrected by tools such as reCAPTCHA that we’ve seen on just about every login page on the web.

Anonymous 0 Comments

You can get whole-book scanners that are able to flip each page and scan it. These use different technology at different prices based on how fast they scan and how much damage they do to the books. Some require a curator to operate them so they are as gentle with the books as possible, while others can flip through a book faster than your eye can see and might tear some of the pages while doing so.

We have been scanning books in this way for quite some time. At first people were actually manually retyping the books, at least some of them. And this is still done to a lot of older handwritten books. If you do any research into historical works you will find a lot of census records, church records, log books, etc. published as raw images, where you are expected to interpret the writing yourself and then help others by retyping the content in order to digitize it.

But for printed books and neatly written books we do have algorithms that can interpret the text automatically: so-called Optical Character Recognition, OCR. These have gotten better as we have accumulated more training data to train them on. A big leap forward here was made by reCAPTCHA. They combined the concept of a CAPTCHA, which verifies that a user is a human rather than a bot by presenting a picture of some mangled characters and having the user type them in, with the problem of generating training data for OCR. By presenting users with difficult scans from books and comparing their answers to those other users had given, they were able to generate a huge set of training data for their OCR algorithm. They were then bought by Google for this technology and the training data.
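The trick described above (pair a known "control" word with an unknown scanned word, and accept the unknown word once enough independent users agree) can be sketched roughly like this. This is a simplified illustration, not reCAPTCHA's actual code; the function name, the agreement threshold, and the plain-dictionary vote store are all made up for the example.

```python
def grade_challenge(control_answer, known_control, unknown_answer, votes, threshold=3):
    """Hypothetical sketch of a reCAPTCHA-style consensus check.

    The user sees two words: one the system already knows (the control)
    and one scanned word it cannot read. If the user gets the control
    right, they are treated as human and their guess for the unknown
    word is recorded as a vote. Once `threshold` independent users give
    the same answer, that answer is accepted as the transcription.

    Returns (is_human, accepted_transcription_or_None).
    """
    if control_answer.strip().lower() != known_control.lower():
        # Failed the known word: don't trust their guess for the unknown one.
        return False, None

    word = unknown_answer.strip().lower()
    votes[word] = votes.get(word, 0) + 1

    if votes[word] >= threshold:
        # Enough independent agreement: treat this as the correct reading.
        return True, word
    return True, None


# Simulated users transcribing a blurry scanned word:
votes = {}
print(grade_challenge("wrong", "upon", "morrow", votes))  # bot-like: vote ignored
print(grade_challenge("upon", "upon", "morrow", votes))   # human, 1 vote
print(grade_challenge("upon", "upon", "morrow", votes))   # human, 2 votes
print(grade_challenge("upon", "upon", "morrow", votes))   # human, 3 votes: accepted
```

The key design point is that the system never needs to know the unknown word in advance; agreement among users who independently passed the control word stands in for ground truth.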