A PDF is built on *PostScript*, which is a Page Description Language. PostScript is a text-based, Turing-complete programming language that uses a stack mechanism to render text and images to an output device, either a printed page or a screen. (The page content inside a PDF uses a simplified set of PostScript-style drawing operators, without the general programming constructs.) It works by placing elements on the defined page: glyphs (single characters from a font), images, lines, and shapes. PostScript fonts are complex sets of glyphs, and they also get embedded into the file if the output device does not already have those fonts built in.
However, the mechanism by which the PostScript is generated is critical. Back in the 1990s we spent a lot of time looking at PostScript generated by WordPerfect, the leading word processor of the time. Its output wasn't publication ready. In particular, the kerning (character-to-character spacing) was really bad. Looking at the output of the PostScript printer driver, we could see that every letter on the page was being placed as an individual glyph, and the glyphs were not always emitted in reading order. My boss spent weeks creating specific kerning settings for all the approved fonts to meet the publication requirements.
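To make the "individual glyphs, not always in order" problem concrete, here is a minimal Python sketch of how an extractor might try to recover reading order from separately placed glyphs. The `Glyph` type, the `reading_order` function, the sample coordinates, and the 2-point line tolerance are all made up for illustration; this is not how any particular tool does it.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position in points
    y: float  # vertical position in points (origin at bottom-left of the page)


def reading_order(glyphs, line_tolerance=2.0):
    """Group glyphs into lines by similar y, then sort each line left to right."""
    lines = []  # each entry is (y of the line, list of glyphs on it)
    # Larger y is higher on the page, so visit glyphs from top to bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    return ["".join(g.char for g in sorted(row, key=lambda g: g.x))
            for _, row in lines]


# Glyphs in the order a (hypothetical) driver emitted them: not reading order.
glyphs = [
    Glyph("k", 78.0, 700.0),
    Glyph("H", 72.0, 720.0),
    Glyph("o", 72.0, 700.0),
    Glyph("i", 80.5, 720.0),
]
print(reading_order(glyphs))
```

Even this toy version has to guess at line grouping, and it recovers no word spaces at all; a real extractor would also have to infer spaces from the gaps between glyphs, which is exactly where bad kerning hurts.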
Scanning to PDF is likely to just embed a bitmap image onto each page of the PDF, with no text in the file at all.
If you are really lucky, the PDF generator will actually emit text strings, each with an instruction to render it to the page in a specific font at a specific place. That is the easiest way to use PostScript, and it allows text extraction. But it is not guaranteed. So unless you control the PostScript generator, converting PDF to text (or another editable format) is really hard. Generally, it is probably easier to render the PDF page to an image and then use Optical Character Recognition (OCR) to recreate the text.
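For the lucky case, here is a rough Python sketch of what "text strings with an instruction to render it in a specific font at a specific place" looks like and how it could be pulled out. The operators `BT`, `Tf`, `Td`, `Tj`, and `ET` are standard PDF text operators, but the sample stream, the font name `F1`, and the `extract_text_runs` helper are invented for illustration. The sketch assumes one operator per line and an uncompressed stream, and it ignores escaped parentheses, `TJ` arrays, hex strings, and font encodings, all of which real files use.

```python
import re

# A hand-written, uncompressed content stream fragment (illustrative only).
SAMPLE_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
ET
"""

FONT = re.compile(r"/(\S+)\s+([\d.]+)\s+Tf")        # /FontName size Tf
MOVE = re.compile(r"(-?[\d.]+)\s+(-?[\d.]+)\s+Td")  # dx dy Td
SHOW = re.compile(r"\((.*?)\)\s*Tj")                # (string) Tj


def extract_text_runs(stream):
    """Return each shown string with the font, size, and position in effect."""
    runs = []
    font, size, x, y = None, None, 0.0, 0.0
    for line in stream.splitlines():
        if line.strip() == "BT":
            x, y = 0.0, 0.0  # each text object starts at the origin
        elif m := FONT.search(line):
            font, size = m.group(1), float(m.group(2))
        elif m := MOVE.search(line):
            # Td moves relative to the start of the current line.
            x, y = x + float(m.group(1)), y + float(m.group(2))
        elif m := SHOW.search(line):
            runs.append({"text": m.group(1), "font": font,
                         "size": size, "x": x, "y": y})
    return runs


for run in extract_text_runs(SAMPLE_STREAM):
    print(run)
```

When the generator writes whole strings like this, extraction is close to trivial. When it writes one glyph at a time, uses custom font encodings, or only embeds images, an approach like this returns fragments or nothing, which is why OCR on the rendered page is often the more practical route.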