ELI5: Why are PDF files “madness inside”?


I made a passing comment asking how hard it would be to write a Discord bot that converts a PDF file to another file format (for our TTRPG game), and one of the players said, “Hell, because PDFs are madness inside.”

Can someone explain to me why PDFs are so weird?

Edit: a typo


12 Answers

Anonymous 0 Comments

A PDF is built on *PostScript*, which is a page description language. PostScript is a text-based programming language (Turing complete) that uses a stack mechanism to render text and images to an output device, either a printed page or a screen. It does this by placing elements on the defined page: glyphs (single characters from a font), images, lines, and shapes. PDF page content uses a stripped-down, PostScript-like set of these drawing operators without the general programming features, so strictly speaking a PDF descends from PostScript rather than containing it. Fonts are complex sets of glyphs that also get embedded into the file when the output device cannot be assumed to have them installed.
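
To make the stack mechanism concrete, here is a toy sketch in Python. It is not a real PostScript interpreter: it only understands numbers, `(string)` literals, and three operators (`add`, `moveto`, `show`), and unlike real PostScript its `show` does not advance the current point. The function name `run` and the demo program are made up for illustration.

```python
import re

# A token is either a (parenthesised string) or a run of non-space characters.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|[^\s()]+")

def run(program):
    """Interpret a tiny PostScript-like program; return a list of (x, y, text)."""
    stack = []          # operand stack: operands are pushed, operators pop them
    placed = []         # text placed on the "page"
    pos = (0.0, 0.0)    # current point
    for tok in TOKEN.findall(program):
        if tok.startswith("("):
            stack.append(tok[1:-1])          # string literal, parentheses stripped
        elif tok == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "moveto":
            y, x = stack.pop(), stack.pop()  # y was pushed last, so it pops first
            pos = (x, y)
        elif tok == "show":
            placed.append((pos[0], pos[1], stack.pop()))
        else:
            stack.append(float(tok))         # anything else is assumed to be a number
    return placed

demo = "72 700 20 add moveto (Hello) show 72 700 moveto (world) show"
print(run(demo))
```

Note that operands come before the operator (`72 720 moveto`), which is what "stack based" means in practice: nothing happens until an operator consumes what was pushed before it.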

However, the mechanism by which the PostScript is generated is critical. Back in the 90s we spent a lot of time looking at PostScript generated by WordPerfect, the leading word processor of the time. Its output wasn’t publication ready. In particular, the kerning (character-to-character spacing) was really bad. Looking at the output of the PostScript printer driver, we could see that for every letter on the page, an individual glyph was being placed, and they were not always in order. My boss spent weeks creating specific kerning settings for all the approved fonts to meet the publication requirements.
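
This is also why extraction tools have to guess. When every letter is its own positioned glyph, in arbitrary order, the reader has to rebuild lines and word gaps from coordinates alone. A rough sketch of that guesswork follows; the `(x, y, width, char)` tuple layout and the two tolerance values are assumptions chosen for the example, not anything a PDF guarantees.

```python
def glyphs_to_text(glyphs, line_tol=2.0, space_gap=3.0):
    """Rebuild text from individually placed glyphs.

    glyphs: list of (x, y, width, char), with y growing upward as in PDF.
    line_tol: how far baselines may differ and still count as one line.
    space_gap: horizontal gap that is treated as a word space.
    """
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: -g[1]):       # top of page first
        if lines and abs(lines[-1][0] - g[1]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g[1], [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])                    # left to right
        text, prev_end = "", None
        for x, _, width, ch in row:
            if prev_end is not None and x - prev_end > space_gap:
                text += " "                             # a gap this wide is probably a space
            text += ch
            prev_end = x + width
        out.append(text)
    return "\n".join(out)

# Glyphs deliberately out of order, as a printer driver might emit them.
shuffled = [
    (16, 700, 6, "y"), (0, 700, 6, "H"), (6, 686, 6, "k"),
    (6, 700, 4, "i"), (22, 700, 6, "o"), (0, 686, 6, "o"),
]
print(glyphs_to_text(shuffled))
```

Pick the tolerances badly and words run together or split apart, and columns, tables, and rotated text break this simple approach entirely.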

Scanning to PDF is likely to just embed a bitmap image on each page of the PDF.

If you are really lucky, the PDF will actually contain text strings, each with an instruction to render it on the page in a specific font at a specific place. That is the easiest way to use PostScript, and it allows text extraction. But it is not guaranteed. So unless you control the generator, converting PDF to text (or another editable format) is really hard. Often it is easier to render the PDF page to an image and then use Optical Character Recognition (OCR) to recover the text.
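
For a sense of what the lucky case looks like, here is a hand-written page content stream using standard PDF text operators (`BT`/`ET` begin and end a text object, `Tf` selects font and size, `Td` moves the text position, `Tj` shows a string), plus a deliberately naive extractor. The sample text and the function name are made up. It assumes an uncompressed stream with plain `(…) Tj` strings; real files usually compress their streams and may use hex strings, `TJ` arrays, or custom font encodings, all of which this sketch ignores.

```python
import re

# A hand-written example of a "friendly" page content stream.
CONTENT_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Roll for initiative) Tj
0 -14 Td
(Armor Class: 15) Tj
ET
"""

# Match a literal string (allowing backslash escapes) followed by the Tj operator.
SHOW_TEXT = re.compile(r"\(((?:[^()\\]|\\.)*)\)\s*Tj")

def extract_simple_text(stream):
    """Return the strings shown with Tj, in the order they appear in the stream."""
    return [m.group(1) for m in SHOW_TEXT.finditer(stream)]

print(extract_simple_text(CONTENT_STREAM))
```

When a file is not this friendly, an established PDF library or the OCR route is likely a better bet for a bot than hand-rolled parsing.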
