eli5 why pdf files are “Madness inside.”


I made a passing comment asking how hard it would be to convert a PDF file to another file format by writing a Discord bot for it (for our TTRPG game), and one of the players said “Hell, because PDFs are madness inside.”

Can someone explain to me why pdfs are so weird?

Edit: a typo


12 Answers

Anonymous 0 Comments

I’ve had the pleasure of diving into the PDF format a bunch of years ago.

The first complication is that PDFs are not “text documents”; they are instructions on what to draw on a number of pages. They are much more closely related to an image than to a Word document. In a Word document you have the text plus some instructions on how to display it. Then Word does the heavy lifting of making the text fit the screen you display it on or the page you print it on. This is great for editing, not so great for printing. PDFs, on the other hand, were made for printing: they take all the heavy lifting Word has done to figure out where to break a line or start a new page and save the result, so the program can very efficiently tell the printer “this goes there, this goes there, etc.” This makes the format horribly inconvenient for everything else. For one, the words don’t have to be in the PDF in the same order that they appear on the page. This is also why selecting text in a PDF can be so wonky.
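To make the “instructions, not text” point concrete, here is a rough Python sketch. The content stream is hand-written and heavily simplified (real ones use font-specific encodings, relative positioning and many more operators), and the name `extract_fragments` is made up for this illustration. It only shows that the order in which text is drawn need not match the order in which you read it.

```python
import re

# A hand-written, simplified page content stream. Each line sets an
# absolute text position with Tm (x and y are the last two numbers)
# and then draws a string with Tj. The drawing order is scrambled.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
1 0 0 1 72 680 Tm (madness inside.) Tj
1 0 0 1 72 700 Tm (PDFs are) Tj
1 0 0 1 130 700 Tm (pure) Tj
ET
"""

# Matches "a b c d x y Tm (text) Tj"; it only handles this toy layout.
PATTERN = re.compile(
    rb"[-\d.]+ [-\d.]+ [-\d.]+ [-\d.]+ ([-\d.]+) ([-\d.]+) Tm \((.*?)\) Tj"
)

def extract_fragments(stream):
    """Return (x, y, text) tuples in the order they appear in the stream."""
    return [
        (float(x.decode()), float(y.decode()), text.decode("latin-1"))
        for x, y, text in PATTERN.findall(stream)
    ]

fragments = extract_fragments(CONTENT_STREAM)

# Order as stored in the file.
stream_order = " ".join(text for _, _, text in fragments)

# PDF's y axis points up, so "top of the page first" means descending y,
# then left to right within a line.
page_order = " ".join(
    text for _, _, text in sorted(fragments, key=lambda f: (-f[1], f[0]))
)

print(stream_order)  # madness inside. PDFs are pure
print(page_order)    # PDFs are pure madness inside.
```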

The next complication is that there are a lot of variations in how these instructions are saved in the PDF. There is a very simple variant that you could actually read when you open the PDF with a plain text editor like Notepad, but there are also a lot more complicated ones. Pretty common is the variant where all these instructions are compressed, like with a ZIP file, and that compressed data is then put into the PDF. Often it isn’t put into the PDF as raw bytes, though. Instead it gets converted into yet another format that is safe to store as plain text. There are standard conversion methods to do such a thing (Base64 being the usual one). PDF uses a different method, ASCII85, which it inherited from PostScript and which hardly anyone else uses.
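As a rough illustration of that layering, here is a Python round trip using only the standard library. I am assuming the text-safe encoding meant here is ASCII85 (PDF calls the filters `/ASCII85Decode` and `/FlateDecode`). The `adobe=True` framing is Python’s convention and may not match byte for byte what a given PDF writer emits, so treat this as a sketch of the idea, not a stream parser.

```python
import base64
import zlib

# Pretend these are the drawing instructions for a page.
instructions = b"BT /F1 12 Tf 72 700 Td (Hello) Tj ET\n" * 20

# What a PDF writer might do: deflate the data (the same algorithm ZIP
# uses), then wrap the binary result in ASCII85 so it is plain text.
compressed = zlib.compress(instructions)
encoded = base64.a85encode(compressed, adobe=True)

# What a reader has to undo, in reverse order.
decoded = base64.a85decode(encoded, adobe=True)
restored = zlib.decompress(decoded)

print(len(instructions), len(compressed), len(encoded))
print(restored == instructions)  # True
```

Note that the ASCII85 step makes the data bigger again (roughly 5 characters for every 4 bytes); its only job is to keep the file free of raw binary.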

Then there is the fun of a PDF being split into individual blocks, with a table of contents for these blocks at the end of the file. You don’t actually need that, because the format would work without it, but if you mess up the TOC then the file is broken. Oh, and the blocks don’t have to be in order from 1 to X either and can appear jumbled too.
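Here is a small Python sketch of the “skip to the end, read the table of contents” step. It builds a tiny PDF-like file in memory and then reads its classic cross-reference (`xref`) table back. The helper names `build_tiny_pdf` and `read_xref` are made up, and the reader assumes a single plain-text table with one section; as far as I know newer PDF versions can store this table in a compressed form instead, which this does not handle.

```python
import re

def build_tiny_pdf():
    """Assemble a minimal PDF-like file and record where each block starts."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))              # byte position of this block
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"           # entry 0 marks a free slot
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out), offsets

def read_xref(data):
    """Return {block number: byte offset} from a classic xref table."""
    # The pointer to the table sits at the very end of the file.
    tail = data[data.rfind(b"startxref"):]
    xref_at = int(re.search(rb"startxref\s+(\d+)", tail).group(1))
    lines = data[xref_at:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at the given offset")
    first, count = map(int, lines[1].split())
    table = {}
    for i, line in enumerate(lines[2:2 + count]):
        offset, _generation, kind = line.split()
        if kind == b"n":                      # "n" = in use, "f" = free
            table[first + i] = int(offset)
    return table

pdf, expected_offsets = build_tiny_pdf()
table = read_xref(pdf)
for number, offset in sorted(table.items()):
    # Each offset should land exactly on "<number> 0 obj".
    print(number, offset, pdf[offset:offset + 7])
```

If a single byte is inserted near the top of the file without updating the table, every offset is off by one, which is why a messed-up table makes a file look broken.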

I may be missing some good bits as it’s been years since I had to deal with this. But these are the “easy basics” that let you read the absolutely simplest documents.

Finally there’s the fact that PDFs are often scans of physical pages, which means you don’t actually have instructions like “this text goes here, that text goes there” but just a sea of “this bit is black, this bit is a dark-brownish grey” a few million times. And if your goal is to convert PDF to text you can add the whole fun of figuring out what words these smudges of colors could be ON TOP of the whole mess I explained before.

so TLDR: if you want to convert a PDF to TXT you’ve got to:

* skip all the way to the end and read the table of contents
* find and read each of the different blocks
* for each block figure out what it is and how it is saved
* for at least some blocks you’re likely going to have to convert the super special ADOBE code into compressed data, figure out how it was compressed and uncompress it
* then try to find the text bits. which may not actually be text but an image which you then have to convert to text first.
* then you have to figure out in what order the text bits need to be read because they can be jumbled, there may be multiple columns, there may be tables, there may be text snaking through in unnatural directions, stuff might be tilted or skewed from the scanning process. You might actually be dealing with a collage of newspaper cutouts. YOU DON’T KNOW.
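The reading-order step is where simple approaches fall over. The Python sketch below uses made-up fragments and a made-up helper, `reading_order`, and it cheats by being told where the gap between two columns is. Real tools have to guess that from the layout, and this ignores tables, rotated text and everything else in the list above.

```python
def reading_order(fragments, gutter_x):
    """Order (x, y, text) fragments for a two-column page.

    gutter_x is the assumed x position of the gap between the columns;
    here it is simply passed in rather than detected.
    """
    left = [f for f in fragments if f[0] < gutter_x]
    right = [f for f in fragments if f[0] >= gutter_x]

    def top_first(fragment):
        return -fragment[1]       # PDF y grows upward, so top = largest y

    ordered = sorted(left, key=top_first) + sorted(right, key=top_first)
    return [text for _, _, text in ordered]

# Four text bits as they might come out of a page, in jumbled order.
fragments = [
    (320, 700, "right column, line 1"),
    (72, 680, "left column, line 2"),
    (320, 680, "right column, line 2"),
    (72, 700, "left column, line 1"),
]

# Naive approach: top to bottom, then left to right. This interleaves
# the two columns line by line.
naive = [t for _, _, t in sorted(fragments, key=lambda f: (-f[1], f[0]))]

print(naive)
print(reading_order(fragments, gutter_x=300))
```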

Oh, and there MAY be a whole bunch of other shenanigans going on at ANY point along this path. Ah yes, and that is for ONE version of PDF. Earlier or later versions may work a bit differently at some points.
