eli5 why pdf files are “Madness inside.”

I made a passing comment asking how hard it would be to convert a PDF file to another file format by writing a Discord bot for it (for our TTRPG game), and one of the players said “Hell, because pdfs are madness inside.”

Can someone explain to me why pdfs are so weird?

Edit: a typo

12 Answers

Anonymous 0 Comments

**tl;dr:** PDFs are far more complicated internally than most people realize.

For one thing, PDF files are *programs* that, when run, produce a rendered document. The format is (or at least used to be) a simplified version of PostScript, another document language.

Being programs, they are not just “lumps of bits” on the disk, they are a potential attack vector. There was a time when the DoD banished them from sensitive installations. Adobe finally got their act together and fixed many (but not all) of the vulnerabilities.

Secondly, many PDFs are simply collections of scans of pages, i.e. they are images. That makes “converting them” to text a bit more complicated, especially if the scans are skewed, dirty, or a little bit out of focus.

Anonymous 0 Comments

PDFs come in many types.
Sometimes these types of PDF are not in a machine-recognizable form; they can be in a strange variation of what a PDF should be – this usually happens when the app used to make the PDF is not an Adobe product.
Sometimes the text in a page can be turned into an image, or a huge collection of images all arranged as one large image, mixed with occasional extra bits of text thrown in just to confuse you.

Anonymous 0 Comments

If the question is “why/for what reason are they complicated?” and not “in what ways are they complicated?” then there are two reasons.
The first is they’re made to incorporate a large number of different media types and formats, and allow them to be displayed in a unified and not garbled way. That makes them very complicated.

The second reason is security: they are intentionally made very difficult to reverse-engineer or decipher in the exact way that you’ve proposed without obtaining software from Adobe Inc. Such software typically consists of a series of “binary blobs”, binary executable programs provided in machine code, which just looks like a bunch of near-meaningless hexadecimal numbers if directly translated into text.

In general, Adobe cares about this kind of security not because of amateur-level interest like yours, but because criminals and state espionage organizations have a much bigger financial incentive to reverse-engineer software and find ways to exploit something like a PDF in order to gain access to others’ computers, usually by tricking a compromised computer’s operating system into installing spyware or ransomware.

Programs are available that can decompile binary executable files into written source code, but the output tends to be difficult to read at best if not entirely gibberish.

These days, executable files from popular software companies are usually created with internal features to make them deliberately difficult to reverse-engineer or decompile. Good luck getting Adobe to provide you with their own written source code for Acrobat, for example. They’ll just hang up.

Anonymous 0 Comments

Software engineer here; I have been working with PDF files for the majority of my career. I believe the main reason converting PDF files to other formats would be hell (and it most certainly would be) is the sheer number of variations you can have inside a PDF. Acrobat itself struggles to keep up with the PDF specs (at least it did in the past).

The need to make the format portable, and thus self-contained, and at the same time versatile and multi-purpose has led to a specification so complex that no software can even be sure to support all its flavours and nuances, let alone interpret them consistently.

Writing PDF files is relatively easy, as you can choose to do it as simply as you like; reading them is the hard part, and by far.

Anonymous 0 Comments

In engineering everything is a tradeoff to achieve a stated goal.

What are the stated design goals of PDF?

1. It should be easily sent to printers.
2. It should be rendered the same on any machine (regardless of fonts, OS, graphic adapters, locales, etc).
3. It should be small in size for large documents (hundreds of pages).

You see how there is no goal “It should be easy to extract meaningful information from a document”?

PDF documents (and programs that create PDFs) are concerned only with how the content looks, not with whether it makes sense semantically.

For example, if you have 5 paragraphs on a page, there is no guarantee that they will go in the same order in the document file. The only thing that matters is how it looks.

For this reason PDF is almost as hard to read as a picture. And programs that do read PDFs do it because they coded hundreds and hundreds of real-world PDF hacks into their readers.
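To make the ordering problem concrete, here is a minimal sketch of the kind of heuristic a reader might use to put text back in reading order. The `Fragment` type, the sample coordinates, and the `line_tolerance` value are all made up for illustration; real readers need far more than this (columns, rotation, tables).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned piece of text, as a reader might collect it."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF's default origin is bottom-left)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read top-down, left-to-right."""
    lines = []
    # Larger y is higher on the page, so sort by descending y.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)  # close enough to the current line's baseline
        else:
            lines.append([frag])    # start a new line
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


# Fragments in the (jumbled) order they happen to appear in the file.
file_order = [
    Fragment("world", 110, 700),
    Fragment("Second", 72, 680),
    Fragment("Hello", 72, 700),
    Fragment("line", 120, 680.5),
]
print(reading_order(file_order))
```

Even this toy version has to guess a tolerance for "same line", which hints at why real readers accumulate so many special cases.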

Anonymous 0 Comments

A PDF contains *PostScript*, which is a Page Description Language. PostScript is a text programming language (Turing-complete) that uses a stack mechanism to render text and images to a display, either a printer page or a screen. It does this by placing elements on the defined page; these elements include glyphs (single characters from a font), images, lines, and shapes. PostScript fonts are a complex set of glyphs that also get embedded into the page if the output device does not have defined embedded fonts.

However, the mechanism by which the PostScript is generated is critical. Back in the ’90s we spent a lot of time looking at PostScript generated by WordPerfect, the leading word processor of the time. Its output wasn’t publication ready. In particular, the kerning (character-to-character spacing) was really bad. Looking at the output of the PostScript printer driver, we could see that for every letter on the page, an individual glyph was being placed, and they were not always in order. My boss spent weeks creating specific kerning settings for all the approved fonts to meet the publication requirements.

Scanning to PDF is likely to just embed bitmap images on to each page of the PDF.

If you are really lucky, the PDF generator will actually emit text strings with an instruction to render them to the page in a specific font at a specific place. That is the easiest way to use PostScript, and allows text extraction. But it is not guaranteed. So unless you control the PostScript generator, PDF to text (or other editable format) is really hard. Generally, it is probably easier to render the PDF page to an image and then use Optical Character Recognition (OCR) to recreate the output.
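As a rough illustration of that lucky case, here is a small sketch that pulls strings out of an uncompressed page content stream. The stream is hand-written for the example, and the regex is a deliberate simplification: it only handles plain `(...) Tj` strings and ignores escaped parentheses, hex strings, the `TJ` array form, and font-specific encodings (decoding as Latin-1 is an assumption).

```python
import re

# Hand-written example stream: begin text, pick font F1 at 12pt,
# move to (72, 720), show a string, move down 14pt, show another, end text.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj 0 -14 Td (Second line) Tj ET"

# A literal string followed by the Tj ("show text") operator.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_simple_text(stream: bytes) -> list[str]:
    """Return the literal strings shown with Tj, in file order."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


print(extract_simple_text(content))
```

If a generator places each glyph individually, as described above, the same approach yields single letters in file order rather than words, which is where the trouble starts.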

Anonymous 0 Comments

Hi. I’m a software engineer and I’ve had the displeasure of trying to work with PDFs programmatically.

PDF is a proprietary file format owned by Adobe. They don’t release publicly usable code for directly reading and manipulating PDFs, they just sell end user software (like Acrobat) that does this. Open source software options for working with PDFs are limited.

The file format itself isn’t some sane thing like you might see in an XML document. It’s a very weird mixture of text and binary data, images and formatting codes interspersed haphazardly. The internal structure of the file is not designed for human readability and it generally isn’t readable until rendered by a PDF rendering engine, though it’s possible to kinda sniff out some blocks of text if you look in the right place.

It’s basically just an uphill battle to try and work with it directly. Adobe doesn’t want you to and they’ve made sure there aren’t good tools available for that purpose.

The file format itself is weird and difficult because it wasn’t really designed to be anything except data storage for PDF software. It’s got lots of weird choices that are the result of feature development for Acrobat rather than being premeditated extensions to a published data format. PDF has never been a published data format. Its purpose is to support commercial software owned by a particular vendor. Being usable by other software systems was never a design goal.

Anonymous 0 Comments

Just to reiterate how PDFs are essentially programs that just happen to usually consist of text and images, you can run games in them given certain conditions.

https://github.com/osnr/horrifying-pdf-experiments

Anonymous 0 Comments

You might consider writing a game engine for foundryvtt.

It wouldn’t help with translating the PDF, unless someone has already done so. It would help the next person along who has the same idea as you.

Anonymous 0 Comments

I’ve had the pleasure of diving into the PDF format a bunch of years ago.

The first complication is that PDFs are not “text documents”; they are instructions on what to draw on a number of pages. They are much more closely related to an image than to a Word document. In a Word document you have the text plus some instructions on how to display it. Then Word does the heavy lifting of making the text fit the screen you display it on or the page you print it on. This is great for editing, not so great for printing. PDFs, on the other hand, were made for printing: they take all the heavy lifting Word has done to figure out where to break a line or start a new page and save the result, so the program can very efficiently tell the printer “this goes there, this goes there, etc”. This makes the format horribly inconvenient for everything else. For one, the words don’t have to be in the PDF in the same order that they appear on the page. This is also why selecting text in a PDF can be so wonky.

The next complication is that there are a lot of variations in how these instructions are saved in the PDF. There is a very simple variant that you could actually read when you open the PDF with a plain text editor like Notepad, but there are also a lot more complicated versions. A pretty common variant is that all these instructions are compressed, like with a ZIP file, and that compressed data is then put into the PDF. But it can’t be put into the PDF as it is. Instead it needs to be converted into yet another format that *can* be put into the PDF. There are standard conversion methods to do such a thing. PDF has decided to use a different method, one that no one else uses.
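Here is a minimal sketch of that compress-then-convert round trip, assuming the compression is zlib/deflate (what PDF calls `FlateDecode`) and the conversion meant here is ASCII85 (`ASCII85Decode`). The sample instructions are invented, and a real stream may use other filters or chain them differently.

```python
import base64
import zlib

drawing_instructions = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a PDF writer might do: compress, then make the bytes ASCII-safe.
compressed = zlib.compress(drawing_instructions)
ascii_safe = base64.a85encode(compressed) + b"~>"  # "~>" marks the end of the data


def decode_stream(data: bytes) -> bytes:
    """Undo the two filters in reverse order: ASCII85 first, then inflate."""
    if data.endswith(b"~>"):
        data = data[:-2]
    return zlib.decompress(base64.a85decode(data))


print(ascii_safe)                  # unreadable, but plain ASCII
print(decode_stream(ascii_safe))   # the original instructions again
```

The reader has to look at the stream’s dictionary to learn which filters were applied and in what order before it can do any of this.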

Then there is the fun of PDF being split into individual blocks and at the end of the PDF there is a table of contents for these blocks. You don’t actually need that because the format would work without it but if you mess up the TOC then the file is broken. Oh and the blocks don’t have to be in order from 1 to X either and can appear jumbled too.
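A rough sketch of how that table of contents (the cross-reference table) gets located, under simplifying assumptions: `build_tiny_pdf` is a hypothetical helper that assembles a toy file with the classic plain-text table, and `read_xref` ignores compressed cross-reference streams and incremental updates, both of which real files use.

```python
import re


def build_tiny_pdf() -> bytes:
    """Assemble a toy PDF, recording the byte offset where each block starts."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(out))
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry 0 is always a free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off     # 10-digit offset, generation, in-use
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return out


def read_xref(pdf: bytes) -> dict[int, int]:
    """Return {object number: byte offset} from a classic xref table."""
    # The pointer to the table sits at the very end of the file.
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf[-1024:])
    if match is None:
        raise ValueError("no startxref found")
    lines = pdf[int(match.group(1)):].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table")
    first, count = map(int, lines[1].split())
    table = {}
    for i, entry in enumerate(lines[2:2 + count]):
        offset, _generation, kind = entry.split()
        if kind == b"n":                     # "n" = in use, "f" = free
            table[first + i] = int(offset)
    return table


pdf = build_tiny_pdf()
print(read_xref(pdf))
```

Since every entry is a raw byte offset, editing anything earlier in the file without rebuilding the table is enough to break it.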

I may be missing some good bits as it’s been years since I had to deal with this. But these are the “easy basics” that let you read the absolutely simplest documents.

Finally there’s the fact that PDFs are often scans of physical pages, which means you don’t actually have instructions like “this text goes here, that text goes there” but just a sea of “this bit is black, this bit is a dark-brownish grey” a few million times. And if your goal is to convert PDF to text you can add the whole fun of figuring out what words these smudges of colors could be ON TOP of the whole mess I explained before.

so TLDR: if you want to convert a PDF to TXT you’ve got to:

* skip all the way to the end and read the table of contents
* find and read each of the different blocks
* for each block figure out what it is and how it is saved
* for at least some blocks you’re likely going to have to convert the super special ADOBE code into compressed data, figure out how it was compressed and uncompress it
* then try to find the text bits. which may not actually be text but an image which you then have to convert to text first.
* then you have to figure out in what order the text bits need to be read because they can be jumbled, there may be multiple columns, there may be tables, there may be text snaking through in unnatural directions, stuff might be tilted or skewed from the scanning process. You might actually be dealing with a collage of newspaper cutouts. YOU DON’T KNOW.

Oh, and there MAY be a whole bunch of other shenanigans going on at ANY point along this path. Ah yes, and that is for ONE version of PDF. Earlier or later versions may work a bit differently at some points.