Hi. I’m a software engineer and I’ve had the displeasure of trying to work with PDFs programmatically.
PDF started out as a proprietary file format owned by Adobe. They don’t release publicly usable code for directly reading and manipulating PDFs; they just sell end-user software (like Acrobat) that does this. Open source software options for working with PDFs are limited.
The file format itself isn’t some sane thing like you might see in an XML document. It’s a very weird mixture of text and binary data, images and formatting codes interspersed haphazardly. The internal structure of the file is not designed for human readability and it generally isn’t readable until rendered by a PDF rendering engine, though it’s possible to kinda sniff out some blocks of text if you look in the right place.
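To make the "sniffing" part concrete, here's a rough Python sketch of what that looks like. It is not a real PDF parser, just an illustration under some loud assumptions: the function name `sniff_text` is made up for this example, it only looks at `stream ... endstream` blocks, it only tries plain zlib (Flate) decompression, and it only catches simple `(some text) Tj` show-text operations. Real files use other filters, hex strings, `TJ` arrays, and font-specific encodings, so expect it to miss or mangle a lot.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A literal string operand followed by the Tj (show text) operator.
# Simplification: no nested parentheses, no TJ arrays, no hex strings.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def sniff_text(pdf_bytes: bytes) -> list[str]:
    """Best-effort scan for literal strings shown with Tj (hypothetical helper)."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Many content streams are zlib-compressed (FlateDecode).
            # decompressobj tolerates the trailing newline before "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data: assume the stream is stored uncompressed.
            data = raw
        for text in TJ_RE.finditer(data):
            # latin-1 never fails to decode, but the result is only
            # meaningful when the font uses a simple ASCII-like encoding.
            found.append(text.group(1).decode("latin-1"))
    return found
```

You'd call it as `sniff_text(open("some.pdf", "rb").read())`, where `some.pdf` is a placeholder path. On a lot of real documents it returns nothing useful, which is kind of the point: the text is in there, but getting it out reliably means implementing a big chunk of what a rendering engine does.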
It’s basically just an uphill battle to try and work with it directly. Adobe doesn’t want you to and they’ve made sure there aren’t good tools available for that purpose.
The file format itself is weird and difficult because it wasn’t really designed to be anything except data storage for PDF software. It’s got lots of weird choices that are the result of feature development for Acrobat rather than being premeditated extensions to a cleanly designed data format. A specification for PDF has been published, but the format’s original purpose was to support commercial software owned by a particular vendor. Being usable by other software systems was never a primary design goal.