With PDFs there is no single way to structure your content. It’s a WYSIWYG world.
If you have two identical-looking PDFs, one may be nicely structured internally, containing all the raw text and graphics in a sequence closely matching the order in which they are displayed. The other can have its content strewn all over the inside of the file, using absolute positioning and using only images and glyphs for the text. It all depends on who made the particular PDF and how.
For converting: if your PDFs all have the same nice structure and you only want their text, then it’s straightforward. If you find you’re having a lot of trouble converting a pile of them, consider other tools, such as OCR for the text and ML to extract images and positional data if needed.
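As a rough illustration of that triage, here is a minimal Python sketch that decides, page by page, whether already-extracted text looks usable or whether the page should go to OCR instead. It assumes you have pulled each page's text out with whatever extractor you prefer (pypdf's `page.extract_text()` is one option). The names `needs_ocr` and `triage` and both thresholds are made up for this example and would need tuning on your own files.

```python
from typing import Iterable, List


def needs_ocr(page_text: str, min_chars: int = 20, min_ratio: float = 0.6) -> bool:
    """Heuristic: True when extracted text is too short or too garbled to trust.

    min_chars and min_ratio are arbitrary starting points, not recommended values.
    """
    stripped = page_text.strip()
    # Pages made only of images usually yield little or no extractable text.
    if len(stripped) < min_chars:
        return True
    # Glyph-only or badly encoded text tends to come out as symbol soup.
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_ratio


def triage(pages: Iterable[str]) -> List[str]:
    """Label each page's extracted text as 'text' (keep it) or 'ocr' (re-process)."""
    return ["ocr" if needs_ocr(text) else "text" for text in pages]
```

Calling `triage` on the per-page strings from one PDF gives a list of labels, so well-structured pages keep their cheap direct extraction and only the problem pages are sent through the slower OCR route.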