Why does copy/pasting from PDFs sometimes have a problem with the ‘ti’ combination of letters?

It seems that when you copy and paste from a PDF, there’s sometimes a question mark in a box replacing letters, particularly t or ti, like some strange Unicode error. But why, in my experience, does it only affect these characters?

4 Answers

Anonymous 0 Comments

PDF is not a text format; it's a format designed for printing.

A PDF can literally just be a picture of some text, unlike a Word document, which comes from actual text-formatting software.

There are thousands of different ways to extract text from a PDF, but in some cases the text information is just not part of the PDF file; instead it's just a bunch of pixels.

Some software might try OCR (optical character recognition), but that's just not accurate in every case.

Anonymous 0 Comments

Because the text has been replaced by a ligature character. When you copy from the PDF, the code for the replacement glyph is copied instead of the original letters. When you paste that into the destination text box, that system tries to render a character code it can’t find a glyph for, resulting in the error box.
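To make that concrete, here is a minimal Python sketch of what cleanup can look like when the copied code is one of the ligatures Unicode does define (ff, fi, fl, ffi, ffl and a couple of others). The function name `expand_ligatures` and the sample string are made up for illustration. Note that Unicode has no standard code point for a "ti" ligature, which is probably part of why ti is hit harder: the PDF has nothing standard to map it to.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Expand compatibility characters (ligatures included) into plain letters.

    NFKC normalization also folds other compatibility forms (for example
    superscript digits), so it is a blunt tool, not a ligature-only fix.
    """
    return unicodedata.normalize("NFKC", text)

# Sample pasted text containing U+FB01 (fi ligature) and U+FB02 (fl ligature).
pasted = "de\ufb01nition of a \ufb02oating of\ufb01ce"
print(expand_ligatures(pasted))
```

This only helps when the pasted character is a real Unicode ligature. If the PDF hands over a code that means nothing outside its own embedded font, normalization has nothing to work with.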

Anonymous 0 Comments

PDFs were never intended for editing, but for sending files to printers extremely accurately no matter which system is used. So a generated PDF will not rely on the fonts installed on your system but will embed them in the PDF file. This also allows letter combinations that are commonly connected in many typefaces, like ti, to be stored not as individual t and i letters but as a single combined symbol for ti.

When you copy and paste that text into a text box expecting standard ASCII or Unicode symbols, it won’t have a ‘picture’ for that ti symbol to display, so it defaults to the box, which shows that a symbol is supposed to go here but the system has no font installed that can display it.
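A rough way to spot those "box" characters in pasted text is to look for code points that have no standard meaning. The Python sketch below flags private-use and unassigned code points plus the U+FFFD replacement character. The mapping from U+E000 to "ti" is purely an assumption for the example: which code a PDF uses for its ti glyph depends on the embedded font, so a real mapping has to be worked out per document.

```python
import unicodedata

# Hypothetical mapping for illustration only: the code point a given PDF
# uses for its "ti" glyph depends on the font embedded in that PDF.
ASSUMED_GLYPH_MAP = {"\ue000": "ti"}

def is_unmapped(ch: str) -> bool:
    """True for private-use (Co) or unassigned (Cn) code points, or U+FFFD."""
    return ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn")

def find_unmapped(text: str) -> list:
    """Return (index, code point label) pairs for characters likely to show as boxes."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_unmapped(ch)]

def repair(text: str, glyph_map: dict = ASSUMED_GLYPH_MAP) -> str:
    """Substitute known stand-in code points with the letters they represent."""
    return "".join(glyph_map.get(ch, ch) for ch in text)

pasted = "informa\ue000on"
print(find_unmapped(pasted))
print(repair(pasted))
```

Characters that are flagged but missing from the map are left alone, so nothing gets guessed silently.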

Anonymous 0 Comments

Some fonts have special characters called “ligatures” which are combinations of letters designed specifically to look good together – as a single character. For example, “ti” – in some fonts, the cross of the t and the dot of the i can overlap and look weird, so there’s a special character which is a specifically-designed version of t and i together.

Copying and pasting from PDFs is a dicey prospect at best. The text in a PDF often isn’t stored as plain “text”; it’s stored as positioned glyphs from an embedded font or, in scanned documents, as a compressed, processed image of text. When you attempt to copy the text out of it, your computer maps those glyphs back to letters, or runs a text-recognition algorithm, and does the best it can at identifying them. Weird characters like ligatures sometimes confuse it and you get errors.