I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string
EM? what it is:
gets split into:
COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}
(when checked by printing the cosString in the above-mentioned code). As far as I can see, there are only Latin characters in the file, and the encoding is ISO-8859-1. Any ideas?
Regards,
Salil
This is most likely a PDF formatting issue. That is how your particular PDF stores the text, in order to get correct letter spacing or kerning. This varies greatly from PDF to PDF, depending on how they were created.
Typically, I would suggest simply merging all the different tokens into one big content string.
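A minimal sketch of that merging step, written in plain Java so it runs without PDFBox on the classpath. It assumes the operands of one text-showing operator have already been pulled out into a list, with a String standing in for each COSString fragment and a Number standing in for each kerning adjustment between fragments. The class name TokenMerger and the method merge are hypothetical, not part of PDFBox.

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox. Each String models one text
// fragment (a COSString in the real token stream); each Number models a
// positioning adjustment that sits between fragments.
final class TokenMerger {
    private TokenMerger() {}

    // Concatenates the text fragments in order and skips everything else.
    static String merge(List<?> operands) {
        StringBuilder merged = new StringBuilder();
        for (Object operand : operands) {
            if (operand instanceof String) {
                merged.append((String) operand);
            }
            // Numbers only shift glyph positions; they carry no text.
        }
        return merged.toString();
    }
}
```

Calling TokenMerger.merge on the fragments from the question ("E", "M?", " ", "w", "hat ", "it ", "is", ":"), with or without numbers between them, gives back the original string, which can then be searched or replaced as a whole. Note that writing the merged string back as a single token discards the per-fragment spacing, so the rendered text may look slightly different.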