
Parsing PDF file using Apache PDFBox

I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(as seen when printing the cosString in the above-mentioned code). As far as I can see, the file contains only Latin characters, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

This is most likely a PDF formatting issue rather than a PDFBox or encoding problem. Your particular PDF stores the text in fragments so that it can adjust the spacing between them, for letter spacing or kerning. How text is fragmented varies greatly from PDF to PDF, depending on the tool that created it.

Typically, I would suggest simply merging all the different tokens into one big content string before searching or modifying the text.
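As a rough illustration of the merging idea, here is a minimal sketch in plain Java. It does not use PDFBox types. It assumes the fragments arrive as a list that mixes strings with numeric spacing adjustments, which is how a PDF `TJ` text-showing array is laid out. The class name `TjMergeSketch`, the method `mergeStrings`, and the two numeric values are hypothetical and chosen only for the example. In real PDFBox code the elements would be `COSString` and `COSNumber` objects taken from the content stream tokens.

```java
import java.util.Arrays;
import java.util.List;

public class TjMergeSketch {

    /**
     * Concatenates the string fragments of a modeled TJ array and skips
     * the numeric entries, which only adjust glyph positions (kerning).
     */
    public static String mergeStrings(List<Object> tjArray) {
        StringBuilder merged = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                merged.append((String) element);
            }
            // Numbers (spacing adjustments) are intentionally dropped.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Fragments from the question; the numbers -20 and 15 are made-up
        // kerning values, since the original output did not show them.
        List<Object> tjArray = Arrays.<Object>asList(
                "E", -20, "M?", " ", 15, "w", "hat ", "it ", "is", ":");
        System.out.println(mergeStrings(tjArray));
    }
}
```

Note that merging discards the spacing adjustments, so if the merged (or replaced) text is written back as a single string, the rendered letter spacing may differ slightly from the original.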

