Parsing PDF file using Apache PDFBox

Question

I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(as seen when printing the cosString in the above-mentioned code). As far as I can see, there are only Latin characters in the file, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

Answer 1

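This is most likely a consequence of how the PDF was written, not an encoding problem. Your particular PDF stores the text in fragments so that it can adjust letter spacing or kerning between them. In a content stream this typically shows up as a TJ operator whose array operand interleaves short strings with numeric position adjustments. How finely the text is split varies greatly from PDF to PDF, depending on how the file was created.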

Typically, I would suggest simply merging all the different tokens into one big content string.
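
As a rough illustration of that merging step, here is a minimal sketch in plain Java. It does not use PDFBox itself: the class TokenMerger and its merge method are hypothetical names, String entries stand in for the COSString tokens from the question, and Number entries stand in for the kerning adjustments that may sit between them (the numeric values below are made up). In real code you would iterate over the parsed tokens and read each COSString's text with getString() instead.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the operands of one text-showing operator:
// String entries play the role of PDFBox COSString tokens,
// Number entries the role of kerning/position adjustments.
class TokenMerger {

    // Concatenate the text fragments in order, skipping numeric adjustments.
    static String merge(List<Object> tokens) {
        StringBuilder merged = new StringBuilder();
        for (Object token : tokens) {
            if (token instanceof String) {
                merged.append((String) token);
            }
            // Numbers only shift glyph positions; they carry no text.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Fragments taken from the question; the adjustment values are invented.
        List<Object> tokens = Arrays.asList(
                "E", -20, "M?", " ", "w", 15, "hat ", "it ", "is", ":");
        System.out.println(merge(tokens)); // prints "EM? what it is:"
    }
}
```

Once the fragments are merged you can search or replace on the whole string, but note that writing back a single string drops the original kerning adjustments, so the spacing of the modified text may differ slightly from the original.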

solution1
1 2013-04-01 11:31:14
