Parsing PDF file using Apache PDFBox

I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(when checked by printing the cosString in the above-mentioned code). As far as I can see, there are only Latin characters in the file, and the encoding is also ISO-8859-1. Any ideas?

Regards,

Salil

This is most likely a PDF formatting issue. That is how your particular PDF stores the text in order to get correct letter spacing or kerning. This varies greatly from PDF to PDF, depending on how they were created.

Typically, I would suggest simply merging all the separate tokens into one big content string.
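As a rough sketch of that merging step, the snippet below deliberately avoids the PDFBox API and models the operands with plain Java objects. It assumes the fragments come from a single text-showing array (as used by the PDF TJ operator), in which strings alternate with numeric spacing adjustments. The class name TokenMerger, the method merge, and the kerning values are hypothetical and only serve the illustration; the string fragments are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;

class TokenMerger {

    /**
     * Concatenates the string fragments of one text-showing array and
     * skips the numeric spacing adjustments that sit between them.
     */
    static String merge(List<Object> operands) {
        StringBuilder merged = new StringBuilder();
        for (Object operand : operands) {
            if (operand instanceof String) {
                merged.append((String) operand);
            }
            // Anything else (numbers here) is a spacing offset, not text.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Fragments from the question; the numbers are made-up kerning values.
        List<Object> operands = new ArrayList<>();
        operands.add("E");
        operands.add(-15);
        operands.add("M?");
        operands.add(" ");
        operands.add(20);
        operands.add("w");
        operands.add("hat ");
        operands.add("it ");
        operands.add(-10);
        operands.add("is");
        operands.add(":");

        System.out.println(merge(operands));
    }
}
```

In real code the same loop would run over the tokens you get from PDFBox, appending the text of each COSString and skipping the numbers. Note that writing the merged text back as one string discards the original spacing adjustments, so the rendered letter spacing may change slightly.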
