Parsing PDF file using Apache PDFBox
I am trying to modify the contents of a PDF document using PDFBox. I used this example as it is, but observed that the text in my PDF file is getting split at character level (or worse). For example, the string
EM? what it is:
gets split into:
COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}
(when checked by printing the cosString in the above-mentioned code). As far as I can see, there are only Latin characters in the file, and the encoding is also ISO-8859-1. Any ideas?
Regards,
Salil
This is most likely a PDF formatting issue. That is how your particular PDF stores the text in order to get correct letter spacing or kerning. This varies greatly from PDF to PDF, depending on how they were created.
Typically, I would suggest simply merging all the different tokens into one big content string.
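As a rough illustration of that merging step, here is a minimal sketch in plain Java. It does not call PDFBox at all: the class name TokenMerger and its merge method are hypothetical, and ordinary String and Number objects stand in for the COSString fragments and the numeric spacing adjustments that may sit between them. The numeric values in the sample are made up. It also assumes all the fragments belong to the same run of text; in a real content stream you would have to decide where one run ends and the next begins.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox. Plain Java objects stand in for
// the parsed tokens: String ~ a COSString fragment, Number ~ a spacing/kerning
// adjustment between fragments.
final class TokenMerger {

    private TokenMerger() {}

    // Concatenates the string fragments in order and skips everything else.
    static String merge(List<?> tokens) {
        StringBuilder merged = new StringBuilder();
        for (Object token : tokens) {
            if (token instanceof String) {
                merged.append((String) token);
            }
            // Numeric adjustments carry no text, so they are dropped here.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // The fragments from the question, with two invented adjustments mixed in.
        List<Object> tokens = Arrays.asList(
                "E", -20, "M?", " ", "w", 15.5, "hat ", "it ", "is", ":");
        System.out.println(merge(tokens));
    }
}
```

Once the fragments are merged you can search or replace on the full string. Writing a single merged string back into the document discards the original per-fragment spacing, so the rendered text may look slightly different.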