Parsing PDF files using Apache PDFBox

Question

I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(This is what I see when printing the cosString in the above-mentioned code.) As far as I can see, there are only Latin characters in the file, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

Answer 1

This is most likely a PDF formatting issue. That is how your particular PDF stores the text in order to get correct letter spacing or kerning. This varies greatly from PDF to PDF, depending on how they were created.

Typically, I would suggest simply merging all the different tokens into one big content string.
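As a rough illustration of the merging idea, here is a minimal sketch in plain Java. It does not use PDFBox itself: the tokens are modeled as a list holding String values (standing in for the COSString fragments) and Integer values (standing in for kerning adjustments that can sit between fragments). The class name TokenMerger, the method mergeStrings, and the kerning numbers are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the tokens of one
// text-showing array as Strings (text fragments) and Integers (kerning).
class TokenMerger {

    // Concatenates every String token in order and skips everything else,
    // so the kerning adjustments between fragments are dropped.
    static String mergeStrings(List<Object> tokens) {
        StringBuilder merged = new StringBuilder();
        for (Object token : tokens) {
            if (token instanceof String) {
                merged.append((String) token);
            }
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Fragments taken from the question; the numbers are invented
        // placeholders for kerning values.
        List<Object> tokens = new ArrayList<>();
        tokens.add("E");
        tokens.add(-20);
        tokens.add("M?");
        tokens.add(" ");
        tokens.add("w");
        tokens.add(15);
        tokens.add("hat ");
        tokens.add("it ");
        tokens.add("is");
        tokens.add(":");

        System.out.println(mergeStrings(tokens)); // prints: EM? what it is:
    }
}
```

With real PDFBox tokens, the same loop would presumably test for COSString and append the result of getString(). Note that writing the merged string back as a single token discards the per-fragment spacing, so the rendered text may look slightly different.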

Answer 1
1 2013-04-01 11:31:14
