Java - PDFBox - Text extraction

Question

I have been using PDFBox to extract text information from PDFs. I have successfully parsed all the text properties, such as font name, font face, size, position, etc.

PROBLEM: I am using PDFBox 1.2.1 (the latest version). getCharacter() in the TextPosition class returns the full string except for the last character. The last character is parsed as a separate string.

Ex: "How are you" is parsed as "How are yo" and "u" (2 separate strings).

I don't want it to happen that way.

Has anybody come across this? Am I doing something wrong? Waiting for a reply.

Thanks and Regards, Magggi

Answer 1

This issue is solved.

The following code in processEncodedText(byte[] string) in PDFStreamEngine.java:

if( spacingText == 0 && (i + codeLength) < (string.length - 1) )
{
    continue;
}

should be changed to:

if( spacingText == 0 && (i + codeLength) < (string.length) )
{
    continue;
}

Regards, Maggi
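To see why the one-character change matters, here is a minimal sketch of the buffering behaviour. It is a simplified model, not the actual PDFBox source: the class FlushDemo, the method chunk and the flag fixed are hypothetical names, and it assumes single-byte character codes (codeLength of 1) and a spacingText of 0. Characters are collected in a buffer and emitted as one chunk whenever the "keep buffering" condition fails.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of the loop discussed above (NOT the real PDFStreamEngine).
class FlushDemo {
    // Splits the bytes into chunks the way the buffering loop would emit them.
    // fixed == false uses the original bound (string.length - 1),
    // fixed == true uses the corrected bound (string.length).
    static List<String> chunk(byte[] string, boolean fixed) {
        List<String> chunks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int codeLength = 1;  // assumption: one byte per character code
        int spacingText = 0; // assumption: no extra spacing between characters
        int limit = fixed ? string.length : string.length - 1;
        for (int i = 0; i < string.length; i += codeLength) {
            buffer.append((char) string[i]);
            if (spacingText == 0 && (i + codeLength) < limit) {
                continue; // more characters follow, keep buffering
            }
            // Condition failed: emit the buffered text as one chunk.
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] text = "How are you".getBytes(StandardCharsets.US_ASCII);
        System.out.println(chunk(text, false)); // prints [How are yo, u]
        System.out.println(chunk(text, true));  // prints [How are you]
    }
}
```

With the original bound, the condition already fails at the second-to-last character, so the buffer is flushed one character early and the final character ends up in its own chunk. With the corrected bound, the flush only happens once the last character has been buffered.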

Answer 2

Yes. This issue has been fixed in PDFBox.
Try the latest version of PDFBox. It can be downloaded from http://pdfbox.apache.org/download.html

Answer 1
3 2010-08-30 12:09:29

Answer 2
1 2012-06-30 05:17:54
