Facing issues extracting text from a PDF file using Java

Asked 2014-01-22 09:56:22 · Tags: java, pdf, text-extraction

I am not able to extract text from a PDF that uses fonts with custom encoding, which can be identified via File -> Properties -> Font in Adobe Reader. One of the fonts is listed as: C0EX02Q0_22, Type: Type 3, Encoding: Custom, Actual Font: C0EX02Q0_22, Actual Font Type: Type 3.

Please let me know whether there is any way to extract the text content from such PDF files. Currently I am using PDFText2HTML from pdf util. When extracting such PDF files, I get values like 'ÁÙÅ@ÅÕãÉ'.

Sample PDF: tesis completa.pdf

In this PDF you can see that the fonts used have custom encoding, e.g. T3Font_1 (see File -> Properties -> Font in Adobe Reader). Since I could not upload my own PDF, I have linked a sample one that has the same issue.

1 Answer

Extraction as described in the standard

The PDF specification ISO 32000-1 describes in section 9.10, Extraction of Text Content, how text extraction can be done if the PDF provides the required information and does so correctly.

Using this algorithm, though, only works in a few page ranges of the document (namely the summaries, the content lists, the thank-yous, and the section Publicación 7); in the other ranges it results in gibberish, e.g. 8QLYHUVLWDWGH/OHLGD instead of Universitat de Lleida. Looking at the PDF objects in question makes clear that the required information is missing: there is no ToUnicode map, and while the Encoding is based on WinAnsiEncoding, all positions in use are mapped via Differences to non-standard names.

Trying to extract the text using copy & paste from Adobe Reader also returns that gibberish. This generally is a sign that generic extraction is not possible.

A work-around

Inspecting the PDF objects and the output of the generic text extraction attempt, though, suggests that the actual encoding of the text extracted as gibberish is the same for all fonts used, and that it is some ASCII-based encoding shifted by a constant: adding 'U' - '8' (i.e. 29) to each character of the extracted 8QLYHUVLWDWGH/OHLGD results in Universitat de Lleida. Adding the same constant to the chars of text extracted elsewhere in the document also results in correct text, as long as the text only uses ASCII characters.

Characters outside the ASCII range are not mapped correctly by that simple method, but they also always seem to be extracted as the same wrong character, e.g. the glyph 'ó' is always extracted as 'y'.

Thus, you can extract the text from that document (and similarly created ones) in two steps. First extract the text using the standard algorithm. Then, in the gibberish sections (which can probably be identified by font name), replace each character: add 'U' - '8' for small values, and look up higher values in a fixed mapping.
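The post-processing step described above could be sketched in plain Java roughly as follows. This is only a minimal illustration under assumptions: the class name `ShiftedTextDecoder` and the `NON_ASCII` table are hypothetical, the table holds only the single 'y' to 'ó' pair observed above (a real table would have to be built by comparing extracted text with the rendered pages), and the input is assumed to be the gibberish already returned by a generic extractor.

```java
import java.util.HashMap;
import java.util.Map;

final class ShiftedTextDecoder {
    // Constant offset observed in the sample document: 'U' - '8' == 29.
    static final int SHIFT = 'U' - '8';

    // Hypothetical lookup table for glyphs outside the ASCII range.
    // Only 'y' -> 'ó' comes from the observations above; the rest is unknown.
    static final Map<Character, Character> NON_ASCII = new HashMap<>();
    static {
        NON_ASCII.put('y', '\u00F3'); // ó
    }

    static String decode(String gibberish) {
        StringBuilder out = new StringBuilder(gibberish.length());
        for (int i = 0; i < gibberish.length(); i++) {
            char c = gibberish.charAt(i);
            int shifted = c + SHIFT;
            if (shifted <= 0x7E) {
                // Small values: the shifted result is still plain ASCII.
                out.append((char) shifted);
            } else {
                // Higher values: use the table, or U+FFFD if the pair is unknown.
                Character mapped = NON_ASCII.get(c);
                out.append(mapped != null ? mapped.charValue() : '\uFFFD');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "UniversitatdeLleida"; the gibberish carries no space characters.
        System.out.println(decode("8QLYHUVLWDWGH/OHLGD"));
    }
}
```

Unknown higher values are emitted as U+FFFD here so that gaps in the table stay visible instead of silently producing wrong letters.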

As you mentioned Java in your question, I have run your document through iText and PDFBox text extraction, with and without shifting by 'U' - '8', and the results look promising. I assume other general-purpose Java PDF libraries will also work.

Another work-around

Instead of creating custom extraction routines, you can try to fix the PDFs in question by adding ToUnicode map entries to the affected fonts. After that, normal text extraction programs should be able to extract the contents properly.
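To give an idea of what such a ToUnicode map might contain, here is a rough Java sketch that only generates the CMap text, following the layout of the ToUnicode example in ISO 32000-1. It rests on several assumptions: the affected fonts use single-byte character codes, each code equals the gibberish character seen in extraction, the shift of 29 holds for all of them, and the class name `ToUnicodeCMapSketch` is hypothetical. Embedding the result as a stream and referencing it from each font dictionary still has to be done with a PDF library and is not shown.

```java
final class ToUnicodeCMapSketch {
    // Constant offset observed in the sample document: 'U' - '8' == 29.
    static final int SHIFT = 'U' - '8';

    static String buildCMap() {
        StringBuilder entries = new StringBuilder();
        int count = 0;
        // Codes whose shifted value is printable ASCII (0x20..0x7E).
        for (int code = 0; code <= 0xFF; code++) {
            int unicode = code + SHIFT;
            if (unicode >= 0x20 && unicode <= 0x7E) {
                entries.append(String.format("<%02X> <%04X>\n", code, unicode));
                count++;
            }
        }
        // The one non-ASCII pair observed above: code of 'y' -> U+00F3 (ó).
        entries.append(String.format("<%02X> <%04X>\n", (int) 'y', 0x00F3));
        count++;

        // A single bfchar block may hold at most 100 entries; this one stays below that.
        return "/CIDInit /ProcSet findresource begin\n"
             + "12 dict begin\n"
             + "begincmap\n"
             + "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
             + "/CMapName /Adobe-Identity-UCS def\n"
             + "/CMapType 2 def\n"
             + "1 begincodespacerange\n"
             + "<00> <FF>\n"
             + "endcodespacerange\n"
             + count + " beginbfchar\n"
             + entries
             + "endbfchar\n"
             + "endcmap\n"
             + "CMapName currentdict /CMap defineresource pop\n"
             + "end\n"
             + "end\n";
    }

    public static void main(String[] args) {
        System.out.print(buildCMap());
    }
}
```

Each `<code> <unicode>` line tells an extractor which Unicode value a character code stands for, e.g. `<38> <0055>` maps the code extracted as '8' to 'U'.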