
Extract text from a linearized PDF with PDFBox

Original post: 2022-09-07 06:13:32. Tags: java, pdf, pdfbox

I am using org.apache.pdfbox.text.PDFTextStripper, version 2.0.26. It works well for most PDFs, but it cannot extract text correctly from a linearized PDF (extracted text).

Is there a way to extract text from a linearized PDF with PDFBox or with other tools?

Here is a linearized PDF example.

2 answers

The problem with your example PDF is not that it's linearized.

The actual problem is that most fonts in your PDF are missing the information necessary for text extraction: they have neither ToUnicode maps nor useful Encodings, and they are Type 3 fonts, which prevents retrieving additional information from an associated font program or CIDFont dictionary.

In particular, such PDFs are usually generated explicitly to prevent text extraction by regular text extractors.

For such PDFs, essentially your only option is to try OCR.
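To illustrate why a missing ToUnicode map combined with a useless Encoding defeats extraction, here is a minimal sketch in plain Java. It is a simplified model of the lookup chain, not PDFBox's actual code: the class `GlyphDecodingSketch`, the method `decode`, the three-entry glyph list, and glyph names such as `g1` are all assumptions made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of how an extractor resolves character
// codes to Unicode. This is NOT PDFBox's implementation.
class GlyphDecodingSketch {
    // Tiny stand-in for a standard glyph-name list (assumption: three names only).
    static final Map<String, String> GLYPH_LIST = new HashMap<>();
    static {
        GLYPH_LIST.put("Y", "Y");
        GLYPH_LIST.put("gamma", "\u03B3");
        GLYPH_LIST.put("theta", "\u03B8");
    }

    // Try the ToUnicode map first, then the glyph name from the Encoding.
    // If both fail, emit U+FFFD because nothing else identifies the character.
    static String decode(int[] codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> encoding) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text == null) {
                String glyphName = encoding.get(code);
                text = (glyphName == null) ? null : GLYPH_LIST.get(glyphName);
            }
            out.append(text != null ? text : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};
        // A font with meaningful glyph names can be decoded without ToUnicode.
        Map<Integer, String> usefulEncoding = Map.of(1, "Y", 2, "gamma", 3, "theta");
        // A font with opaque glyph names and no ToUnicode map cannot.
        Map<Integer, String> opaqueEncoding = Map.of(1, "g1", 2, "g2", 3, "g3");

        System.out.println(decode(codes, Map.of(), usefulEncoding));
        System.out.println(decode(codes, Map.of(), opaqueEncoding));
    }
}
```

In this model the second call yields only replacement characters, which is roughly the situation described above: the glyph shapes render fine on screen, but nothing in the file says which characters they represent.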

Linearization should not be an issue for text extraction, but not all extracted plain text will be what you expect, since some constructs cannot be described in plain text. It is not clear which part of the source file you are showing, but plain pdftotext does not seem to have a problem with it. I would avoid generic OCR, as it is likely to add errors. Maths formulas are best converted by dedicated equation converters that run their OCR on image snippets. Mathpix Snip (https://mathpix.com/) is the commercial market leader, and there are few competitors; see InftyReader (https://www.sciaccess.net/en/InftyReader/).

Here we can see the SVG formula that Infty isolated from the PDF, along with its OCR-extracted characters, Yj= γEj 1/θ(Ej)θ−1 ...., which are meaningless for this type of reversal. Copying math tables or formulas as images is usually the best possible solution; otherwise the result is highly likely to be corrupted. Note how some braces are recognized but some critical ones are not.

We can see why that happens by looking at the outlines in that area: CMEX10 looks like one of the worst-defined fonts as text.

θ+ γLj 1/θ(Lj)θ−1