Detect whether a PDF was created from a scanned document using OCR [pdfbox]

Question

I would like to know whether a PDF was created from a scanned document using OCR.

To make the scanned text selectable, I assume the same text is written again using a transparent color, a special font, ...

I'm using pdfbox. I looked at the font, the color, and many other properties, but I didn't find anything special.

Answer 1

In my case, the text rendering mode was set to "Neither fill nor stroke text", which makes the text invisible.

pdfbox code:

getGraphicsState().getTextState().getRenderingMode() == PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT
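
For context, the check above corresponds to the `Tr` operator with operand 3 in the page's content stream. As far as I know, in PDFBox 2.x the int constant was replaced by the enum value `RenderingMode.NEITHER`, so the comparison may need adjusting there. The following is only a library-free sketch of the same idea, assuming you already have the decoded content stream as a string. The class and method names (`RenderModeScan`, `countTextShows`, `looksLikeOcrLayer`) are made up for illustration, and the tokenizer is deliberately simplified (it ignores inline images, for example).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of pdfbox: scans a decoded content stream
// and counts text drawn in rendering mode 3 (neither fill nor stroke).
final class RenderModeScan {

    private static final String DELIMS = "()<>[]/%";

    /**
     * Returns {invisible, visible}: the number of text-showing operators
     * (Tj, TJ, ', ") executed in rendering mode 3 versus any other mode.
     */
    static int[] countTextShows(String content) {
        int mode = 0;                               // PDF default: fill
        Deque<Integer> saved = new ArrayDeque<>();  // modes saved by q, restored by Q
        int invisible = 0;
        int visible = 0;
        String last = "";                           // previous token, used as the Tr operand
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c) || c == '[' || c == ']' || c == ')' || c == '>') {
                i++;
            } else if (c == '%') {                  // comment: skip to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
            } else if (c == '(') {                  // literal string: skip, honouring nesting and escapes
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char s = content.charAt(i);
                    if (s == '\\') i++;             // skip the escaped character
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                }
            } else if (c == '<') {                  // hex string or dictionary: skip to the next '>'
                while (i < n && content.charAt(i) != '>') i++;
            } else {
                int start = i++;
                while (i < n && !Character.isWhitespace(content.charAt(i))
                        && DELIMS.indexOf(content.charAt(i)) < 0) {
                    i++;
                }
                String tok = content.substring(start, i);
                switch (tok) {
                    case "q": saved.push(mode); break;
                    case "Q": if (!saved.isEmpty()) mode = saved.pop(); break;
                    case "Tr":
                        try {
                            mode = Integer.parseInt(last);
                        } catch (NumberFormatException e) {
                            // malformed operand: keep the current mode
                        }
                        break;
                    case "Tj": case "TJ": case "'": case "\"":
                        if (mode == 3) invisible++; else visible++;
                        break;
                    default: break;
                }
                last = tok;
            }
        }
        return new int[] {invisible, visible};
    }

    /** Heuristic only: the stream contains text, and all of it is drawn invisibly. */
    static boolean looksLikeOcrLayer(String content) {
        int[] counts = countTextShows(content);
        return counts[0] > 0 && counts[1] == 0;
    }
}
```

Calling `RenderModeScan.looksLikeOcrLayer(stream)` on each page's content and requiring it to hold for most pages is one way to turn this into a document-level guess. A PDF with invisible text is not necessarily a scan, so treat the result as a hint rather than proof.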

Answer 2

In most cases, the original image is still present, and the OCR'd text sits invisibly underneath it.

So, one possibility would be to find out whether there is a picture covering the entire area that contains text.

Another possibility would be to look at the fonts and make some informed decisions based on them.
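
The first possibility (a picture covering the text area) could be sketched roughly as below. This assumes you have already collected the bounding boxes of the text runs and of the images on a page, in the same coordinate system. How you obtain them from pdfbox is left out here. `TextUnderImage`, `Box`, and `fractionCovered` are hypothetical names, and the sketch ignores drawing order, so it cannot tell whether the image is actually above or below the text.

```java
import java.util.List;

// Hypothetical helper, not part of pdfbox: checks how much of a page's text
// lies inside the bounding box of some image on the same page.
final class TextUnderImage {

    /** Axis-aligned box in page units, with (x, y) as the lower-left corner. */
    static final class Box {
        final double x;
        final double y;
        final double width;
        final double height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True if {@code other} lies inside this box, allowing a small tolerance. */
        boolean contains(Box other, double tol) {
            return other.x >= x - tol
                    && other.y >= y - tol
                    && other.x + other.width <= x + width + tol
                    && other.y + other.height <= y + height + tol;
        }
    }

    /**
     * Fraction (0.0 to 1.0) of text boxes that are fully contained in at least
     * one image box. Returns 0.0 when the page has no text at all.
     */
    static double fractionCovered(List<Box> text, List<Box> images, double tol) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int covered = 0;
        for (Box t : text) {
            for (Box img : images) {
                if (img.contains(t, tol)) {
                    covered++;
                    break;  // one covering image is enough for this text box
                }
            }
        }
        return (double) covered / text.size();
    }
}
```

A value close to 1.0 on most pages would be consistent with a scan plus OCR layer. A page with only a small logo image would score near 0.0. The tolerance and whatever threshold you apply are judgment calls, not values taken from any specification.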

2 answers

Answer 1: 2, accepted, 2014-06-16 09:22:01

Answer 2: 0, 2014-06-12 15:46:18
