
Detect whether a PDF was created from a scanned document using OCR [pdfbox]

I would like to know whether a PDF was created from a scanned document using OCR.

To make the text from the scanned document selectable, I guess the same text is written using a transparent color, a special font, ...

I'm using pdfbox. I looked at the font, the color, and many other properties, but I didn't find anything special.

In my case the text rendering mode was set to "Neither fill nor stroke text".

pdfbox code:

getGraphicsState().getTextState().getRenderingMode() == PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT
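The check above fires once per piece of text, so a document-level decision still needs some aggregation. The following is a minimal sketch of one way to do that, assuming the caller has already collected the numeric text rendering mode of every glyph (for example from a text-stripper subclass that performs the check above). The class name `OcrLayerHeuristic`, its methods, and the threshold are hypothetical and not part of pdfbox; the only fixed fact used is that the PDF specification numbers "neither fill nor stroke" as rendering mode 3.

```java
import java.util.List;

/** Hypothetical helper: turns per-glyph rendering modes into a document-level guess. */
final class OcrLayerHeuristic {

    // Text rendering mode 3 in the PDF specification: neither fill nor stroke (invisible text).
    static final int MODE_INVISIBLE = 3;

    /** Returns the fraction of glyphs drawn invisibly, or 0.0 when there are no glyphs. */
    static double invisibleFraction(List<Integer> glyphModes) {
        if (glyphModes.isEmpty()) {
            return 0.0;
        }
        long invisible = glyphModes.stream()
                .filter(mode -> mode == MODE_INVISIBLE)
                .count();
        return (double) invisible / glyphModes.size();
    }

    /** Guesses "OCR text layer" when at least the given share of glyphs is invisible. */
    static boolean looksLikeOcrLayer(List<Integer> glyphModes, double threshold) {
        return invisibleFraction(glyphModes) >= threshold;
    }
}
```

A threshold below 1.0 leaves room for documents that mix a scanned page with a few visible stamps or page numbers. The right value is a judgment call and would need tuning on real files.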

In most cases, the original image is still present, and the OCR'd text sits invisibly underneath it.

So one possibility would be to find out whether there is an image covering the whole area that contains text.
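The geometric part of that idea can be sketched as follows, assuming the bounding boxes of the page's images and text have already been extracted (with pdfbox or otherwise) into the same coordinate system. The class `ImageCoverage` and its methods are hypothetical names used only for illustration; the sketch uses nothing but `java.awt.geom.Rectangle2D` from the JDK.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical helper: checks whether a single image spans all text on a page. */
final class ImageCoverage {

    /** Returns the smallest rectangle containing every box, or null if the list is empty. */
    static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                result = (Rectangle2D) box.clone(); // copy so the caller's box is not modified
            } else {
                result.add(box); // grows result to the union of both rectangles
            }
        }
        return result;
    }

    /** Returns true when at least one image box fully contains the combined text area. */
    static boolean imageCoversText(List<Rectangle2D> imageBoxes, List<Rectangle2D> textBoxes) {
        Rectangle2D textArea = union(textBoxes);
        if (textArea == null) {
            return false; // no text, so there is nothing for an image to cover
        }
        for (Rectangle2D image : imageBoxes) {
            if (image.contains(textArea)) {
                return true;
            }
        }
        return false;
    }
}
```

This is strict containment. Real scans may need a small tolerance, since an OCR text box can stick out slightly past the image edge, and a page assembled from several image tiles would need the image boxes merged first.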

Another possibility would be to look at the fonts and make some smart decisions based on them.
