Improving OCR performance on multi-paragraph scans

I'm working on a project that involves extracting text from scientific papers stored in PDF format. For most papers, this is accomplished quite easily using PDFMiner, but some older papers store their text as large images. In essence, a paper is scanned and that image file (typically PNG or JPEG) comprises the entire page.

I tried using the Tesseract engine through its python-tesseract bindings, but the results are quite disappointing.

Before diving into the questions I have about this library, I would like to mention that I'm open to suggestions for other OCR libraries. There seem to be few native Python solutions.

Here is one such image (JPEG) from which I am trying to extract text. I used the exact code provided in the example snippets on the python-tesseract Google Code page I linked to above. I should mention that the documentation is a bit sparse, so it's quite possible that one of the many options in my code is misconfigured. Any advice (or links to in-depth tutorials) would be much appreciated.

Here is the output from my attempt at OCR.

My questions are as follows:

  1. Is there anything suboptimal in the code I'm using? Is there a better way of doing this? A different library perhaps?
  2. What kind of preprocessing can I perform to improve detection? The images are all B&W, but should I perhaps set a threshold and set anything above it to a single-value black color and everything below it to a null-value white color? Anything else?
  3. A more specific question: can performance be improved by performing OCR on single words? If so, can anyone suggest a way of delimiting single words in an image file (e.g. the one linked above) and extracting them into separate images which can be treated independently?
  4. Can the presence of graphs and other images embedded in the PDF page image interfere with OCR? Should I remove these? If so, can anyone suggest a method for removing them automatically?

EDIT: For simplicity, here is the code I used.

import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

mImgFile = "eurotext.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result

And here is the alternative code (whose results are not shown in this question, although the performance appears to be quite similar).

import cv2.cv as cv
import tesseract

api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

image=cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image,api)
text=api.GetUTF8Text()
conf=api.MeanTextConf()

Could anyone explain the differences between these two snippets?

Tesseract is very good on clean input text (like your example) if you tinker a bit. Some suggestions:

  • Before automating, start with tesseract at the command line
  • Restrict your character set if possible (e.g. take a look in /usr/local/share/tessdata/configs at ./digits, and configure it for upper- and lower-case English characters, etc.) and provide it as a command-line argument
  • Only use PNG or TIFF images (TIFF for older versions), as JPG introduces artefacts
  • Upsample the image so your text is larger than the current tiny font. Tesseract likes characters more than 10 pixels high (if memory serves); it certainly performs worse with tiny characters
  • No need to do thresholding if you're bi-level already, but it won't hurt if you do, and you then see exactly the same image that tesseract will see
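
To make the first two suggestions concrete, here is a minimal sketch that writes a character-restricting config file modelled on the stock digits one and builds the matching command line. The helper names, the config name "letters", and the file name page.png are made up for illustration; it also assumes tesseract will pick up the config file when given its path (otherwise copy it into the tessdata/configs directory and pass just its name).

```python
import os
import string
import tempfile


def write_whitelist_config(directory, allowed_chars, name="letters"):
    """Write a Tesseract config file restricting the recognised characters.

    Mirrors the stock 'digits' config, which sets tessedit_char_whitelist.
    The config name 'letters' is an arbitrary choice for this sketch.
    """
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("tessedit_char_whitelist " + allowed_chars + "\n")
    return path


def build_command(image_path, output_base, config_path, lang="eng"):
    """Return the argument list: tesseract image outputbase -l lang configfile."""
    return ["tesseract", image_path, output_base, "-l", lang, config_path]


if __name__ == "__main__":
    # Upper/lower-case English letters, digits and a little punctuation.
    allowed = string.ascii_letters + string.digits + ".,;:()-"
    config = write_whitelist_config(tempfile.mkdtemp(), allowed)
    # 'page.png' is a placeholder input; the text would land in 'page.txt'.
    print(" ".join(build_command("page.png", "page", config)))
```

Running the printed command by hand on a few pages first makes it easy to compare settings before wiring anything into Python.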

I'll check back here to see if I can help more, but do join the tesseract mailing list; they're really helpful.

Sidenote: I have some patches for pytesseract, which I ought to publish, for getting characters, confidences and words via the API (which wasn't possible a couple of months back). Shout if they might be useful.

The first example reads the file into a buffer and then relays it to tesseract-ocr without any modification, while the second one reads the file into OpenCV format, which then allows you to do some image touch-up, like changing the aspect ratio, gray scale, etc., using the cv library. The second method is very useful if you want to do image manipulation before passing the image to tesseract.
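
As a rough illustration of the kind of touch-up that could sit between loading the image and handing it to tesseract, the sketch below thresholds and enlarges an image. It is written in plain Python on nested lists of 0-255 grey values so that it runs without any imaging library; that representation and the function names are assumptions made for this example, and with real scans OpenCV's own resize and threshold routines would be the sensible choice.

```python
def threshold(pixels, cutoff=128):
    """Make a greyscale image bi-level: below cutoff -> 0 (black), else 255 (white)."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


def upsample(pixels, factor):
    """Nearest-neighbour enlargement: each pixel becomes a factor x factor block."""
    enlarged = []
    for row in pixels:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [value for value in row for _ in range(factor)]
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


if __name__ == "__main__":
    # A tiny 2x2 stand-in for a scanned page.
    page = [[10, 200],
            [130, 90]]
    for row in upsample(threshold(page), 2):
        print(row)
```

Nearest-neighbour scaling keeps a bi-level image bi-level; smoother interpolation may read better on greyscale scans, in which case thresholding would come after the resize rather than before it.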

BTW, I am the owner of python-tesseract. If you want to ask a question, you are always welcome to forward it to http://code.google.com/p/python-tesseract

Joe
