Improving OCR performance on multi-paragraph scans
I'm working on a project that involves extracting text from scientific papers stored in PDF format. For most papers, this is accomplished quite easily using PDFMiner, but some older papers store their text as large images. In essence, a paper is scanned and that image file (typically PNG or JPEG) comprises the entire page.
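One way to tell the two kinds of papers apart automatically is to check how much text each page's text layer yields. The sketch below is an assumption rather than part of the workflow described here: it uses `extract_text` from pdfminer.six (the maintained fork of PDFMiner), relies on its plain-text output separating pages with a form feed, and the helper names and the 50-character cut-off are hypothetical.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return indices of pages whose text layer is (almost) empty.

    min_chars is an arbitrary, hypothetical cut-off: a scanned page usually
    yields no extractable text at all, or only a few stray characters.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        # Ignore whitespace so that blank lines do not count as content.
        if len("".join(text.split())) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text layer of each page with pdfminer.six.

    Assumes pdfminer.six is installed and that its plain-text output
    ends every page with a form feed character.
    """
    from pdfminer.high_level import extract_text  # imported lazily

    pages = extract_text(pdf_path).split("\f")
    # The trailing form feed leaves one empty item after the last page.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages
```

Calling `pages_needing_ocr(extract_page_texts("paper.pdf"))` (with a hypothetical file name) would list the pages that probably have to go through OCR, while the rest can be handled by the text extractor alone.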
I tried using the Tesseract engine through its python-tesseract bindings, but the results are quite disappointing.
Before diving into the questions I have about this library, I would like to mention that I'm open to suggestions for other OCR libraries. There seem to be few native Python solutions.
Here is one such image (JPEG) from which I am trying to extract text. I used the exact code provided in the example snippets on the python-tesseract Google Code page I linked to above. I should mention that the documentation is a bit sparse, so it's quite possible that one of the many options in my code is misconfigured. Any advice (or links to in-depth tutorials) would be much appreciated.
Here is the output from my attempt at OCR.
My questions are as follows:
EDIT: For simplicity, here is the code I used.
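import tesseract
# Set up the Tesseract API with the English language data.
api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
# Read the still-encoded JPEG bytes and hand them to Tesseract unchanged.
mImgFile = "eurotext.jpg"
with open(mImgFile, "rb") as f:
    mBuffer = f.read()
result = tesseract.ProcessPagesBuffer(mBuffer, len(mBuffer), api)
print("result(ProcessPagesBuffer)= %s" % result)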
And here is the alternative code (whose results are not shown in this question, although the performance appears to be quite similar).
import cv2.cv as cv
import tesseract
api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
# Decode the file into an OpenCV grayscale image before passing it to Tesseract.
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
# Recognised text and its mean confidence (0-100).
text = api.GetUTF8Text()
conf = api.MeanTextConf()
Could anyone explain the differences between these two snippets?
Tesseract is very good on clean input text (like your example) if you tinker a bit. Some suggestions:
I'll check back here to see if I can help more, but do join the tesseract mailing list; they're really helpful.
Sidenote - I have some patches for pytesseract, which I ought to publish, for getting characters, confidences and words via the API (which wasn't possible a couple of months back). Shout if they might be useful.
The first example reads the file into a buffer and relays it to tesseract-ocr without any modification, while the second reads the file into an OpenCV image, which lets you touch up the image (for example, changing the aspect ratio or converting to grayscale) using the cv library. The second method is very useful if you want to manipulate the image before passing it to Tesseract.
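As an illustration of the kind of touch-up the second approach makes possible, here is a minimal sketch of binarization with Otsu's threshold, a common clean-up step before OCR. It is not taken from python-tesseract or its documentation: it is written in plain Python on a list of grayscale rows so that it runs without OpenCV, and the function names are hypothetical. With OpenCV you would normally use its built-in thresholding instead.

```python
def otsu_threshold(gray_rows):
    """Return the Otsu threshold for an 8-bit grayscale image.

    gray_rows is a list of rows, each a list of ints in 0..255.
    The returned level t maximizes the between-class variance of the
    two groups "pixel <= t" and "pixel > t".
    """
    histogram = [0] * 256
    for row in gray_rows:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level in range(256):
        background_count += histogram[level]
        background_sum += level * histogram[level]
        if background_count == 0:
            continue  # no pixels at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        mean_bg = background_sum / background_count
        mean_fg = (total_sum - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(gray_rows):
    """Map every pixel to 0 or 255 using the Otsu threshold."""
    threshold = otsu_threshold(gray_rows)
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]
```

The binarized image would then be handed to Tesseract in place of the original, in the same way the second snippet passes its OpenCV image. Whether this actually improves recognition depends on the scan, so it is worth comparing the output with and without it.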
BTW, I am the owner of python-tesseract. If you want to ask a question, you are always welcome to post it at http://code.google.com/p/python-tesseract
Joe