
Image cleaning before OCR application

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. A couple of things I noticed about the accuracy of PyTesser:

  1. File with icons, images and text - 5-10% accurate
  2. File with only text (images and icons erased) - 50-60% accurate
  3. File with stretching (and this is the best part) - stretching the file from 2) above on the x or y axis increased the accuracy by 10-20%

So apparently PyTesser does not take care of font dimension or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard image cleanup procedures (apart from erasing icons and images) that need to be done before applying PyTesser or other libraries, irrespective of the language?

...........

Wow, this post is quite old now. I started my research on OCR again these last couple of days. This time I chucked PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use).
2) After increasing the resolution, the accuracy went up by 80-90%.
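
The two steps above can be sketched in Python by shelling out to ImageMagick and then to Tesseract. This is only a minimal sketch, not the exact commands used in this post: the 300% scale factor, the file names and the helper function names are assumptions, and it assumes the `convert` (ImageMagick) and `tesseract` executables are on your PATH (newer ImageMagick releases may prefer `magick` over `convert`).

```python
import subprocess


def build_upscale_command(src, dst, percent=300):
    """Return an ImageMagick command that enlarges src by percent and writes dst."""
    # percent=300 is an illustrative assumption; tune it for your own scans.
    return ["convert", src, "-resize", f"{percent}%", dst]


def build_ocr_command(image, output_base):
    """Return a Tesseract command; the text is written to output_base + '.txt'."""
    return ["tesseract", image, output_base]


def upscale_and_ocr(src, enlarged, output_base, percent=300):
    """Enlarge the image, then run OCR on the enlarged copy and return the text."""
    subprocess.run(build_upscale_command(src, enlarged, percent), check=True)
    subprocess.run(build_ocr_command(enlarged, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `upscale_and_ocr("scan.png", "scan_big.png", "scan_text")` would enlarge a hypothetical `scan.png` and return the recognized text. Keeping the command builders separate makes it easy to try other scale factors.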

So the Tesseract engine is without doubt the best open source OCR engine on the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. Also, the text layout and formatting in the image make a big difference. It works great on images with just text. Hope this helped.

Not sure if your intent is for commercial use or not, but this works wonders if you're performing OCR on a bunch of similar images.

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

ORIGINAL

After pre-processing with given arguments.

As it turns out, the tesseract wiki has an article that answers this question in the best way I can imagine:


(initial answer, just for the record)

I haven't used PyTesser, but I have done some experiments with tesseract (version: 3.02.02).

If you invoke tesseract on a colored image, it first applies the global Otsu's method to binarize it, and the actual character recognition is then run on the binary (black and white) image.

Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html

Otsu's threshold illustration

As can be seen, 'global Otsu' may not always produce a desirable result.

To better understand what tesseract 'sees', apply Otsu's method to your image and then look at the resulting image.
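
One way to preview this is to compute the global Otsu threshold yourself. The following is a minimal pure-Python sketch of the textbook method (pick the threshold that maximizes the between-class variance). It is not tesseract's own implementation, and the function names and the flat list of 8-bit grayscale values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_low = 0  # number of pixels at or below the candidate threshold
    sum_low = 0     # sum of the intensities of those pixels

    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no pixels in the dark class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def apply_threshold(pixels, threshold):
    """Return a black (0) / white (255) copy: values above the threshold become white."""
    return [255 if value > threshold else 0 for value in pixels]
```

For an image loaded as grayscale values, `apply_threshold(pixels, otsu_threshold(pixels))` gives a rough idea of the black and white image that the recognition step works on.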

In conclusion: the most straightforward method to improve the recognition ratio is to binarize images yourself (most likely you will have to find a good threshold by trial and error) and then pass those binarized images to tesseract.
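
The trial-and-error part could look roughly like the sketch below, which writes one binarized copy of the image per candidate threshold so that each can be inspected or fed to tesseract. The function names, the row-of-lists image representation and the candidate thresholds are assumptions. PGM is used here only because it can be written with the standard library alone; convert the files to another format if your tesseract build does not read it.

```python
def binarize_rows(rows, threshold):
    """Map each 8-bit gray value to 0 (black) or 255 (white) using a fixed threshold."""
    return [[255 if value > threshold else 0 for value in row] for row in rows]


def write_pgm(path, rows):
    """Write rows of 8-bit gray values as a binary PGM (P5) file."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    with open(path, "wb") as handle:
        handle.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            handle.write(bytes(row))


def write_threshold_candidates(rows, thresholds, prefix):
    """Write one binarized PGM per candidate threshold and return the file paths."""
    paths = []
    for threshold in thresholds:
        path = f"{prefix}_{threshold}.pgm"
        write_pgm(path, binarize_rows(rows, threshold))
        paths.append(path)
    return paths
```

For example, `write_threshold_candidates(rows, [100, 140, 180], "page")` would produce `page_100.pgm`, `page_140.pgm` and `page_180.pgm`, and the one that gives the best recognition result tells you which threshold to keep.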

Somebody was kind enough to publish API docs for tesseract, so it is possible to verify the previous statements about the processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix

I know it's not a perfect answer, but I'd like to share with you a video that I saw from PyCon 2013 that might be applicable. It's a little devoid of implementation details, but it just might give you some guidance/inspiration on how to solve/improve your problem.

Link to Video

Link to Presentation

And if you do decide to use ImageMagick to pre-process your source images a little, here is a question that points you to nice Python bindings for it.

On a side note, quite an important thing with Tesseract: you need to train it, otherwise it won't be nearly as good/accurate as it's capable of being.
