Why can't Pytesseract recognize plain white text on black?

I have a lot of images like the one below, and I need to use pytesseract to grab the white text from them:

[image]

I use the following code, but the results are not impressive:

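import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('topLine.png')
# Run OCR on the whole image and print the recognized text
print(pytesseract.image_to_string(im))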

Results:

Rouse Services | Renta Dastbonrd | Blei Rental



RJ |G | B (mmm @

So I thought the cause was the non-text content inside the image. I cropped out the part of the image with the text that matters most to me and ran the same code against it:

[image]

However, all I got was blank output. Pytesseract didn't find any text at all. What am I doing wrong?

To answer your original question: I believe Tesseract's training data consists only of black text on a white background, so it is not surprising that the model fails to pick up the inverse.

As for the solution: if the black box with white text is in the same spot in every image, I would just crop it out, invert it, and paste it back in the same spot. Otherwise, you can use erode/dilate operations with a customized kernel to find these black boxes and build a mask over that part of the image. With that mask you can tell Python, in effect, "here is a black box with white text; invert it."

In my experience, pytesseract has always needed at least some image preprocessing (if not a lot) to produce good output, but even with the most messed-up images I have been able to get accuracies above 93%.
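The crop, invert, and paste-back idea could look roughly like the sketch below. It is only an illustration, not tested against the asker's images: the helper names `invert_region` and `ocr_white_on_black` are made up for this example, and the `box` coordinates are something you would have to measure yourself. It uses Pillow's `ImageOps.invert` and assumes the dark area is a plain rectangle.

```python
from PIL import Image, ImageOps


def invert_region(im, box=None):
    """Return a copy of `im` with `box` (left, upper, right, lower) inverted.

    With box=None the whole image is inverted. (Hypothetical helper.)
    """
    # Work on an RGB copy; ImageOps.invert does not handle RGBA or palette images
    out = im.convert("RGB")
    if box is None:
        return ImageOps.invert(out)
    region = out.crop(box)                    # cut out the white-on-black area
    out.paste(ImageOps.invert(region), box)   # put the inverted patch back in place
    return out


def ocr_white_on_black(path, box=None):
    """Invert the dark area, then hand the result to Tesseract. (Hypothetical helper.)"""
    # Imported here so invert_region can be used without Tesseract installed
    import pytesseract
    with Image.open(path) as im:
        return pytesseract.image_to_string(invert_region(im, box))
```

For an image that is entirely white on black, such as the cropped one in the question, `print(ocr_white_on_black('topLine.png'))` would invert the whole thing before OCR. On Windows you would still set `pytesseract.pytesseract.tesseract_cmd` first, as in the question. Finding the boxes automatically with erode/dilate is a separate step that this sketch does not cover.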
