
Pre-processing image before text recognition with Tesseract

I have a scanned page, and I'm trying to identify and parse the numbers in the image, line by line. To do that, I'm using Python with pytesseract and the following code:

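import cv2
import pytesseract

img = cv2.imread('image.jpg', 0)  # 0 = load as grayscale
# THRESH_TOZERO: pixels <= 110 become 0, brighter pixels keep their value
ret, thresh1 = cv2.threshold(img, 110, 255, cv2.THRESH_TOZERO)
scan_config = r'--oem 3 --psm 6'  # default engine, single uniform text block
extracted_text = pytesseract.image_to_string(thresh1, config=scan_config)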

Input image.jpg:

[image: image.jpg]

Unfortunately, the result is not satisfying: as you can see, the digits in the 4th column are partly erased. A human eye can still identify them, but thresholding makes them even worse:

006442000180
006354924010
005900000461
062891556156
006*3*00000261
006900000261

Does anyone have an idea of how to pre-process the image so that the algorithm can identify even the partly erased digits? By the way, the 2nd argument of the threshold function is hardcoded (110) and probably won't suit every image, since it depends on the photo's quality. Is there a way to generate the value dynamically, or to use an alternative to the threshold approach (maybe OpenCV filters)?

tesseract PzCox.png - --dpi 72 --psm 6

produces this with the (English) best model:

006442000180
006354924010
005300000461
062891556156
006300000261
006300000261
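On the hardcoded 110: one common way to pick the level per image is Otsu's method, which chooses the gray level that maximizes the variance between the dark and the light pixel groups. The sketch below is only an illustration, not something tested on this scan. `otsu_threshold` and `ocr_with_otsu` are hypothetical helper names; the first spells out the calculation in plain Python, and the second assumes OpenCV and pytesseract are installed and lets `cv2.threshold` do the same job via the `cv2.THRESH_OTSU` flag. Whether this recovers the partly erased digits is not verified here.

```python
def otsu_threshold(pixels):
    """Return the 8-bit gray level chosen by Otsu's method.

    `pixels` is any iterable of integers in 0..255.
    Values <= the returned level form one class, values above it the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    count_low = 0   # number of pixels at or below the candidate level
    sum_low = 0     # sum of their gray values
    for level in range(256):
        count_low += hist[level]
        sum_low += level * hist[level]
        count_high = total - count_low
        if count_low == 0:
            continue          # nothing in the dark class yet
        if count_high == 0:
            break             # everything is already in the dark class
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Between-class variance (up to a constant factor of 1 / total**2)
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def ocr_with_otsu(path, config="--oem 3 --psm 6"):
    """Hypothetical helper: binarize with OpenCV's built-in Otsu, then OCR."""
    import cv2            # imported lazily so otsu_threshold works without OpenCV
    import pytesseract

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    # With THRESH_OTSU the threshold argument (0 here) is ignored and computed
    # from the image; the chosen level is returned as the first value.
    level, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return level, pytesseract.image_to_string(binary, config=config)
```

If the lighting is uneven across the page, a single global level may still wipe out faint strokes. In that case `cv2.adaptiveThreshold`, which computes a level per neighbourhood, is the usual alternative to try.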
