
How to read digits from an image using pytesseract

I'm trying to read the digits from this image:


I'm using pytesseract with these settings:

custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)

This is the output:

((E ST7 [71aT6T2 ] THETOGOG5 15 [8)

Whitelisting only digits and changing the page segmentation mode (psm) gives much better results. You also need to strip line breaks and whitespace from the output. The code below does that.

import pytesseract
import re
from PIL import Image

#Open image
im = Image.open("numbers.png")

#Define configuration that only whitelists number characters
custom_config = r'--oem 3 --psm 11 -c tessedit_char_whitelist=0123456789'

#Find the numbers in the image
numbers_string = pytesseract.image_to_string(im, config=custom_config)

#Remove every non-digit character (line breaks, spaces, form feeds)
digits = re.sub(r'\D', '', numbers_string)

#Print the output
print(digits)

The result of the code on your image is: '31477423353'

Unfortunately, a few digits are still missing. As an experiment, I downloaded your image and erased the grid by hand.


After removing the grid and running the code again, pytesseract produces a perfect result: '314774628300558'

So it is worth thinking about how to remove the grid programmatically. There are alternatives to pytesseract, but whichever OCR tool you use, you will get better output when the text is isolated in the image.
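One possible way to remove the grid programmatically is sketched below. It is only a rough heuristic, not a tested solution for this exact image. It assumes the grid is dark on a light background, that the lines are axis-aligned, and that each line spans most of the image. Every pixel row or column that is mostly dark is painted white. The function names and the thresholds (min_fraction, dark_below) are made up for illustration and would likely need tuning.

```python
def remove_grid_lines(pixels, min_fraction=0.8, dark_below=128, fill=255):
    """Paint rows and columns that look like grid lines with the fill value.

    pixels is a list of rows of grayscale values (0-255). A row or column
    counts as a grid line when at least min_fraction of its pixels are
    darker than dark_below. Both thresholds are assumptions to tune.
    The input is not modified; a new list of rows is returned.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if height == 0 or width == 0:
        return [row[:] for row in pixels]

    # Mark every dark pixel once so rows and columns can be counted cheaply.
    dark = [[value < dark_below for value in row] for row in pixels]

    line_rows = {y for y in range(height)
                 if sum(dark[y]) >= min_fraction * width}
    line_cols = {x for x in range(width)
                 if sum(dark[y][x] for y in range(height)) >= min_fraction * height}

    return [[fill if (y in line_rows or x in line_cols) else pixels[y][x]
             for x in range(width)]
            for y in range(height)]


def remove_grid_from_image(path):
    """Load an image, erase grid-like rows and columns, return a PIL image."""
    # Imported here so remove_grid_lines itself has no third-party dependency.
    from PIL import Image

    gray = Image.open(path).convert("L")
    width, height = gray.size
    raw = gray.tobytes()  # one byte per pixel, row by row, in mode "L"
    rows = [list(raw[y * width:(y + 1) * width]) for y in range(height)]

    cleaned = remove_grid_lines(rows)
    flat = bytes(value for row in cleaned for value in row)
    return Image.frombytes("L", (width, height), flat)
```

The image returned by remove_grid_from_image("numbers.png") could then be passed to pytesseract.image_to_string with the same whitelist configuration as above. Digits that touch a grid line will lose the pixels on that line, so the result should be checked by eye before trusting the OCR output.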
