
How to read digits from an image using pytesseract

I'm trying to read the digits from this image:

[image: numbers]

Using pytesseract with these settings:

custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)

This is the output:

((E ST7 [71aT6T2 ] THETOGOG5 15 [8)

Whitelisting only digits and changing the page segmentation mode (--psm) gives much better results. You also need to strip newlines and whitespace from the output. The code below does that.

import re

import pytesseract
from PIL import Image

# Open the image
im = Image.open("numbers.png")

# Whitelist digit characters only; psm 11 treats the image as sparse text
custom_config = r'--oem 3 --psm 11 -c tessedit_char_whitelist=0123456789'

# Run OCR on the image
numbers_string = pytesseract.image_to_string(im, config=custom_config)

# Remove every non-digit character (newlines, spaces, stray letters)
numbers_int = re.sub(r'\D', '', numbers_string)

# Print the output
print(numbers_int)

The result of the code on your image is: '31477423353'

Unfortunately, a few digits are still missing. As an experiment, I downloaded your image and erased the grid.

[image: the picture with the grid erased]

After removing the grid and executing the code again, pytesseract produces a perfect result: '314774628300558'

So it is worth thinking about how to remove the grid programmatically. There are alternatives to pytesseract, but whichever OCR engine you use, you will get better output when the text is isolated in the image.
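One possible way to remove the grid in code is sketched below. It is not what I did for the experiment above, and it rests on assumptions I have not checked against your image: that the grid lines are dark on a light background, run exactly horizontally or vertically, and span most of the picture. The function name remove_grid_lines and both threshold parameters are made up for this illustration. The idea is to whiten every row and column in which almost all pixels are dark.

```python
import numpy as np


def remove_grid_lines(gray, dark_threshold=128, line_fraction=0.8):
    """Whiten rows/columns that look like grid lines (hypothetical helper).

    gray: 2-D array of 8-bit grayscale pixel values.
    dark_threshold: pixels below this value count as dark (assumed value).
    line_fraction: a row or column is treated as a grid line when at
        least this fraction of its pixels is dark (assumed value).
    """
    gray = np.asarray(gray)
    dark = gray < dark_threshold

    # Fraction of dark pixels in each row and in each column
    line_rows = dark.mean(axis=1) >= line_fraction
    line_cols = dark.mean(axis=0) >= line_fraction

    # Work on a copy so the caller's array is left untouched
    cleaned = gray.copy()
    cleaned[line_rows, :] = 255
    cleaned[:, line_cols] = 255
    return cleaned
```

To try it with the code above, load the picture as grayscale with np.array(Image.open("numbers.png").convert("L")), pass the array through remove_grid_lines, and wrap the result in Image.fromarray before handing it to pytesseract. Pixels of a digit that touch a grid line are whitened too, so the thresholds will likely need tuning for a real image.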
