How to extract text or numbers from images using Python
I want to extract text (mainly numbers) from images like this:
I tried this code:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open('1.jpg')
text = pytesseract.image_to_string(img, lang='eng')
print(text)
But all I get is this: (hE PPAR)
When performing OCR, it is important to preprocess the image so that the text to detect is black on a white background. A simple way to do this is to apply Otsu's threshold with OpenCV, which produces a binary image. Here's the image after preprocessing:
We use the --psm 6 configuration setting to treat the image as a uniform block of text. There are other configuration options you can try.
Result from Pytesseract:
01153521976
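Since the question is mainly about numbers, one way to explore the other configuration options is to loop over a few page segmentation modes and, optionally, restrict recognition to digits. The sketch below is not part of the original answer: the helper names build_config and try_modes are hypothetical, and the digit whitelist relies on Tesseract's tessedit_char_whitelist variable, which some Tesseract versions may ignore.

```python
def build_config(psm=6, digits_only=False):
    """Build a string for pytesseract's `config` argument (hypothetical helper)."""
    # Tesseract's page segmentation modes are numbered 0 to 13.
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if digits_only:
        # Assumption: the installed Tesseract honours this whitelist variable.
        parts.append("-c tessedit_char_whitelist=0123456789")
    return " ".join(parts)


def try_modes(image, modes=(6, 7, 8, 13), digits_only=True):
    """Run OCR once per mode and return {psm: recognised text}."""
    # Imported here so build_config can be used without Tesseract installed.
    import pytesseract

    results = {}
    for psm in modes:
        config = build_config(psm, digits_only)
        text = pytesseract.image_to_string(image, lang="eng", config=config)
        results[psm] = text.strip()
    return results
```

Calling try_modes(thresh) on a thresholded image returns one string per mode, which makes it easy to see which mode suits a given picture. Modes 6, 7, 8 and 13 correspond to a uniform block, a single line, a single word and a raw line respectively.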
Code
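import cv2
import pytesseract

# Point pytesseract at the Tesseract binary (default Windows install path)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image as grayscale (flag 0), then binarize it with an inverted Otsu threshold
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Run OCR, treating the image as a single uniform block of text
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
print(data)

# Show the thresholded image until a key is pressed
cv2.imshow('thresh', thresh)
cv2.waitKey()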
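To make the Otsu step less of a black box, here is a minimal pure-Python sketch of the idea behind it: pick the gray level that maximises the variance between the dark and bright pixel classes, then map each pixel to 0 or 255. The function names otsu_threshold and binarize_inv are made up for illustration. OpenCV's own implementation is optimised and may choose a different value when several thresholds are equally good, so treat this as an explanation, not a replacement.

```python
def otsu_threshold(pixels):
    """Return a threshold t for 8-bit gray values; pixels <= t form the dark class."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break  # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor)
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize_inv(pixels, t):
    """Inverted binary threshold: values above t become 0, the rest 255."""
    return [0 if p > t else 255 for p in pixels]
```

For a flat list of gray values with two clear clusters, otsu_threshold returns a level that separates them, and binarize_inv then turns the darker cluster white and the brighter one black.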