Python - Pytesseract extracts incorrect text from image
I used the code below in Python to extract text from an image:
import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with OpenCV
    img = cv2.imread(img_path)

    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after noise removal
    cv2.imwrite(src_path + "removed_noise.png", img)

    # Apply threshold to get an image with only black and white
    # img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...
    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with Tesseract for Python
    result = pytesseract.image_to_string(Image.open(img_path))  # src_path + "thres.png"))

    # Remove template file
    # os.remove(temp)
    return result

print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is:
The output received is '0001' instead of 'D001'.
The output received is '3001' instead of 'B001'.
What code changes are required to retrieve the right characters from the image, and how can pytesseract be trained to return the right characters for all font types in an image (including bold characters)?
@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
You can still improve your final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider, if they help:
Try different config parameters in the line below:
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
For example, pass a config string as an argument to image_to_string:
result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')  # or Image.open(src_path + "thres.png")
Try changing the psm (page segmentation mode) value and compare the results.
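The comparison suggested above can be scripted rather than done by hand. The following is only a rough sketch, not a guaranteed fix: the helper names compare_psm_modes and matching_modes are hypothetical, the psm values tried (6, 7, 8, 13) are just a plausible starting set for short labels, and the pattern assumes every label is one capital letter followed by three digits, as in 'D001' and 'B001' from the question. The ocr parameter is there only so the helper can be exercised without Tesseract installed.

```python
import re

# Assumption: labels look like one capital letter followed by three digits ("D001").
LABEL_PATTERN = re.compile(r"^[A-Z]\d{3}$")


def compare_psm_modes(image, psm_values=(6, 7, 8, 13), oem=3, ocr=None):
    """Run OCR once per page segmentation mode and return {psm: text}."""
    if ocr is None:
        # Imported lazily so the helper can also be called with a stub OCR function.
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for psm in psm_values:
        config = "--psm {} --oem {}".format(psm, oem)
        results[psm] = ocr(image, config=config).strip()
    return results


def matching_modes(results, pattern=LABEL_PATTERN):
    """Return the psm values whose output fits the expected label format."""
    return [psm for psm, text in results.items() if pattern.match(text)]
```

With the question's setup, this could be called as results = compare_psm_modes(Image.open(src_path + "thres.png")) and then matching_modes(results) lists the modes worth a closer look. A pattern match only shows that the output has the right shape, not that the letter was read correctly, so the candidates still need to be checked against the image.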
-- Good Luck --