Why can't I get the string with PIL and pytesseract?

Question

This is a simple Optical Character Recognition (OCR) program in Python 3 that should extract a string from an image. I have uploaded the target GIF file here; please download it and save it as /tmp/target.gif.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.gif')))

I have pasted the full error output here. Please help me fix it so that I can get the characters from the image.

/usr/lib/python3/dist-packages/PIL/Image.py:925: UserWarning: Couldn't allocate palette entry for transparency
  "for transparency")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 309, in image_to_string
    }[output_type]()
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 308, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 208, in run_and_get_output
    temp_name, input_filename = save_image(image)
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 136, in save_image
    image.save(input_file_name, format=img_extension, **image.info)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 407, in _save
    _get_local_header(fp, im, (0, 0), flags)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 441, in _get_local_header
    transparency = int(transparency)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

I converted it with the convert command in bash.

convert  "/tmp/target.gif"   "/tmp/target.jpg"

Here are /tmp/target.gif and /tmp/target.jpg.

Then I ran the Python code above again.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.jpg')))

pytesseract.image_to_string(Image.open('/tmp/target.jpg')) returns nothing; I get an empty string.

For Trenton_M's code:

>>> img1 = remove_noise_and_smooth(r'/tmp/target.jpg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in remove_noise_and_smooth
AttributeError: 'NoneType' object has no attribute 'astype'

For Thalish Sajeed's code:

The error output caused by print(pytesseract.image_to_string(Image.open(filename))) is omitted.

Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> import pytesseract
>>> import matplotlib.pyplot as plt
>>> import cv2
>>> import numpy as np
>>> 
>>> 
>>> def display_image(filename, length_box=60, width_box=30):
...     if type(filename) == np.ndarray:
...         image = filename
...     else:
...         image = cv2.imread(filename)
...     plt.figure(figsize=(length_box, width_box))
...     plt.imshow(image, cmap="gray")
... 
>>> 
>>> filename = r"/tmp/target.jpg"
>>> display_image(filename)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in display_image
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/pyplot.py", line 2699, in imshow
    None else {}), **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/__init__.py", line 1810, in inner
    return func(ax, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/axes/_axes.py", line 5494, in imshow
    im.set_data(X)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/image.py", line 634, in set_data
    raise TypeError("Image data cannot be converted to float")
TypeError: Image data cannot be converted to float
>>>

@Thalish Sajeed, why do I get 9244K instead of 0244k with your code? Here is my tested sample file.

The extracted string.

@Trenton_M, I corrected a small typo and an omission in your code, and deleted the plt.show() line as you suggested.

>>> import cv2,pytesseract
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> 
>>> 
>>> def image_smoothening(img):
...     ret1, th1 = cv2.threshold(img, 88, 255, cv2.THRESH_BINARY)
...     ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
...     blur = cv2.GaussianBlur(th2, (5, 5), 0)
...     ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
...     return th3
... 
>>> 
>>> def remove_noise_and_smooth(file_name):
...     img = cv2.imread(file_name, 0)
...     filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 41)
...     kernel = np.ones((1, 1), np.uint8)
...     opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
...     closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
...     img = image_smoothening(img)
...     or_image = cv2.bitwise_or(img, closing)
...     return or_image
... 
>>> 
>>> cv2_thresh_list = [cv2.THRESH_BINARY, cv2.THRESH_TRUNC, cv2.THRESH_TOZERO]
>>> fn = r'/tmp/target.jpg'
>>> img1 = remove_noise_and_smooth(fn)
>>> img2 = cv2.imread(fn, 0)
>>> for i, img in enumerate([img1, img2]):
...     img_type = {0: 'Preprocessed Images\n',
...                 1: '\nUnprocessed Images\n'}
...     print(img_type[i])
...     for item in cv2_thresh_list:
...         print('Thresh: {}'.format(str(item)))
...         _, thresh = cv2.threshold(img, 127, 255, item)
...         plt.imshow(thresh, 'gray')
...         f_name = '{0}.jpg'.format(str(item))
...         plt.savefig(f_name)
...         print('OCR Result: {}\n'.format(pytesseract.image_to_string(f_name)))

...
Preprocessed Images

In my console, the complete output is as follows:

Thresh: 0
<matplotlib.image.AxesImage object at 0x7fbc2519a6d8>
OCR Result: 10
15
20 

Edﬁﬁ
10
2 o 30 40 so
so

Thresh: 2
<matplotlib.image.AxesImage object at 0x7fbc255e7eb8>
OCR Result: 10
15
20
Edﬁﬁ
10
2 o 30 40 so
so
Thresh: 3
<matplotlib.image.AxesImage object at 0x7fbc25452fd0>
OCR Result: 10
15
20
Edﬁﬁ
10
2 o 30 40 so
so
Unprocessed Images
Thresh: 0
<matplotlib.image.AxesImage object at 0x7fbc25464c88>
OCR Result: 10
15
20
Thresh: 2
<matplotlib.image.AxesImage object at 0x7fbc254520f0>
OCR Result: 10
15
2o
2o
30 40 50
Thresh: 3
<matplotlib.image.AxesImage object at 0x7fbc1e1968d0>
OCR Result: 10
15
20

Where is the string 0244R?

Answer 1

Let's start with the JPG image, because pytesseract has issues operating on GIF images (reference).

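import cv2
import pytesseract

filename = "/tmp/target.jpg"
image = cv2.imread(filename)
# Convert to grayscale, then binarize: pixels above 55 become white, the rest black.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, threshold = cv2.threshold(gray, 55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))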

Let's try to break down the issues here.

Your image is too noisy for the Tesseract engine to identify the letters. We use simple image processing techniques such as grayscaling and thresholding to remove some of the noise from the image.

Then, when we send it to the OCR engine, we see that the letters are captured more accurately.
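To make the two steps concrete, here is a minimal NumPy sketch of what grayscaling and binary thresholding compute. It is an illustration, not the OpenCV implementation: the helper names to_gray and binary_threshold are hypothetical, the weights are the standard luma coefficients that cv2.COLOR_BGR2GRAY is documented to use, and 55 is the threshold from the snippet in this answer.

```python
import numpy as np


def to_gray(bgr):
    """Weighted sum of the B, G, R channels (hypothetical helper, illustration only)."""
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    gray = 0.114 * b + 0.587 * g + 0.299 * r
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


def binary_threshold(gray, thresh=55, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; all others become 0."""
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)
```

Dark speckles below the threshold collapse to black and everything brighter becomes white, which is why a well-chosen threshold can remove part of the background noise.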

You can find the notebook where I tested this by following this GitHub link.

Edit - I have updated the notebook with some additional image cleaning techniques. The source image is too noisy for Tesseract to work on directly out of the box. You need to use image cleaning techniques.

You can vary the thresholding parameters or swap out Gaussian blur for some other technique until you get the desired results.
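One way to vary the threshold systematically is to try several values and compare what the OCR engine returns for each. The sketch below assumes a grayscale uint8 array. sweep_thresholds is a hypothetical helper, and the OCR function is passed in as a parameter so that pytesseract.image_to_string (or any other callable) can be plugged in.

```python
import numpy as np


def sweep_thresholds(gray, thresholds, ocr):
    """Binarize `gray` at each threshold and collect the OCR text per threshold.

    `ocr` is any callable that takes an image array and returns a string.
    """
    results = {}
    for t in thresholds:
        binary = np.where(gray > t, 255, 0).astype(np.uint8)
        results[t] = ocr(binary).strip()
    return results
```

For example, sweep_thresholds(gray, range(40, 130, 10), pytesseract.image_to_string) would return one OCR result per threshold, assuming a pytesseract version that accepts NumPy arrays, as the snippet in this answer does.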

If you are looking to run OCR on noisy images, please check out commercial OCR providers such as google-cloud-vision. They provide 1000 free OCR calls per month.

Answer 2

First: make certain you've installed the Tesseract program (not just the Python package).
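A quick way to check this from Python is to look for the executable on your PATH. This is a small sketch using only the standard library; find_tesseract is a hypothetical helper name, and the executable is assumed to be called tesseract.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If it returns None, install Tesseract or point pytesseract.pytesseract.tesseract_cmd at the full path of the executable.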

Jupyter Notebook of Solution: only the image passed through remove_noise_and_smooth is successfully translated with OCR.

When attempting to convert image.gif, TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple' is generated.

If you simply rename image.gif to image.jpg, the same TypeError is generated.

If you open image.gif and 'save as' image.jpg, the output is blank, which means the text wasn't recognized.
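A possible workaround for the GIF case, sketched under assumptions: the traceback in the question shows the failure happening when the image is re-saved as GIF together with its transparency entry, so converting the GIF yourself and dropping that metadata may sidestep it. gif_to_png is a hypothetical helper, and it uses only Pillow calls (Image.open, convert, save). Note that this addresses only the TypeError; a noisy image will likely still need cleaning before any text is recognized.

```python
from PIL import Image


def gif_to_png(src_path, dst_path):
    """Convert the first frame of a GIF to a grayscale PNG without GIF metadata."""
    with Image.open(src_path) as im:
        gray = im.convert("L")  # e.g. palette image -> 8-bit grayscale
    # Drop GIF-specific metadata such as 'transparency' so it is not
    # carried along when the image is saved again.
    gray.info.clear()
    gray.save(dst_path, format="PNG")
    return dst_path
```

The resulting PNG path can then be opened with Image.open and passed to pytesseract.image_to_string in the usual way.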

from PIL import Image
import pytesseract

# If you don't have tesseract executable in your PATH, include the following:
# your path may be different than mine
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"

imgo = Image.open('0244R_clean.jpg')

print(pytesseract.image_to_string(imgo))

No text is recognized from the original image, so it may need additional processing to clean it up prior to OCR.
I created a clean image, from which pytesseract extracts the text without issue. The image is included below, so you can test it with your own code to verify functionality.

Add Post-Processing

Improve Accuracy of OCR using Image Preprocessing

OpenCV

import cv2
import numpy as np
import matplotlib.pyplot as plt
import pytesseract


def image_smoothening(img):
    ret1, th1 = cv2.threshold(img, 88, 255, cv2.THRESH_BINARY)
    ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    blur = cv2.GaussianBlur(th2, (5, 5), 0)
    ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return th3


def remove_noise_and_smooth(file_name):
    img = cv2.imread(file_name, 0)
    filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 41)
    kernel = np.ones((1, 1), np.uint8)
    opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
    closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
    img = image_smoothening(img)
    or_image = cv2.bitwise_or(img, closing)
    return or_image


cv2_thresh_list = [cv2.THRESH_BINARY, cv2.THRESH_TRUNC, cv2.THRESH_TOZERO]

fn = r'/tmp/target.jpg'
img1 = remove_noise_and_smooth(fn)
img2 = cv2.imread(fn, 0)
for i, img in enumerate([img1, img2]):
    img_type = {0: 'Preprocessed Images\n',
                1: '\nUnprocessed Images\n'}
    print(img_type[i])
    for item in cv2_thresh_list:
        print('Thresh: {}'.format(str(item)))
        _, thresh = cv2.threshold(img, 127, 255, item)
        plt.imshow(thresh, 'gray')
        f_name = '{}_{}.jpg'.format(i, str(item))
        plt.savefig(f_name)
        print('OCR Result: {}\n'.format(pytesseract.image_to_string(f_name)))

img1 will generate the following new images:

img2 will generate these new images:
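One thing to check if the OCR output consists of short numbers such as 10, 15, 20 rather than the expected text: plt.savefig writes the whole Matplotlib figure, which by default includes the axes and tick labels, so those labels may be what Tesseract is reading. This is an assumption based on the output shown in the question, not something confirmed in the thread. Here is a sketch of writing the thresholded array itself, pixel for pixel, using Pillow (save_array_as_image is a hypothetical helper; cv2.imwrite would serve the same purpose):

```python
import numpy as np
from PIL import Image


def save_array_as_image(arr, path):
    """Write a 2-D uint8 array to disk exactly as it is, with no axes or margins."""
    Image.fromarray(np.asarray(arr, dtype=np.uint8)).save(path)
    return path
```

The saved file then has the same width and height as the array, and it can be passed to pytesseract.image_to_string in place of the figure saved by plt.savefig.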
