
How to detect subscript numbers in an image using OCR?

I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text that includes subscript-style numbers - the subscript number is interpreted as a letter instead.

For example, in the basic image:

[image: the text "CH" with a subscript "3"]

I want to extract the text as "CH3", i.e. I am not concerned about knowing that the number 3 was a subscript in the image.

My attempt at this using tesseract is:

import cv2
import pytesseract

img = cv2.imread('test.jpeg')

# Note that I have reduced the region of interest to the known 
# text portion of the image
text = pytesseract.image_to_string(
    img[200:300, 200:320], config='-l eng --oem 1 --psm 13'
)
print(text)

Unfortunately, this will incorrectly output

'CHs'

It's also possible to get 'CHa', depending on the psm parameter.

I suspect that this issue is related to the "baseline" of the text being inconsistent across the line, but I'm not certain.

How can I accurately extract the text from this type of image?

Update - 19th May 2020

After seeing Achintha Ihalage's answer, which doesn't pass any configuration options to tesseract, I explored the psm options.

Since the region of interest is known (in this case, I am using EAST detection to locate the bounding box of the text), the psm config option for tesseract, which in my original code treats the text as a single line, may not be necessary. Running image_to_string against the region of interest given by the bounding box above gives the output

CH

3

which can, of course, be easily processed to get CH3.
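A minimal sketch of that post-processing step is to collapse all whitespace in the raw tesseract output (the raw string below is an assumed example of what image_to_string returns here, with the subscript recognised on its own line):

```python
# Assumed raw output from pytesseract.image_to_string on the bounding-box region.
raw = "CH\n\n3\n"

# Collapse all whitespace (spaces, newlines) to recover the flat text.
clean = "".join(raw.split())
print(clean)  # CH3
```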

This is because the subscript font is too small. You could resize the image using a Python package such as cv2 or PIL, and use the resized image for OCR as coded below.

import pytesseract
import cv2

img = cv2.imread('test.jpg')
img = cv2.resize(img, None, fx=2, fy=2)  # scaling factor = 2

data = pytesseract.image_to_string(img)
print(data)

OUTPUT:

CH3

You want to apply pre-processing to your image before feeding it into tesseract, to increase the accuracy of the OCR. I use a combination of PIL and cv2 to do this here, because cv2 has good filters for blur/noise removal (dilation, erosion, thresholding) while PIL makes it easy to enhance the contrast (distinguishing the text from the background), and I wanted to show how pre-processing could be done using either (using both together is not 100% necessary, though, as shown below). You could write this more elegantly - it's just the general idea.

import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance



def cv2_preprocess(image_path):
  img = cv2.imread(image_path)

  # convert to black and white if not already
  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # remove noise
  kernel = np.ones((1, 1), np.uint8)
  img = cv2.dilate(img, kernel, iterations=1)
  img = cv2.erode(img, kernel, iterations=1)

  # apply a blur 
  # gaussian noise
  img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

  # this can be used for salt and pepper noise (not necessary here)
  #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

  cv2.imwrite('new.jpg', img)
  return 'new.jpg'

def pil_enhance(image_path):
  image = Image.open(image_path)
  contrast = ImageEnhance.Contrast(image)
  contrast.enhance(2).save('new2.jpg')
  return 'new2.jpg'


img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))


text = pytesseract.image_to_string(img)
print(text)

Output:

CH3

The cv2 pre-processing produces an image that looks like this:

[image: result of cv2_preprocess]

The enhancement with PIL gives you:

[image: result of pil_enhance]

In this specific example, you can actually stop after the cv2_preprocess step, because that is already clear enough for the reader:

img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)
print(text)

Output:

CH3

But if you are working with images that don't necessarily start with a white background (i.e. grey-scaling converts the background to light grey instead of white), I have found the PIL step really helps there.

The main point is that the methods to increase tesseract's accuracy typically are:

  1. fix DPI (rescaling)
  2. fix the brightness/noise of the image
  3. fix the text size/lines (skewed/warped text)

Doing one of these, or all three of them, will help... but the brightness/noise fix can be more generalizable than the other two (at least in my experience).
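As a rough sketch of point 1: tesseract is commonly reported to work best when capital letters are around 30 px tall (roughly 300 DPI for typical print). A hypothetical helper (the 30 px target and the function itself are assumptions, not part of the original answer) to pick a resize factor from an estimated text height:

```python
def scale_factor(text_height_px, target_height_px=30):
    """Return the factor to resize an image so the estimated text height
    reaches the target (~30 px is often cited as working well for tesseract).
    Never downscale below 1.0, to avoid throwing away detail."""
    return max(1.0, target_height_px / text_height_px)

# e.g. subscript glyphs measured at ~12 px tall:
print(scale_factor(12))  # 2.5
```

The resulting factor can then be passed to cv2.resize as fx/fy, as in the answer above.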

I think this approach can be more suitable for the general case.

import cv2
import pytesseract
from pathlib import Path

image = cv2.imread('test.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]  # suitable for sharp black-and-white images
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]  # handle OpenCV 2.4/4.x vs 3.x return values
result_list = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    area = cv2.contourArea(c)
    if area > 200:
        detect_area = image[y:y + h, x:x + w]
        # detect_area = cv2.GaussianBlur(detect_area, (3, 3), 0)
        predict_char = pytesseract.image_to_string(detect_area, lang='eng', config='--oem 0 --psm 10')
        result_list.append((x, predict_char))
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), thickness=2)

result = ''.join([char for _, char in sorted(result_list, key=lambda _x: _x[0])])
print(result)  # CH3


output_dir = Path('./temp')
output_dir.mkdir(parents=True, exist_ok=True)
cv2.imwrite(f"{output_dir/Path('image.png')}", image)
cv2.imwrite(f"{output_dir/Path('clean.png')}", thresh)
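The `contours[0] if len(contours) == 2` line above handles the differing return shapes of cv2.findContours: OpenCV 3.x returns (image, contours, hierarchy), while 2.4 and 4.x return (contours, hierarchy). The same unpacking logic in isolation (the tuples below are stand-ins, not real OpenCV output):

```python
def unpack_contours(ret):
    # OpenCV 2.4/4.x: (contours, hierarchy); OpenCV 3.x: (image, contours, hierarchy)
    return ret[0] if len(ret) == 2 else ret[1]

# Stand-in return values for the two API shapes:
print(unpack_contours((["c1", "c2"], None)))         # ['c1', 'c2']
print(unpack_contours(("img", ["c1", "c2"], None)))  # ['c1', 'c2']
```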

MORE REFERENCE

I strongly suggest you refer to the following examples, which are useful references for OCR:

  1. Get the location of all text present in image using opencv
  2. Using YOLO or other image recognition techniques to identify all alphanumeric text present in images

[image: detected character bounding boxes drawn on the input]
