
How to detect subscript numbers in an image using OCR?

I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text that includes subscript-style numbers - the subscript number is interpreted as a letter instead.

For example, in the basic image:

[image: the text "CH" with a subscript "3"]

I want to extract the text as "CH3", i.e. I am not concerned about knowing that the number 3 was a subscript in the image.

My attempt at this using tesseract is:

import cv2
import pytesseract

img = cv2.imread('test.jpeg')

# Note that I have reduced the region of interest to the known 
# text portion of the image
text = pytesseract.image_to_string(
    img[200:300, 200:320], config='-l eng --oem 1 --psm 13'
)
print(text)

Unfortunately, this will incorrectly output

'CHs'

It's also possible to get 'CHa', depending on the psm parameter.

I suspect that this issue is related to the "baseline" of the text being inconsistent across the line, but I'm not certain.

How can I accurately extract the text from this type of image?

Update - 19th May 2020

After seeing Achintha Ihalage's answer, which doesn't pass any configuration options to tesseract, I explored the psm options.

Since the region of interest is known (in this case, I am using EAST detection to locate the bounding box of the text), the psm config option for tesseract, which in my original code treats the text as a single line, may not be necessary. Running image_to_string against the region of interest given by the bounding box above gives the output

CH

3

which can, of course, be easily processed to get CH3.
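A minimal sketch of that post-processing step is to collapse all whitespace in the raw tesseract output (the raw string below is an assumed example of what image_to_string returns here, with the subscript recognised on its own line):

```python
# Assumed raw output from pytesseract.image_to_string on the bounding-box region.
raw = "CH\n\n3\n"

# Collapse all whitespace (spaces, newlines) to recover the flat text.
clean = "".join(raw.split())
print(clean)  # CH3
```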

This is because the subscript font is too small. You could resize the image using a Python package such as cv2 or PIL, and use the resized image for OCR as coded below.

import pytesseract
import cv2

img = cv2.imread('test.jpg')
img = cv2.resize(img, None, fx=2, fy=2)  # scaling factor = 2

data = pytesseract.image_to_string(img)
print(data)

OUTPUT:

CH3

You want to apply pre-processing to your image before feeding it into tesseract, to increase the accuracy of the OCR. I use a combination of PIL and cv2 to do this here, because cv2 has good filters for blur/noise removal (dilation, erosion, thresholding) while PIL makes it easy to enhance the contrast (distinguishing the text from the background), and I wanted to show how pre-processing could be done using either (using both together is not 100% necessary, though, as shown below). You could write this more elegantly - it's just the general idea.

import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance



def cv2_preprocess(image_path):
  img = cv2.imread(image_path)

  # convert to black and white if not already
  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # remove noise
  kernel = np.ones((1, 1), np.uint8)
  img = cv2.dilate(img, kernel, iterations=1)
  img = cv2.erode(img, kernel, iterations=1)

  # apply a blur 
  # gaussian noise
  img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

  # this can be used for salt and pepper noise (not necessary here)
  #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

  cv2.imwrite('new.jpg', img)
  return 'new.jpg'

def pil_enhance(image_path):
  image = Image.open(image_path)
  contrast = ImageEnhance.Contrast(image)
  contrast.enhance(2).save('new2.jpg')
  return 'new2.jpg'


img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))


text = pytesseract.image_to_string(img)
print(text)

Output:

CH3

The cv2 pre-processing produces an image that looks like this:

[image: result of cv2_preprocess]

The enhancement with PIL gives you:

[image: result of pil_enhance]

In this specific example, you can actually stop after the cv2_preprocess step, because that is already clear enough for the reader:

img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)
print(text)

Output:

CH3

But if you are working with images that don't necessarily start with a white background (i.e. grey-scaling converts the background to light grey instead of white), I have found the PIL step really helps there.

The main point is that the methods to increase tesseract's accuracy typically are:

  1. fix DPI (rescaling)
  2. fix the brightness/noise of the image
  3. fix the text size/lines (skewed/warped text)

Doing one of these, or all three of them, will help... but the brightness/noise fix can be more generalizable than the other two (at least in my experience).
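As a rough sketch of point 1: tesseract is commonly reported to work best when capital letters are around 30 px tall (roughly 300 DPI for typical print). A hypothetical helper (the 30 px target and the function itself are assumptions, not part of the original answer) to pick a resize factor from an estimated text height:

```python
def scale_factor(text_height_px, target_height_px=30):
    """Return the factor to resize an image so the estimated text height
    reaches the target (~30 px is often cited as working well for tesseract).
    Never downscale below 1.0, to avoid throwing away detail."""
    return max(1.0, target_height_px / text_height_px)

# e.g. subscript glyphs measured at ~12 px tall:
print(scale_factor(12))  # 2.5
```

The resulting factor can then be passed to cv2.resize as fx/fy, as in the answer above.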

I think this approach can be more suitable for the general case.

import cv2
import pytesseract
from pathlib import Path

image = cv2.imread('test.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]  # suitable for sharp black-and-white images
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]  # handle OpenCV 2.4/4.x vs 3.x return values
result_list = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    area = cv2.contourArea(c)
    if area > 200:
        detect_area = image[y:y + h, x:x + w]
        # detect_area = cv2.GaussianBlur(detect_area, (3, 3), 0)
        predict_char = pytesseract.image_to_string(detect_area, lang='eng', config='--oem 0 --psm 10')
        result_list.append((x, predict_char))
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), thickness=2)

result = ''.join([char for _, char in sorted(result_list, key=lambda _x: _x[0])])
print(result)  # CH3


output_dir = Path('./temp')
output_dir.mkdir(parents=True, exist_ok=True)
cv2.imwrite(f"{output_dir/Path('image.png')}", image)
cv2.imwrite(f"{output_dir/Path('clean.png')}", thresh)
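The `contours[0] if len(contours) == 2` line above handles the differing return shapes of cv2.findContours: OpenCV 3.x returns (image, contours, hierarchy), while 2.4 and 4.x return (contours, hierarchy). The same unpacking logic in isolation (the tuples below are stand-ins, not real OpenCV output):

```python
def unpack_contours(ret):
    # OpenCV 2.4/4.x: (contours, hierarchy); OpenCV 3.x: (image, contours, hierarchy)
    return ret[0] if len(ret) == 2 else ret[1]

# Stand-in return values for the two API shapes:
print(unpack_contours((["c1", "c2"], None)))         # ['c1', 'c2']
print(unpack_contours(("img", ["c1", "c2"], None)))  # ['c1', 'c2']
```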

MORE REFERENCE

I strongly suggest you refer to the following examples, which are useful references for OCR:

  1. Get the location of all text present in image using opencv
  2. Using YOLO or other image recognition techniques to identify all alphanumeric text present in images

[image: detected character bounding boxes drawn on the input]
