
OCR not performing well on clean image | Python Pytesseract

I have been working on a project that involves extracting text from an image. From my research, tesseract is one of the best libraries available, so I decided to use it along with opencv, which is needed for the image manipulation.

I have been experimenting a lot with the tesseract engine, but it does not seem to give me the expected results. I have attached the image as a reference. The output I got is:

1] =501 [

Instead, the expected output is:

TM10-50%L

What I have done so far:

  • Remove noise
  • Adaptive threshold
  • Send it to the tesseract OCR engine

Are there any other suggestions to improve the algorithm?

Thanks in advance.

Snippet of the code:

import cv2
import sys
import pytesseract
import numpy as np
from PIL import Image

if __name__ == '__main__':
  if len(sys.argv) < 2:
    print('Usage: python ocr_simple.py image.jpg')
    sys.exit(1)

  # Read image path from command line
  imPath = sys.argv[1]
  gray  = cv2.imread(imPath, 0)
  # Blur
  blur  = cv2.GaussianBlur(gray,(9,9), 0)
  # Binarizing
  thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 5, 3)
  text = pytesseract.image_to_string(thresh)
  print(text)
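For context, adaptive thresholding computes a separate threshold for every pixel from its local neighbourhood, which is what makes it robust to uneven lighting. A minimal numpy sketch of the mean-based variant (the Gaussian variant used in the snippet just weights the neighbourhood), with made-up pixel values:

```python
import numpy as np

def adaptive_threshold_mean(img, block=3, C=0):
    """Per-pixel threshold: mean of the block x block neighbourhood minus C.
    Pixels above their local threshold become 255, the rest become 0."""
    pad = block // 2
    padded = np.pad(img.astype(float), pad, mode='edge')
    out = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            local_mean = padded[y:y + block, x:x + block].mean()
            out[y, x] = 255 if img[y, x] > local_mean - C else 0
    return out

# A bright stroke on a darker background survives its local threshold,
# while the uniform background does not
img = np.array([[10, 10, 10],
                [10, 200, 10],
                [10, 10, 10]], dtype=np.uint8)
out = adaptive_threshold_mean(img)
print(out)  # only the centre pixel is 255
```

The `blockSize` and `C` arguments of `cv2.adaptiveThreshold` play the same roles as `block` and `C` here.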

Images attached. The first image is the original image:

Original image

The second image is what was fed to tesseract:

Input to tesseract

Before performing OCR on an image, it's important to preprocess it. The idea is to obtain a processed image where the text to extract is black on a white background. For this specific image, we need to obtain the ROI before we can OCR.

To do this, we can convert to grayscale, apply a slight Gaussian blur, then adaptive threshold to obtain a binary image. From here, we can apply morphological closing to merge the individual letters together. Next we find contours, filter using contour area filtering, and then extract the ROI. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look at the Tesseract page segmentation modes for more options.
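The morphological closing step is what merges the separate letter blobs into one connected region, so that a single contour covers the whole label. A toy numpy sketch of closing (dilation followed by erosion) on a one-row "image", assuming binary 0/1 pixels rather than 0/255:

```python
import numpy as np

def dilate(img, k=3):
    # A pixel turns on if any pixel in its 1 x k horizontal window is on
    pad = k // 2
    p = np.pad(img, ((0, 0), (pad, pad)), mode='constant')
    h, w = img.shape
    return np.array([[p[y, x:x + k].max() for x in range(w)]
                     for y in range(h)], dtype=img.dtype)

def erode(img, k=3):
    # A pixel stays on only if its whole 1 x k window is on; edge padding
    # keeps border pixels from being eaten at the image boundary
    pad = k // 2
    p = np.pad(img, ((0, 0), (pad, pad)), mode='edge')
    h, w = img.shape
    return np.array([[p[y, x:x + k].min() for x in range(w)]
                     for y in range(h)], dtype=img.dtype)

def close(img, k=3):
    # Closing = dilation then erosion: small gaps between blobs are filled
    # while the overall blob extent is roughly preserved
    return erode(dilate(img, k), k)

# Two "letters" separated by a one-pixel gap merge into a single blob
row = np.array([[1, 1, 0, 1, 1]], dtype=np.uint8)
print(close(row))  # [[1 1 1 1 1]]
```

`cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)` does the same thing in 2-D, with the kernel shape and `iterations` controlling how large a gap gets bridged.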


Detected ROI


Extracted ROI


Result from Pytesseract OCR

TM10=50%L

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Adaptive threshold
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5)

# Perform morph close to merge letters together
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, contour area filtering, extract ROI
cnts, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for c in cnts:
    area = cv2.contourArea(c)
    if area > 1800 and area < 2500:
        x,y,w,h = cv2.boundingRect(c)
        ROI = original[y:y+h, x:x+w]
        cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 3)

# Perform text extraction
ROI = cv2.GaussianBlur(ROI, (3,3), 0)
data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
print(data)

cv2.imshow('ROI', ROI)
cv2.imshow('close', close)
cv2.imshow('image', image)
cv2.waitKey()
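The area band (1800 to 2500) in the loop above is tuned to the size of this particular label and will need adjusting for a different image resolution. The filtering idea itself is simple; a toy illustration with hypothetical (width, height) boxes standing in for real contours, using box area as a stand-in for `cv2.contourArea`:

```python
# Made-up bounding boxes: tiny noise blob, the text block, a large border blob
boxes = [(5, 5), (50, 40), (80, 70)]          # areas: 25, 2000, 5600
keep = [(w, h) for w, h in boxes if 1800 < w * h < 2500]
print(keep)  # [(50, 40)] -- only the text-sized blob survives
```

If no contour falls inside the band, the loop above never assigns `ROI`, so it is worth checking that at least one candidate was found before calling `pytesseract`.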
