How to improve Tesseract accuracy

Question

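I'm trying to run OCR on a set of images that are similar but differ in size. For some reason I can't get predictable results. Is there anything I can do to get better results?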

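With or without cv2 preprocessing, Tesseract works well on some images but fails on others, and there is no discernible pattern. The images are all more or less alike. The image above shows one of the processed images.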

import string

import cv2
import numpy as np
import pytesseract
from PIL import Image, ImageOps


def filter_img(img):
    # Convert the PIL image to an RGB NumPy array for OpenCV
    img = np.array(img.convert("RGB"))
    # Upscale 2x so small glyphs have more pixels
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

    # Convert to grayscale (needed before thresholding); PIL arrays are RGB, not BGR
    img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)

    # Erosion intended to remove some noise; note that a 1x1 kernel leaves the image unchanged
    kernel = np.ones((1, 1), np.uint8)
    # img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Apply blur to smooth out the edges
    img = cv2.GaussianBlur(img, (5, 5), 0)
    # img = cv2.medianBlur(img, 5)
    # Apply Otsu's threshold to get an image with only b&w (binarization)
    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    img = Image.fromarray(img)
    # Add a 2-pixel black border
    img = ImageOps.expand(img, border=2, fill='black')
    # Debugging helper; `visualize` and `boxes` are not defined in this snippet
    # visualize.show_labeled_image(img, boxes)
    return img


# Applying Tesseract OCR
def run_tesseract(img):
    # Tesseract cmd setup
    # pytesseract.pytesseract.tesseract_cmd = "tesseract"
    whitelist = string.ascii_uppercase + string.digits + ".-"
    parameters = '-c load_freq_dawg=0 -c tessedit_char_whitelist="{}"'.format(whitelist)
    psm = 8  # page segmentation mode 8: treat the image as a single word
    custom_oem_psm_config = "--dpi 300 --oem 3 --psm {psm} {parameters}".format(parameters=parameters, psm=psm)
    try:
        text = pytesseract.image_to_string(img, config=custom_oem_psm_config, timeout=2)
        return text.strip()
    except RuntimeError:
        # pytesseract raises RuntimeError when the timeout expires
        print("TIMEOUT")
    return ""

Answer 1

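If your images have a highly consistent layout, you could crop each image to the region of interest before running OCR. Then, after OCR, apply conditional checks to the places that are prone to errors, such as the leading letters or digits; for example, 0 and O are easily confused. Of course, all of this only works if the images are highly consistent.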

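    import cv2
    import matplotlib.pyplot as plt
    import pytesseract

    # Point pytesseract at the Tesseract executable (Windows install path)
    pytesseract.pytesseract.tesseract_cmd = 'D://Program Files/Tesseract-OCR/tesseract.exe'

    img = cv2.imread('vATKQ.png')

    # Crop to the region you want: rows 100-249, columns 180-649
    img2 = img[100:250, 180:650]
    # OpenCV loads images as BGR; convert to RGB so the preview shows true colors
    plt.imshow(cv2.cvtColor(img2, cv2.COLOR_BGR2RGB))
    plt.show()
    text = pytesseract.image_to_string(img2)
    print(text)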
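
The conditional check on error-prone characters mentioned above could look roughly like the sketch below. It assumes every label follows a known per-position layout (letters here, digits there), which the original post does not specify. The helper name `fix_confusions`, the layout string format, and every look-alike pair other than 0/O are hypothetical choices made for illustration.

```python
# Hypothetical helper: the answer names only the 0/O confusion; the other
# pairs below are assumed look-alikes, not taken from the original post.
TO_DIGIT = {"O": "0", "I": "1", "S": "5", "B": "8"}
TO_LETTER = {digit: letter for letter, digit in TO_DIGIT.items()}


def fix_confusions(text, layout):
    """Correct look-alike characters using a per-position layout.

    layout uses 'A' where a letter is expected, '9' where a digit is
    expected, and any other character to leave that position untouched.
    """
    if len(text) != len(layout):
        # Length mismatch: the layout does not apply, so return the text as-is.
        return text
    fixed = []
    for char, kind in zip(text, layout):
        if kind == "9":
            fixed.append(TO_DIGIT.get(char, char))
        elif kind == "A":
            fixed.append(TO_LETTER.get(char, char))
        else:
            fixed.append(char)
    return "".join(fixed)


print(fix_confusions("A8-1O5", "AA-999"))  # AB-105
```

A function like this would be applied to the stripped string returned by `pytesseract.image_to_string`. It only helps when the layout really is fixed, which matches the caveat that the images must be highly consistent.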

Solution 1
0 2020-08-19 02:26:43
