Tesseract returning gibberish when performing OCR on image

I'm trying to use Tesseract to read an image, but it returns gibberish. I know I need to do some pre-processing, but what I have found online doesn't seem to work with my image. I tried this answer to turn the picture from white letters on a black background into black letters on a white background, without success.

This is the picture.

And my simple code:

from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract'

img = Image.open("2020-01-25_17-57-49_UTC.jpg")
print(pytesseract.image_to_string(img))

Cobbling together code found here on SO:

from PIL import Image
import PIL.ImageOps
import pytesseract

img = Image.open("8pjs0.jpg")
inverted_image = PIL.ImageOps.invert(img)
print(pytesseract.image_to_string(inverted_image))

gives me:

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA

I think you'll need some sort of language pack for the accented characters.
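
As a rough sketch of the language-pack idea: pytesseract's image_to_string accepts a lang argument, and Tesseract lets you combine several languages with "+". The example below assumes the text is Spanish (code "spa"), that the Spanish traineddata file is installed, and that your pytesseract version provides get_languages; the helper names choose_languages and ocr_with_languages are made up for illustration.

```python
def choose_languages(wanted, installed, fallback="eng"):
    """Join the wanted language codes that are installed with '+',
    Tesseract's syntax for combining languages. Falls back to
    `fallback` when none of the wanted languages are available."""
    available = [code for code in wanted if code in installed]
    return "+".join(available) if available else fallback


def ocr_with_languages(image, wanted=("spa", "eng")):
    """Hypothetical wrapper: OCR `image` with whichever wanted languages exist."""
    # Imported lazily so choose_languages can be used without Tesseract installed.
    import pytesseract

    installed = pytesseract.get_languages(config="")
    lang = choose_languages(wanted, installed)
    return pytesseract.image_to_string(image, lang=lang)
```

With the inverted image from the snippet above, this would be called as ocr_with_languages(inverted_image). Whether it recovers the accents depends on the image quality and on the installed language data.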

A simple Otsu's threshold to obtain a binary image, followed by an inversion to get black letters on a white background, seems to work. We use --psm 3 to tell Pytesseract to perform automatic page segmentation. Take a look at "Pytesseract OCR multiple config options" for more configuration options. Here's the preprocessed image:

Result from Pytesseract OCR

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA
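
For intuition about the Otsu step described above, here is a minimal pure-Python sketch of the idea: pick the threshold that maximises the between-class variance of the grayscale histogram, binarize, then invert. It is only an illustration on a toy list of pixel values, not OpenCV's implementation, and its tie-breaking may differ from what cv2.threshold returns. The function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return a threshold t in 0..255 that maximises between-class variance.

    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize_and_invert(pixels, t):
    """Map p > t to 255 and the rest to 0, then invert (255 - value)."""
    return [255 - (255 if p > t else 0) for p in pixels]


# Toy data: dark background values and bright "letter" values.
pixels = [10, 12, 11, 240, 245, 250]
t = otsu_threshold(pixels)
print(t, binarize_and_invert(pixels, t))
```

After the inversion, the originally bright letters become 0 (black) and the dark background becomes 255 (white), which is the layout Tesseract generally handles best.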

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, threshold, invert
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
result = 255 - thresh

# Perform OCR with Pytesseract
data = pytesseract.image_to_string(result, config='--psm 3')
print(data)

cv2.imshow('result', result)
cv2.waitKey()
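
If --psm 3 does not suit a different image, the config string is the place to experiment. The helper below is a hypothetical convenience (build_config is not part of pytesseract) that assembles the string from a page segmentation mode and an optional OCR engine mode. It assumes Tesseract 4 or later, where --psm accepts 0 to 13 and --oem accepts 0 to 3.

```python
def build_config(psm=3, oem=None):
    """Build a Tesseract config string such as '--psm 6 --oem 1'.

    psm: page segmentation mode (3 = fully automatic, the default;
         6 = assume a single uniform block of text; 11 = sparse text).
    oem: optional OCR engine mode.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        if not 0 <= oem <= 3:
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {oem}")
    return " ".join(parts)


# Print the config strings for a few modes worth trying.
for mode in (3, 6, 11):
    print(build_config(psm=mode))
```

The resulting string can be passed as the config argument, for example pytesseract.image_to_string(result, config=build_config(psm=6)).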
