
Tesseract returning gibberish when performing OCR on image

I'm trying to use Tesseract to read an image, but it returns gibberish. I know I need to do some pre-processing, but what I have found online doesn't seem to work with my image. I tried this answer to turn the picture from black background/white letters to white background/black letters without success.

This is the picture.

And my simple code:

from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract'

img = Image.open("2020-01-25_17-57-49_UTC.jpg")
print(pytesseract.image_to_string(img))

Cobbling together code found here on SO:

from PIL import Image
import PIL.ImageOps
import pytesseract

img = Image.open("8pjs0.jpg")
inverted_image = PIL.ImageOps.invert(img)
print(pytesseract.image_to_string(inverted_image))

gives me

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA

I think you'll need a language pack for the accented characters.
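As a rough sketch of what using a language pack could look like: pytesseract's image_to_string accepts a lang argument that is passed on to Tesseract. The code below assumes the text is Spanish and that the Spanish trained data ("spa") is installed alongside Tesseract; the helper name ocr_with_language is made up for illustration.

```python
def ocr_with_language(image_path, lang="spa"):
    """Run OCR on an image file using a specific Tesseract language pack.

    Assumption: the trained data for `lang` (here Spanish, "spa") is
    installed; otherwise Tesseract reports an error when loading it.
    """
    # Imported here so that defining the helper does not require
    # Pillow or Tesseract to be present.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # `lang` selects the language model, which is what lets
        # Tesseract emit accented characters such as "ó" or "ú".
        return pytesseract.image_to_string(img, lang=lang)
```

Calling ocr_with_language("8pjs0.jpg") on the inverted image would then use the Spanish model instead of the default English one. Whether the accents actually come through still depends on the image quality.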

A simple Otsu's threshold to obtain a binary image, followed by an inversion to get black letters on a white background, seems to work. We use --psm 3 to tell Tesseract to perform fully automatic page segmentation, which is also its default mode. Take a look at Pytesseract OCR multiple config options for more configuration options. Here's the preprocessed image:

Result from Pytesseract OCR

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, threshold, invert
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
result = 255 - thresh

# Perform OCR with Pytesseract
data = pytesseract.image_to_string(result, config='--psm 3')
print(data)

cv2.imshow('result', result)
cv2.waitKey()
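
To make the Otsu step above less of a black box, here is a minimal pure-Python sketch of what the thresholding and inversion compute. It is for illustration only: cv2.threshold with cv2.THRESH_OTSU does this for you and much faster. The function names otsu_threshold and binarize_inverted are made up, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance."""
    # Histogram of the 256 possible gray levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the lower class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_inverted(pixels, threshold):
    """Threshold, then invert: bright pixels become 0, the rest 255."""
    return [0 if p > threshold else 255 for p in pixels]


# Toy "image": dark background (10) with bright text pixels (200).
pixels = [10, 10, 200, 10, 200, 200]
t = otsu_threshold(pixels)
print(binarize_inverted(pixels, t))  # text pixels become 0, background 255
```

The inversion matters because Tesseract generally does better with dark text on a light background, which is why the answer computes 255 - thresh before running OCR.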
