Python text extraction with Tesseract

Question

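I am trying to extract text from an image using Python and Tesseract. I have tried several times and every extraction has failed. Why is Tesseract unable to extract the text? Here is the image: [ ]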

Code

import cv2
import pytesseract as pt
inp = "./image.jpg"
img = cv2.imread(inp)
print(pt.image_to_string(img))

Version

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

Answer 1

You can solve this by doing some preprocessing with OpenCV before passing the image to Tesseract:

import cv2  # pip install opencv-python
import pytesseract  # pip install pytesseract

# Read the image as grayscale (flag 0); change the path to your file
image = cv2.imread("test.jpg", 0)
# Binarize with Otsu's automatic threshold, inverting black and white
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Treat the image as a single uniform block of text (--psm 6) and
# only allow the characters 0123456789: in the result
print(pytesseract.image_to_string(thresh, lang='eng',
           config='--psm 6 -c tessedit_char_whitelist=0123456789:'))

Output:

05:26:34
09:04:24
01:00:31
01:14:36
01:17:43
02:31:05
02:35:41
05:32:42
03:26:09
02:44:11
02:56:00
02:32:42
02:35:16
07:16:10
07:18:36
07:19:00
07:19:32
07:21:17
07:21:48
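
For anyone wondering what the thresholding step does, here is a rough pure-Python sketch of Otsu's method followed by an inverted binarization. It only illustrates the idea on a flat list of 8-bit gray values. It is not OpenCV's implementation, and the function names `otsu_threshold` and `binarize_inv` are made up for this example.

```python
def otsu_threshold(pixels):
    """Pick the threshold t (0..255) that maximises the between-class
    variance of the two groups: values <= t and values > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of those pixel values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # nothing in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break     # everything is in the lower class from here on
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize_inv(pixels, t, maxval=255):
    """Inverted binary threshold: values above t become 0, the rest maxval."""
    return [0 if p > t else maxval for p in pixels]


# Hypothetical sample: three dark pixels (text) and three bright ones (background)
sample = [10, 12, 11, 200, 205, 210]
t = otsu_threshold(sample)
print(t, binarize_inv(sample, t))
```

The point of the inversion is simply to swap dark and bright pixels after the split; whether the inverted or the non-inverted image reads better depends on your picture, so it is worth trying both.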

Keep in mind that you also need to install Tesseract itself and add it to your PATH.
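
A quick way to check the PATH part from Python is to look the binary up with the standard library. This is only a sketch; the helper name `find_tesseract` is made up for this example.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or add its folder to PATH")
else:
    print("tesseract found at", path)
    # If you prefer not to touch PATH, pytesseract can also be pointed at
    # the binary explicitly:
    #   import pytesseract
    #   pytesseract.pytesseract.tesseract_cmd = path
```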

If you get a lot of random output, or Tesseract reports that it cannot find the language "eng", there is a simple fix. If you are on Linux, cd into /usr/local/share/tessdata or /usr/share/tessdata and run

sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

This downloads the English trained-data file and will hopefully fix the problem.
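
To see whether the file is already in place before downloading anything, something like the following could be used. It is a sketch that only checks the two directories mentioned above; your installation may keep its tessdata somewhere else, and the helper name `find_traineddata` is made up for this example.

```python
from pathlib import Path

# The two locations mentioned above; adjust if your install differs.
CANDIDATE_DIRS = ["/usr/local/share/tessdata", "/usr/share/tessdata"]


def find_traineddata(lang="eng", dirs=CANDIDATE_DIRS):
    """Return the path of <lang>.traineddata in the first directory that has it, else None."""
    for directory in dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


found = find_traineddata("eng")
if found is None:
    print("eng.traineddata not found in", CANDIDATE_DIRS)
else:
    print("eng.traineddata found at", found)
```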

Tesseract version:

>> tesseract --version
tesseract 4.1.1
 leptonica-1.81.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.5

Solution 1
1 accepted 2021-06-29 11:29:00
