
Text cannot be read using pyTesseract

Original post: 2018-06-13 09:19:21 8 1 python/ ocr/ tesseract

I am trying to extract a logo from PDFs.

I am applying GaussianBlur, finding the contours, and extracting only the image. But Tesseract cannot read the text from that image.

This is the extracted image:

1 answer

Removing the frame around the letters often helps Tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
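The answer does not say how the frame was removed, so here is one possible way to do it, as a minimal sketch. It assumes a grayscale image with dark text on a light background and a frame made of straight horizontal and vertical lines. The function name `remove_frame_lines` and both threshold parameters are made up for illustration and would need tuning on a real logo.

```python
import numpy as np


def remove_frame_lines(gray, dark_threshold=128, line_fraction=0.8):
    """Paint long straight dark lines (e.g. a rectangular frame) white.

    gray: 2-D uint8 array, dark text on a light background (assumption).
    A row or column is treated as a frame line when at least
    `line_fraction` of its pixels are darker than `dark_threshold`.
    """
    dark = gray < dark_threshold
    cleaned = gray.copy()  # leave the caller's array untouched

    # Fraction of dark pixels in each row and in each column.
    frame_rows = dark.mean(axis=1) >= line_fraction
    frame_cols = dark.mean(axis=0) >= line_fraction

    cleaned[frame_rows, :] = 255
    cleaned[:, frame_cols] = 255
    return cleaned


# Synthetic example: a white image with a one-pixel frame and a small
# dark block standing in for a letter.
image = np.full((20, 40), 255, dtype=np.uint8)
image[0, :] = image[-1, :] = 0
image[:, 0] = image[:, -1] = 0
image[8:12, 10:15] = 0

without_frame = remove_frame_lines(image)
```

Letters rarely fill most of a row or column, so they survive while the frame lines are erased. A frame that is skewed, rounded, or broken would not be caught by this heuristic.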

With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I can think of a few ways off the top of my head, but I think the most generic solution is likely to be a pipeline that combines a text detection algorithm with OCR.

Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your Tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here are the links:
- Getting started with Tesseract - Part I: Introduction
- Getting started with Tesseract - Part II: Image Pre-processing
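To give a rough idea of what such pre-processing can look like, here is a minimal sketch of two common steps: upscaling and Otsu binarization. These are not necessarily the techniques covered in the linked guide. The helper names (`upscale`, `otsu_threshold`, `binarize`, `read_text`) are hypothetical, and the input is assumed to be a 2-D `uint8` grayscale array. The OCR step assumes that `pytesseract`, Pillow, and the Tesseract binary are installed, and it is only imported when `read_text` is called.

```python
import numpy as np


def upscale(gray, factor=2):
    """Enlarge the image by repeating pixels (nearest neighbour)."""
    return np.repeat(np.repeat(gray, factor, axis=0), factor, axis=1)


def otsu_threshold(gray):
    """Return the Otsu threshold (0-255) for a 2-D uint8 image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)

    weight_bg = np.cumsum(hist)            # pixels with value <= t
    weight_fg = hist.sum() - weight_bg     # pixels with value > t
    sum_bg = np.cumsum(hist * levels)
    sum_total = sum_bg[-1]

    valid = (weight_bg > 0) & (weight_fg > 0)
    mean_bg = np.zeros(256)
    mean_fg = np.zeros(256)
    mean_bg[valid] = sum_bg[valid] / weight_bg[valid]
    mean_fg[valid] = (sum_total - sum_bg[valid]) / weight_fg[valid]

    # Between-class variance; thresholds that leave a class empty are excluded.
    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    between[~valid] = -1
    return int(np.argmax(between))


def binarize(gray):
    """Map pixels above the Otsu threshold to 255 and the rest to 0."""
    return np.where(gray > otsu_threshold(gray), 255, 0).astype(np.uint8)


def read_text(gray):
    """Upscale, binarize, and pass the result to Tesseract.

    Assumes pytesseract, Pillow, and the Tesseract binary are installed.
    """
    import pytesseract
    from PIL import Image

    prepared = binarize(upscale(gray))
    # "--psm 7" tells Tesseract to treat the image as a single text line,
    # which is an assumption that suits a short logo.
    return pytesseract.image_to_string(Image.fromarray(prepared), config="--psm 7")
```

Whether these steps help depends on the image, so it is worth comparing the OCR output with and without each one.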
However, if you're also interested in this particular logo or font, you can try training Tesseract on this font by following the instructions given here.