
Text cannot be read using pyTesseract

I am trying to extract a logo from PDFs.

I am applying GaussianBlur, finding the contours, and extracting only the image. But Tesseract cannot read the text from that image.

This is the extracted image

Removing the frame around the letters often helps Tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.

[Image: the logo with the frame removed]
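One way to drop the frame automatically is to clear every ink region that touches the edge of the crop. The sketch below is only an illustration of that idea, not the method used to produce the image above. It assumes the crop has already been binarized into rows of 0/1 values (1 = ink) and that the frame touches the crop border while the letters do not. The function name `clear_border_components` is made up for this example.

```python
from collections import deque


def clear_border_components(mask):
    """Return a copy of `mask` with every border-connected ink region erased.

    `mask` is a list of rows holding 0 (background) or 1 (ink).
    Assumption: the frame touches the crop border, the letters do not.
    """
    if not mask or not mask[0]:
        return []
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]  # work on a copy, leave the input intact
    queue = deque()

    # Seed the flood fill with every ink pixel lying on the border.
    for y in range(height):
        for x in range(width):
            on_border = y in (0, height - 1) or x in (0, width - 1)
            if on_border and out[y][x]:
                out[y][x] = 0
                queue.append((y, x))

    # Erase everything 8-connected to those seeds.
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and out[ny][nx]:
                    out[ny][nx] = 0
                    queue.append((ny, nx))
    return out
```

A pure-Python loop like this is fine for a small crop but slow on large scans. If scikit-image is available, `skimage.segmentation.clear_border` applies the same idea to NumPy arrays.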

With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I can think of a few ways off the top of my head, but I think the most generic solution is likely to be a pipeline that combines a text detection algorithm with OCR.

  1. Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
  2. You can also step up your Tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:

  3. However, if you're also interested in this particular logo, or font, you can try training Tesseract with this font by following the instructions given here.
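To make the pre-processing suggestion in item 2 concrete, here is a minimal sketch, assuming OpenCV (`cv2`) and `pytesseract` are installed and the Tesseract binary is on the PATH. The steps chosen (grayscale, 2x upscaling, Otsu thresholding) and the helper names `tesseract_config` and `read_logo` are my own picks for illustration, not taken from the guides mentioned above. Which steps actually help depends on the logo.

```python
def tesseract_config(psm=7, oem=3):
    """Build the config string handed to Tesseract.

    psm 7 treats the image as a single text line, which suits a short logo;
    oem 3 lets Tesseract pick its default OCR engine.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def read_logo(path, scale=2.0):
    """Grayscale, upscale, and Otsu-binarize the image at `path`, then OCR it."""
    # Imported here so the config helper above works without OpenCV installed.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(path)

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Small text is often easier to recognize after enlarging it.
    gray = cv2.resize(gray, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)
    # Otsu picks the black/white threshold from the histogram.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    text = pytesseract.image_to_string(binary, config=tesseract_config())
    return text.strip()
```

Calling `read_logo("logo.png")` (a placeholder file name) returns whatever Tesseract recognizes. If the result is empty, trying a different `psm` value or inverting the binary image are cheap things to test next.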

