
Extraction of text from the image

I am trying to extract text from an image using tesseract-ocr. [image]

Result from the first image:

[image: Result 1]

Now, this works perfectly fine on this one: [image]

Result from the second image:

[image]

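try:
    from PIL import Image  # Pillow
except ImportError:
    import Image  # fallback for the legacy standalone PIL package
import pytesseract

# Run Tesseract OCR on the input image and print the recognized text.
print(pytesseract.image_to_string(Image.open('input.png')))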

But it fails to read the text from the first image. I have shown the results for both the first and the second image. The only difference I can spot between the two images is the box enclosing the whole first image.

I have also tried this with pdf-miner, and the same result persists. I cannot understand what exactly is happening. What could be the reason?

Tesseract works best on clean black text against a solid white background. It also works well when the text is approximately horizontal and the text height is at least 20 pixels, although I have seen it work with vertical text as well.
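Those conditions can be approximated with a little preprocessing before calling Tesseract. The following is only a minimal sketch under stated assumptions: the helper names `upscale_factor`, `prepare_for_ocr` and `ocr_file`, the hand-measured text height, and the fixed threshold of 128 are all made up for illustration. Only the 20-pixel figure comes from the advice above.

```python
import math


def upscale_factor(text_height_px, min_height_px=20):
    """Return the integer factor needed so text is at least min_height_px tall."""
    if text_height_px <= 0:
        raise ValueError("text_height_px must be positive")
    return max(1, math.ceil(min_height_px / text_height_px))


def prepare_for_ocr(image, text_height_px, threshold=128):
    """Grayscale, enlarge, and binarize a PIL image (hypothetical helper).

    text_height_px is a rough glyph height that you measure yourself;
    threshold=128 is an assumed cut-off, not a tuned value.
    """
    gray = image.convert("L")  # 8-bit grayscale
    factor = upscale_factor(text_height_px)
    if factor > 1:
        gray = gray.resize((gray.width * factor, gray.height * factor))
    # Map every pixel to pure black or pure white.
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_file(path, text_height_px):
    """Open an image, clean it up, and return the text Tesseract finds."""
    # Imported lazily so the helpers above work without these packages.
    from PIL import Image
    import pytesseract

    cleaned = prepare_for_ocr(Image.open(path), text_height_px)
    return pytesseract.image_to_string(cleaned)
```

For example, `ocr_file('input.png', text_height_px=10)` would double the image size before binarizing it. A fixed threshold is the simplest choice and will likely need adjusting for images with uneven lighting.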

If the text has a surrounding border, the border may be detected as random text, which is what happens with your first image. You can either crop out the border or run a text detection algorithm before passing the image to Tesseract.
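The cropping option can be sketched roughly as follows. This assumes the box sits close to the image edges, so trimming a fixed margin from every side removes it. The helper names `inset_box` and `ocr_without_border` and the default margin of 10 pixels are hypothetical and would need tuning for the actual image.

```python
def inset_box(width, height, margin):
    """Return a (left, upper, right, lower) box shrunk by margin on every side."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if 2 * margin >= min(width, height):
        raise ValueError("margin is too large for this image")
    return (margin, margin, width - margin, height - margin)


def ocr_without_border(path, margin=10):
    """OCR an image after trimming margin pixels from each edge.

    margin=10 is an assumed value; it must exceed the border thickness
    without cutting into the text.
    """
    # Imported lazily so inset_box can be used without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(path)
    box = inset_box(image.width, image.height, margin)
    return pytesseract.image_to_string(image.crop(box))
```

If the border is not near the edges, a fixed margin will not help, and detecting the text region first is the more robust route.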

Text detection algorithms in OpenCV:

Scene Text Detection

Another great tutorial
