
Extraction of text from the image

I am trying to extract text from an image using tesseract-ocr. [image]

Result from the first image:

[image: result 1]

Now this works perfectly fine on this image. [image]

Result from the second image:

[image]

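try:
    from PIL import Image  # Pillow
except ImportError:
    import Image  # fallback for the legacy standalone PIL package
import pytesseract

# Run OCR on the input image and print the recognized text
print(pytesseract.image_to_string(Image.open('input.png')))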

But it fails to read text from the first image. I have shown the results from the first image and the second image. The only difference I can spot between the two images is the box enclosing the whole first image.

I have also tried this using pdf-miner, with the same result. I cannot understand what exactly is happening. What could be the reason?

Tesseract works best on clean black text against a solid white background. It also does best when the text is approximately horizontal and at least 20 pixels tall, although I have seen it work on vertical text as well.
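These conditions can be approximated with a small preprocessing step before OCR. What follows is only a minimal sketch, assuming Pillow and pytesseract are installed. The helper names (`upscale_factor`, `preprocess_for_ocr`, `ocr_preprocessed`), the threshold of 128, and the need to supply an estimated text height yourself are illustrative assumptions, not part of the question or of Tesseract.

```python
import math


def upscale_factor(text_height_px, min_height_px=20):
    """Return the integer factor needed to bring text up to min_height_px."""
    if text_height_px <= 0:
        raise ValueError("text_height_px must be positive")
    return max(1, math.ceil(min_height_px / text_height_px))


def preprocess_for_ocr(path, text_height_px, threshold=128):
    """Grayscale, upscale, and binarize an image (hypothetical helper)."""
    from PIL import Image  # imported lazily; requires Pillow

    img = Image.open(path).convert("L")  # 8-bit grayscale
    factor = upscale_factor(text_height_px)
    if factor > 1:
        new_size = (img.width * factor, img.height * factor)
        img = img.resize(new_size, Image.LANCZOS)
    # Pixels at or above the threshold become white, the rest black.
    return img.point(lambda p: 255 if p >= threshold else 0)


def ocr_preprocessed(path, text_height_px):
    """Run Tesseract on the preprocessed image and return the text."""
    import pytesseract  # requires the Tesseract binary to be installed

    return pytesseract.image_to_string(preprocess_for_ocr(path, text_height_px))
```

For example, `ocr_preprocessed('input.png', text_height_px=12)` would double the image size before binarizing it. A fixed threshold is the simplest choice and may need tuning for images with uneven lighting.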

If the text has a surrounding border, the border may be detected as random text, which is what happens with your first image. You can either crop out the border or run a text detection algorithm before passing the image to Tesseract.
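The cropping option can be sketched as follows, assuming the border lies within a fixed number of pixels of the image edge. The helper names `inner_box` and `ocr_without_border` and the default margin of 10 pixels are hypothetical; the right margin depends on the actual image.

```python
def inner_box(width, height, margin):
    """Return the (left, upper, right, lower) box trimming margin px per side."""
    if margin < 0 or 2 * margin >= min(width, height):
        raise ValueError("margin must be non-negative and leave a non-empty image")
    return (margin, margin, width - margin, height - margin)


def ocr_without_border(path, margin=10):
    """Crop a fixed margin off every side, then run Tesseract on the rest."""
    from PIL import Image  # imported lazily; requires Pillow
    import pytesseract  # requires the Tesseract binary to be installed

    img = Image.open(path)
    cropped = img.crop(inner_box(img.width, img.height, margin))
    return pytesseract.image_to_string(cropped)
```

Calling `ocr_without_border('input.png', margin=10)` would then run OCR on the image with 10 pixels removed from each side. If the border position varies between images, a text detection step is likely the more robust choice.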

Text detection algorithms in OpenCV:

Scene Text Detection

Another great tutorial
