Extraction of text from the image

Question

I am trying to extract text from the image using tesseract-ocr.

Result from the first image:

The code below works perfectly fine on the second image.

Result from the second image:

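try:
    from PIL import Image  # Pillow
except ImportError:
    import Image  # fallback for the legacy standalone PIL package
import pytesseract

# Run Tesseract OCR on the image and print the recognized text
print(pytesseract.image_to_string(Image.open('input.png')))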

However, it fails to read the text from the first image. I have shown the results for both the first and the second image. The only difference I can spot between the two images is the box enclosing the whole first image.

I have also tried this with pdf-miner, and the result is the same. I cannot understand what exactly is happening. What could be the reason?

Answer 1

Tesseract works best with clean black text on a solid white background. It also works well when the text is approximately horizontal and at least 20 pixels tall, although I have seen it work with vertical text as well.
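As a rough illustration of those conditions, an image can be converted to grayscale, enlarged, and thresholded to pure black and white before OCR. This is only a sketch using Pillow: the helper name prepare_for_ocr and the default scale and threshold values are assumptions for illustration, not tuned settings, and a fixed threshold may not suit every image.

```python
from PIL import Image


def prepare_for_ocr(img, scale=2, threshold=128):
    """Grayscale, enlarge, and binarize an image (hypothetical helper)."""
    gray = img.convert("L")  # 8-bit grayscale
    width, height = gray.size
    # Enlarging helps when the text is shorter than roughly 20 pixels
    enlarged = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black
    return enlarged.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string in place of the original.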

If the text has a surrounding border, the border may be detected as random text, which is what happens with your first image. You can either crop out the border or run a text detection algorithm before passing the image to Tesseract.
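The cropping option could look something like the sketch below. It assumes the border lies within a fixed number of pixels from each edge. The helper names crop_border and ocr_without_border and the 10-pixel default margin are made up for illustration, so adjust the margin to your image.

```python
from PIL import Image


def crop_border(img, margin=10):
    """Return a copy of img with `margin` pixels trimmed from every side."""
    width, height = img.size
    if 2 * margin >= min(width, height):
        raise ValueError("margin is too large for this image")
    # crop() takes a (left, upper, right, lower) box
    return img.crop((margin, margin, width - margin, height - margin))


def ocr_without_border(path, margin=10):
    """Crop the border away, then run Tesseract on what is left."""
    # Imported here so crop_border stays usable without Tesseract installed
    import pytesseract
    with Image.open(path) as img:
        return pytesseract.image_to_string(crop_border(img, margin))

# Example: print(ocr_without_border('input.png'))
```

A fixed margin is the simplest approach. If the border position varies between images, a text detection step is probably the more robust choice.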

Text detection algorithms in OpenCV:

Scene Text Detection

Another great tutorial
