
How do I get the largest text in an image using tesseract in Python?

I am trying to extract the title of a PDF file. The file's metadata doesn't really help, so I am thinking of converting the first page of each PDF to an image and reading that image with Tesseract. I am assuming that the largest text found in the image is the title.

I open the PDF with fitz (PyMuPDF) and save the first page as an image.

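import fitz  # PyMuPDF

doc = fitz.open(filename)
page = doc.load_page(0)   # first page; named loadPage in older PyMuPDF releases
pix = page.get_pixmap()   # named getPixmap in older releases
pix.save("output.png")    # named writePNG in older releases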

Then I read the image file using OpenCV, pass it to Tesseract, and draw the bounding boxes it returns (image_to_boxes reports one box per character).

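import cv2
import pytesseract

filename = 'output.png'

img = cv2.imread(filename)
h, w, _ = img.shape

# Each output line has the form: "char left bottom right top page"
boxes = pytesseract.image_to_boxes(img) # also include any config options you use

for b in boxes.splitlines():
    b = b.split(' ')
    # Tesseract measures y from the bottom edge, OpenCV from the top, hence h - y
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

cv2.imshow(filename, img)
cv2.waitKey(0)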

I am not really familiar with OCR or Tesseract, so here is where I am stuck: how do I get the text with the largest bounding boxes?

My PDF files are mostly scientific papers and journal articles, so you get the idea of what they look like.

Thank you.

Tesseract normally returns the OCR result as a nested structure:

  • Block
    • Lines
      • Words
        • Chars (only in Tesseract 3; with Tesseract 4 you only get word boxes)

pytesseract.image_to_data returns each word's bounding box together with its block, line, and word indices.

My suggestion is to go through the words of each line and find the line with the largest average word height, which most probably is the title of the paper.
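A minimal sketch of that idea follows. It assumes the dict form of the output (output_type=pytesseract.Output.DICT), whose "text", "conf", "height", "block_num", "par_num", and "line_num" lists are parallel, one entry per detected element. The function names largest_text_line and title_from_image and the min_conf parameter are made up for this illustration and are not part of pytesseract.

```python
from collections import defaultdict


def largest_text_line(data, min_conf=0.0):
    """Return (text, avg_height) for the line with the tallest average word box.

    `data` is assumed to be the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    # (block, paragraph, line) -> list of (word, box height in pixels)
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        word = word.strip()
        # Rows for blocks/paragraphs/lines have empty text and conf == -1;
        # skip those, along with words below the confidence threshold.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((word, data["height"][i]))

    if not lines:
        return "", 0.0

    def avg_height(words):
        return sum(h for _, h in words) / len(words)

    best = max(lines.values(), key=avg_height)
    return " ".join(w for w, _ in best), avg_height(best)


def title_from_image(path):
    """Hypothetical convenience wrapper: OCR an image file and pick the line."""
    # Imported here so largest_text_line can be used without OpenCV/Tesseract.
    import cv2
    import pytesseract

    img = cv2.imread(path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    return largest_text_line(data)
```

Calling title_from_image('output.png') would return the candidate title and its average word height. Note that this picks a single line, so a title that wraps over several lines would need the neighbouring lines of similar height merged, and a stray large element (a journal logo, a drop cap) can win over the real title.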

Please refer to this answer to see how to get word boxes.
