I am trying to extract the title of a PDF file. The file's metadata doesn't really help, so I am thinking of converting the first page of each PDF to an image and reading that image with Tesseract. I can assume that the largest text found in the image is the title.
I read the PDF using fitz (PyMuPDF) and load the first page so that it can be saved as an image.
import fitz  # PyMuPDF

doc = fitz.open(filename)
# Older PyMuPDF releases spell these loadPage, getPixmap and writePNG
page = doc.load_page(0)
pix = page.get_pixmap()
pix.save("output.png")
Then I read the image file using OpenCV, pass it to Tesseract, and draw bounding boxes around the detected text.
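import cv2
import pytesseract

filename = 'output.png'
img = cv2.imread(filename)
h, w, _ = img.shape
# image_to_boxes returns one line per character: "char left bottom right top page"
boxes = pytesseract.image_to_boxes(img)  # also include any config options you use
for b in boxes.splitlines():
    b = b.split(' ')
    # Tesseract measures y from the bottom edge, OpenCV from the top, so flip y
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
cv2.imshow(filename, img)
cv2.waitKey(0)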
I am not really familiar with Tesseract OCR, so this is where I am stuck: how do I get the text with the largest bounding boxes?
My PDF files are mostly scientific papers and journal articles, so you get an idea of what they look like.
Thank you.
Normally, Tesseract returns the OCR result as a nested structure: page, block, paragraph, line, word.
Using pytesseract.image_to_data, you should get the block, paragraph, line, and word index of each detected word, along with its bounding box.
My suggestion is to go through the words of each line and find the line with the largest average word height, which most probably is the title of the paper.
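The suggestion above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for every paper layout: the helper name find_title and its min_conf parameter are made up for this example, and the input is assumed to be a dictionary shaped like the one returned by pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).

```python
from collections import defaultdict


def find_title(data, min_conf=0.0):
    """Return the text of the line whose words have the largest mean height.

    `data` is assumed to be a dict of equal-length lists, shaped like the
    result of pytesseract.image_to_data(..., output_type=Output.DICT).
    `find_title` and `min_conf` are hypothetical names for this sketch.
    """
    # Group words by (block, paragraph, line) so each key is one text line.
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Rows describing pages, blocks, paragraphs and lines carry no text
        # and a confidence of -1, so they are skipped here.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["height"][i], word))

    if not lines:
        return ""

    def mean_height(words):
        return sum(height for height, _ in words) / len(words)

    # The line with the tallest words on average is taken as the title.
    best = max(lines.values(), key=mean_height)
    return " ".join(word for _, word in best)
```

With a real page, you would pass in the dictionary from pytesseract.image_to_data for the rendered first page. Note that this picks a single line, so a title that wraps over several lines would need the neighbouring lines of similar height merged, and a large journal logo or drop cap could outrank the title.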
Please refer to this answer to see how to get the word boxes.