Extract Text from MediaBox - PDF

Question

I would like to extract specific text from a PDF, based on a crop box that I am creating.

In this first part of the code, I just open the PDF and grab the first page.

from PyPDF2 import PdfFileReader, PdfFileWriter
from pathlib import Path


pdf_path = (
    Path.home()
    / "Documents"
    / "XXX"
    / "XXX.pdf"
)

pdf = PdfFileReader(str(pdf_path))
numberpages = pdf.getNumPages()  # get the number of pages

first_page = pdf.getPage(0)

In this second part, I define the crop region (with coordinates based on the documentation) and apply it to the page's media box.

# 1 cm = 0.393700787 inches; 1 inch = 72 PDF points
inches = 0.393700787

# Coordinates in cm (x, y), measured from the bottom-left corner of the page

lowerLeft_in = (2,8.5)
lowerRight_in = (7,lowerLeft_in[1])
upperLeft_in = (lowerLeft_in[0],9.3)
upperRight_in = (lowerRight_in[0],upperLeft_in[1])

lowerLeft = tuple(ti1*72*inches for ti1 in lowerLeft_in)
lowerRight = tuple(ti2*72*inches for ti2 in lowerRight_in)
upperLeft = tuple(ti3*72*inches for ti3 in upperLeft_in)
upperRight = tuple(ti4*72*inches for ti4 in upperRight_in)

first_page.mediaBox.lowerLeft = lowerLeft
first_page.mediaBox.lowerRight = lowerRight
first_page.mediaBox.upperLeft = upperLeft
first_page.mediaBox.upperRight = upperRight

In this third part, I am saving this "cropped" part of the PDF to a new file.

pdf_writer = PdfFileWriter()
pdf_writer.addPage(first_page)
with Path("cropped.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)

If I open the new PDF in Adobe Reader, for example, it contains only the cropped part (which is correct).

However, if I read the file back to extract its text, the code extracts everything that was written on the original page, not only the cropped part.

pdf = PdfFileReader("cropped.pdf")
page = pdf.pages[0]
Newtext = page.extract_text()  # extract the text of the first page
print(Newtext)

Answer 1

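You didn't do anything incorrectly. Changing the media box in PyPDF2 only changes which region of the page a viewer displays. The page's content, including all of its text, is still in the file, so extract_text() returns everything.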

To get the text in a certain region, you can extract all the text with pdfminer and filter it by x and y position. You probably already have the values you need from the steps above.
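
A minimal sketch of that filtering approach, assuming the pdfminer.six package is installed. The names CM_TO_PT, cm_box_to_points, inside and text_in_region are hypothetical helpers made up for this illustration, not part of pdfminer. The sketch also assumes the region is given in PDF points measured from the bottom-left corner of the page, and that it is run against the original, uncropped file, so that the coordinates line up with the ones used in the question.

```python
from typing import Tuple

# PDF user-space units are 1/72 inch by default, and 1 inch = 2.54 cm.
CM_TO_PT = 72 / 2.54

# (x0, y0, x1, y1) with the origin at the bottom-left corner of the page.
Box = Tuple[float, float, float, float]


def cm_box_to_points(x0_cm: float, y0_cm: float, x1_cm: float, y1_cm: float) -> Box:
    """Convert a rectangle given in centimetres to PDF points (hypothetical helper)."""
    return (x0_cm * CM_TO_PT, y0_cm * CM_TO_PT, x1_cm * CM_TO_PT, y1_cm * CM_TO_PT)


def inside(bbox: Box, region: Box) -> bool:
    """Return True if bbox lies entirely within region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def text_in_region(pdf_path: str, region: Box, page_index: int = 0) -> str:
    """Collect the text lines of one page whose bounding box falls inside region."""
    # Imported here so the helpers above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    pieces = []
    for page_layout in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page_layout:
            # Skip figures, lines, rectangles and other non-text layout objects.
            if not isinstance(element, LTTextContainer):
                continue
            # A text box is made of text lines; filter at line granularity.
            for line in element:
                if isinstance(line, LTTextLine) and inside(line.bbox, region):
                    pieces.append(line.get_text())
    return "".join(pieces)
```

With the coordinates from the question, the call would look something like text_in_region(str(pdf_path), cm_box_to_points(2, 8.5, 7, 9.3)). The filter here keeps only lines that fall completely inside the rectangle. If lines that merely overlap the region should be kept as well, inside would have to be relaxed to an overlap test.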
