How to discard cropped text from a PDF

Question

I need to crop a PDF to extract some specific information from the document. Is there a way to crop a PDF so that only the text inside the cropped area is preserved and all the text outside the cropped area is discarded?

I have tried using pyPdf to crop it using the following code.

from pyPdf import PdfFileWriter, PdfFileReader

with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()
    print "document has %s pages." % numPages

    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)

The PDF itself gets cropped, but all the text of the uncropped PDF is still preserved. If I copy the content of the new PDF, the cropped (invisible) text is copied as well.

Answer 1

After playing around with your PDF and cropping, I figured out that it is not possible to crop and also delete the invisible data this way.

Basically, what cropping does is add a /CropBox [ 50 50 200 200 ] entry to the PDF, but the actual data still remains in the PDF.

Hint: Try to extract your data without cropping, perhaps with a lib like pdfminer or Ghostscript, or give PyPDF another try by extracting the text or getting the content boxes.
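
To make the hint concrete, here is a minimal sketch of the "extract without cropping" idea: read every text box together with its position, then keep only the boxes that lie inside the rectangle you would have cropped to. It assumes Python 3 and the pdfminer.six package (the original post uses Python 2 and pyPdf). The names TextBox, inside, text_in_crop and boxes_from_pdf are made up for this illustration, and the sketch assumes the page origin is at (0, 0) so that the crop coordinates and the pdfminer coordinates line up.

```python
from collections import namedtuple

# Hypothetical helper type: one piece of text plus its bounding box in PDF
# points (origin at the bottom-left of the page).
TextBox = namedtuple("TextBox", "x0 y0 x1 y1 text")


def inside(box, crop):
    """Return True if box lies entirely within crop = (x0, y0, x1, y1)."""
    cx0, cy0, cx1, cy1 = crop
    return box.x0 >= cx0 and box.y0 >= cy0 and box.x1 <= cx1 and box.y1 <= cy1


def text_in_crop(boxes, crop):
    """Keep only the text of the boxes that are fully inside the crop area."""
    return [box.text for box in boxes if inside(box, crop)]


def boxes_from_pdf(path, page_index=0):
    """Read the text boxes of one page with pdfminer.six.

    The import is done here so the pure-Python helpers above can be used
    (and tested) without pdfminer.six installed.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    boxes = []
    for index, layout in enumerate(extract_pages(path)):
        if index != page_index:
            continue
        for element in layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                boxes.append(TextBox(x0, y0, x1, y1, element.get_text()))
    return boxes
```

With the file from the question, something like text_in_crop(boxes_from_pdf("in.pdf"), (50, 50, 200, 200)) should give only the text inside the area that was used for the /CropBox. Boxes that straddle the border are dropped here; if you want them, relax the test in inside to an overlap check.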

solution1
0 2019-04-23 08:29:15
