
How to discard cropped text from a PDF

I need to crop a PDF to extract some specific information from the document. Is there a way to crop a PDF so that only the text inside the cropped area is preserved and all the text outside it is discarded?

I have tried using pyPdf to crop it using the following code.

from pyPdf import PdfFileWriter, PdfFileReader

with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()
    print "document has %s pages." % numPages

    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)

The PDF itself gets cropped, but all the text of the uncropped PDF is still preserved. If I copy all the content of the new PDF, the cropped (invisible) text is copied as well.

After playing around with your PDF and with cropping, I figured out that it is not possible to crop a page this way and also delete the invisible data.

Basically, cropping just adds a /CropBox [ 50 50 200 200 ] entry to the page in the PDF; the actual data still remains in the file.

Hint: Try to extract your data without cropping, for example with a library or tool such as pdfminer or Ghostscript, or give pyPdf another try by extracting the text or the text boxes directly.
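
As a rough sketch of that hint, the following keeps only the text whose bounding box lies inside a rectangle. It assumes the pdfminer.six package (not the setup from the question), and the helper names bbox_inside and text_in_box are made up for illustration. pdfminer groups text into boxes, so a box that straddles the edge of the rectangle is dropped as a whole here; treat this as a starting point, not a finished solution.

```python
def bbox_inside(bbox, crop):
    """Return True if bbox (x0, y0, x1, y1) lies entirely within crop."""
    x0, y0, x1, y1 = bbox
    cx0, cy0, cx1, cy1 = crop
    return x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1


def text_in_box(path, crop):
    """Return one string per page with the text found inside crop.

    Assumes pdfminer.six is installed; coordinates are in PDF points
    with the origin at the bottom-left corner of the page.
    """
    # Imported here so bbox_inside can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for layout in extract_pages(path):
        kept = [
            element.get_text()
            for element in layout
            # Keep only text containers that fit completely inside the rectangle.
            if isinstance(element, LTTextContainer)
            and bbox_inside(element.bbox, crop)
        ]
        pages.append("".join(kept))
    return pages
```

With the rectangle from the question, a call would look like text_in_box("in.pdf", (50, 50, 200, 200)). Filtering at the level of single lines or characters instead of whole text boxes would give a tighter result, at the cost of walking deeper into the layout tree.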
