I need to crop a PDF to extract some specific information from it. Is there a way to crop a PDF so that only the text inside the cropped area is preserved and all the text outside it is discarded?
I have tried cropping it with pyPdf using the following code:
    from pyPdf import PdfFileWriter, PdfFileReader

    with open("in.pdf", "rb") as in_f:
        input1 = PdfFileReader(in_f)
        output = PdfFileWriter()

        numPages = input1.getNumPages()
        print "document has %s pages." % numPages

        for i in range(numPages):
            page = input1.getPage(i)
            # Show the page size (upper-right corner of the media box)
            print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
            # Set the trim and crop boxes; the page content itself is not modified
            page.trimBox.lowerLeft = (25, 25)
            page.trimBox.upperRight = (225, 225)
            page.cropBox.lowerLeft = (50, 50)
            page.cropBox.upperRight = (200, 200)
            output.addPage(page)

        # Write while in.pdf is still open, since the page data is read from it
        with open("out.pdf", "wb") as out_f:
            output.write(out_f)
The PDF itself gets cropped, but all the text of the uncropped PDF is still preserved. If I copy all the content of the new PDF, the cropped (invisible) text is copied as well.
After playing around with your PDF and cropping, I figured out that it is not possible to crop and also delete the invisible data.
Basically, what cropping does is add a /CropBox [ 50 50 200 200 ] element to the PDF, but the actual data still remains in the PDF.
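To make that point concrete, here is a toy model in plain Python. It is not a real PDF parser: the `page` dictionary, its `content` list, and the helper names `crop`, `visible_text`, and `all_text` are all made up for illustration. It only shows the idea that setting a crop box changes what a viewer displays, while the text operations stay in the file.

```python
# Toy model of a PDF page (hypothetical structure, not a real PDF object).
page = {
    "/MediaBox": [0, 0, 612, 792],
    # Each entry stands for a text-drawing operation: (x, y, text)
    "content": [
        (72, 700, "Header outside the crop"),
        (100, 100, "Text inside the crop"),
    ],
}


def crop(page, box):
    """Mimic cropping: record a /CropBox and leave the content untouched."""
    cropped = dict(page)
    cropped["/CropBox"] = list(box)
    return cropped


def visible_text(page):
    """Text a viewer would show: only what falls inside the crop box."""
    x0, y0, x1, y1 = page.get("/CropBox", page["/MediaBox"])
    return [t for x, y, t in page["content"] if x0 <= x <= x1 and y0 <= y <= y1]


def all_text(page):
    """Text a copy/extract operation sees: everything in the content."""
    return [t for _, _, t in page["content"]]


cropped = crop(page, (50, 50, 200, 200))
print(visible_text(cropped))
print(all_text(cropped))
```

The second list still contains the header text, which mirrors what you see when you select and copy everything from out.pdf.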
Hint: Try to extract your data without cropping, perhaps with a library like pdfminer or ghostscript, or give PyPDF another try by extracting the text or getting the context boxes.
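A rough sketch of the "extract without cropping" idea, assuming Python 3 and the pdfminer.six package (neither is what the question uses, so treat this as an assumption). The helper names `inside`, `text_in_region`, and `extract_region` are made up for this example. It keeps only the text boxes whose bounding box lies completely inside the region you would otherwise have cropped to; the coordinates are in PDF points with the origin at the bottom-left of the page.

```python
def inside(bbox, region):
    """True if bbox (x0, y0, x1, y1) lies entirely within region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def text_in_region(boxes, region):
    """Keep the text of every (bbox, text) pair that is inside region."""
    return [text for bbox, text in boxes if inside(bbox, region)]


def extract_region(path, region):
    """Return, per page, the text boxes of the PDF at `path` inside `region`.

    Assumes pdfminer.six is installed; it is imported here so the two
    helpers above can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for page_layout in extract_pages(path):
        boxes = [
            (element.bbox, element.get_text())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        pages.append(text_in_region(boxes, region))
    return pages


if __name__ == "__main__":
    import os

    # Same rectangle as the cropBox in the question
    if os.path.exists("in.pdf"):
        for number, texts in enumerate(extract_region("in.pdf", (50, 50, 200, 200))):
            print(number, texts)
```

Requiring the whole box to be inside the region is a design choice; a text box that only partly overlaps the rectangle is dropped, so you may want to loosen the test in `inside` if your text straddles the border.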