
Crop a PDF with PyPDF2

I've been working on a project in which I extract table data from a PDF with a neural network. I successfully detect tables and get their coordinates (x, y, width, height). I've been trying to crop the PDF with PyPDF2 to isolate the table, but for some reason the crop never matches the desired outcome. After running inference I get these coordinates:

[[5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02, 9.9353129e-01]]

The fifth number is my neural network's precision; we can safely ignore it.

Trying them in pyplot works, so there's no problem with the coordinates themselves:

[Image: Matplot]

However, using the same coordinates in PyPDF2 is always off:

from PyPDF2 import PdfFileWriter, PdfFileReader

with open("mypdf.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()

    for i in range(numPages):
        page = input1.getPage(i)
        page.cropBox.upperLeft = (5.0948269e+01,1.5970685e+02)
        page.cropBox.upperLeft = (1.1579385e+03, 2.7092386e+02)
     
        
        output.addPage(page)
        with open("out.pdf", "wb") as out_f:
            output.write(out_f)

This is the output I get:

[Image: cropped PDF]

Am I missing something?

Thank you!

Here you go:

from PyPDF2 import PdfFileWriter, PdfFileReader

with open("mypdf.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()

    x, y, w, h = (5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02)

    page_x, page_y = input1.getPage(0).cropBox.getUpperLeft()
    upperLeft = [page_x.as_numeric(), page_y.as_numeric()] # convert PyPDF2.FloatObjects into floats
    new_upperLeft  = (upperLeft[0] + x, upperLeft[1] - y)
    new_lowerRight = (new_upperLeft[0] + w, new_upperLeft[1] - h)

    for i in range(numPages):
        page = input1.getPage(i)
        page.cropBox.upperLeft  = new_upperLeft
        page.cropBox.lowerRight = new_lowerRight

        output.addPage(page)

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)

Note: in PyPDF2 the origin of coordinates is at the lower-left corner of the page, and the Y-axis points from bottom to top, unlike on screen. So to get the PDF coordinate of the top edge of your crop area, subtract the y-coordinate of the crop area's top edge from the height of the page.
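To make the axis flip concrete, here is a minimal sketch of the conversion as a standalone function with no PyPDF2 dependency. The name `image_box_to_pdf_corners` and the `scale` parameter are hypothetical, not part of PyPDF2. `scale` is an assumption for the case where the detector ran on a rendered image whose pixel units differ from PDF points; leave it at 1.0 if the units already match.

```python
def image_box_to_pdf_corners(box, page_upper_left, scale=1.0):
    """Convert a top-left-origin (x, y, width, height) box into PDF corners.

    box             -- (x, y, w, h) measured from the top-left, y pointing down
    page_upper_left -- (x, y) of the page's upper-left corner in PDF units,
                       e.g. the values read from the page's crop box
    scale           -- assumed factor from detector units to PDF units
    Returns ((left, top), (right, bottom)) with y pointing up, as PDF expects.
    """
    x, y, w, h = (value * scale for value in box)
    page_left, page_top = page_upper_left

    left = page_left + x
    top = page_top - y      # moving down on screen lowers the PDF y value
    right = left + w
    bottom = top - h
    return (left, top), (right, bottom)


# Example: a 300x100 box placed 50 right and 160 down on a page 792 units tall.
upper_left, lower_right = image_box_to_pdf_corners((50, 160, 300, 100), (0, 792))
print(upper_left, lower_right)  # (50.0, 632.0) (350.0, 532.0)
```

The two returned tuples can be assigned to `page.cropBox.upperLeft` and `page.cropBox.lowerRight`, as in the code above.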

