Cropping pages of a .pdf file
I was wondering if anyone had any experience working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size.
After a quick Google search I found the pyPdf library for Python, but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object, the results were not what I had expected and appeared to be quite random.
Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.
pyPdf does what I expect in this area. Using the following script:
#!/usr/bin/python
#
from pyPdf import PdfFileWriter, PdfFileReader

with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()
    print "document has %s pages." % numPages

    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
The resulting document has a trim box that is 200x200 points and starts at 25,25 points inside the media box. The crop box is 25 points inside the trim box.
Here is how my sample document looks in Acrobat Professional after processing with the above code:
This document will appear blank when loaded in Acrobat Reader.
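The box geometry described above can be checked with plain arithmetic before touching a PDF. The following is only a minimal sketch: box_size and contains are hypothetical helper names, not part of pyPdf, and each box is represented as a pair of (x, y) corners in points.

```python
def box_size(lower_left, upper_right):
    """Return (width, height) of a box from its two defining corners."""
    return (upper_right[0] - lower_left[0], upper_right[1] - lower_left[1])


def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer`.

    Each box is a ((llx, lly), (urx, ury)) pair in PDF points.
    """
    (o_llx, o_lly), (o_urx, o_ury) = outer
    (i_llx, i_lly), (i_urx, i_ury) = inner
    return o_llx <= i_llx and o_lly <= i_lly and i_urx <= o_urx and i_ury <= o_ury


# Corner values taken from the script above.
trim_box = ((25, 25), (225, 225))
crop_box = ((50, 50), (200, 200))

print(box_size(*trim_box))           # (200, 200)
print(box_size(*crop_box))           # (150, 150)
print(contains(trim_box, crop_box))  # True
```

A check like this makes it easier to spot a crop box that accidentally extends past the trim box or has its corners swapped.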
Use this to get the dimensions of the PDF:
from PyPDF2 import PdfFileWriter,PdfFileReader,PdfFileMerger
pdf_file = PdfFileReader(open("/Users/user.name/Downloads/sample.pdf","rb"))
page = pdf_file.getPage(0)
print(page.cropBox.getLowerLeft())
print(page.cropBox.getLowerRight())
print(page.cropBox.getUpperLeft())
print(page.cropBox.getUpperRight())
After this, get the page reference and then apply the crop:
page.mediaBox.lowerRight = (lower_right_new_x_coordinate, lower_right_new_y_coordinate)
page.mediaBox.lowerLeft = (lower_left_new_x_coordinate, lower_left_new_y_coordinate)
page.mediaBox.upperRight = (upper_right_new_x_coordinate, upper_right_new_y_coordinate)
page.mediaBox.upperLeft = (upper_left_new_x_coordinate, upper_left_new_y_coordinate)
#for example :- my custom coordinates
#page.mediaBox.lowerRight = (611, 500)
#page.mediaBox.lowerLeft = (0, 500)
#page.mediaBox.upperRight = (611, 700)
#page.mediaBox.upperLeft = (0, 700)
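Since a rectangle is fully determined by its left, bottom, right and top edges, the four corner tuples above can be derived from those four numbers instead of being typed by hand. This is a small sketch in plain Python; rect_corners is a hypothetical helper, not a PyPDF2 function.

```python
def rect_corners(left, bottom, right, top):
    """Return the four corner points of a rectangle, keyed by the
    attribute names used in the snippet above."""
    return {
        "lowerLeft": (left, bottom),
        "lowerRight": (right, bottom),
        "upperLeft": (left, top),
        "upperRight": (right, top),
    }


# The custom coordinates from the commented example above.
corners = rect_corners(left=0, bottom=500, right=611, top=700)
for name, point in corners.items():
    print(name, point)
```

Each printed pair matches one of the commented assignments, so the values could then be assigned to the corresponding page.mediaBox attributes one by one.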
Thanks for all the answers above.
Step 1. Run the following code to get (x1, y1).
from PyPDF2 import PdfFileWriter, PdfFileReader
input = PdfFileReader(open("test.pdf","rb"))
page = input.getPage(0)
print(page.cropBox.getUpperRight())
Step 2. View the PDF file in full-screen mode.
Step 3. Capture the screen as an image file, screen.jpg.
Step 4. Open screen.jpg in MS Paint or GIMP. These applications show the coordinates of the cursor.
Step 5. Note the following coordinates: (x2, y2), (x3, y3), (x4, y4) and (x5, y5), where (x4, y4) and (x5, y5) determine the rectangle you want to crop.
Step 6. Get page.cropBox.upperLeft and page.cropBox.lowerRight from the following formulas. Here is a tool for calculating.
page.cropBox.upperLeft = (x1*(x4-x2)/(x3-x2),(1-y4/y3)*y1)
page.cropBox.lowerRight = (x1*(x5-x2)/(x3-x2),(1-y5/y3)*y1)
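The two formulas can be wrapped in a small function so the arithmetic is done once. This is a sketch under an assumption the answer does not spell out: x2 and x3 are taken to be the screenshot x-coordinates of the page's left and right edges, and y3 the screenshot y-coordinate of the page's bottom edge, with the page top at y = 0. The name screen_to_pdf and all sample numbers are hypothetical.

```python
def screen_to_pdf(x, y, x1, y1, x2, x3, y3):
    """Map a screenshot pixel (x, y) to PDF points using the Step 6 formulas.

    x1, y1: page size in points (from page.cropBox.getUpperRight()).
    x2, x3: assumed left and right page edges in the screenshot, in pixels.
    y3:     assumed bottom page edge in the screenshot, in pixels.
    """
    pdf_x = x1 * (x - x2) / (x3 - x2)
    pdf_y = (1 - y / y3) * y1  # screen y grows downward, PDF y grows upward
    return (pdf_x, pdf_y)


# Hypothetical measurements: a 600x800 pt page shown 750 px wide and 1000 px tall.
x1, y1 = 600, 800
x2, x3, y3 = 100, 850, 1000

upper_left = screen_to_pdf(250, 200, x1, y1, x2, x3, y3)   # from (x4, y4)
lower_right = screen_to_pdf(700, 900, x1, y1, x2, x3, y3)  # from (x5, y5)
print(upper_left, lower_right)
```

With these sample numbers the upper-left corner comes out at roughly (120, 640) points and the lower-right at roughly (480, 80) points, up to floating-point rounding.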
Step 7. Run the following code to crop the PDF file.
from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input = PdfFileReader(open('test.pdf', 'rb'))
n = input.getNumPages()
for i in range(n):
    page = input.getPage(i)
    # Example values; substitute the coordinates computed in Step 6.
    page.cropBox.upperLeft = (100,200)
    page.cropBox.lowerRight = (300,400)
    output.addPage(page)
outputStream = open('result.pdf','wb')
output.write(outputStream)
outputStream.close()
You can convert the PDF to PostScript (for example with pdf2ps or pdftops) and then use text processing on the PostScript file. After that you can convert the output back to PDF (pstopdf or ps2pdf).
This works nicely if the PDFs you want to process are all generated by the same application and are somewhat similar. If they come from different sources it is usually too hard to process the PostScript files, because the structure varies too much. But even then you might be able to fix page sizes and the like with a few regular expressions.
The Acrobat JavaScript API has a setPageBoxes method, but Adobe doesn't provide any Python code samples, only C++, C# and VB.
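from PIL import Image

def ImageCrop():
    # Crops a page that has already been exported as an image.
    img = Image.open("page_1.jpg")
    # Crop box in pixels, measured from the top-left corner of the image.
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # outfile4 was undefined in the original snippet; this path is a placeholder.
    outfile4 = "page_1_cropped.jpg"
    img_res.save(outfile4, 'JPEG')

ImageCrop()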