Merge PDF files

Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs? (My report generation always creates an extra blank page.)

You can use PyPDF2's PdfMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead of file paths if you want.

File Merging

If you want more fine-grained control of merging, there is a merge method of the PdfMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.

e.g.

merger.merge(2, pdf)

Here we insert the whole PDF into the output, but at page 2.

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).

e.g.

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1, 3, 5

If you specify an invalid range you will get an IndexError.
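The question also asks about dropping the extra blank page that each report ends with. A minimal sketch of how the pages argument could be used for that, assuming the unwanted page is always the last one and a PyPDF2 release that provides PdfMerger and PdfReader. The names all_but_last and merge_without_last_page are made up for this illustration:

```python
def all_but_last(page_count):
    """Return a (start, stop) range covering every page except the last."""
    return (0, max(page_count - 1, 0))


def merge_without_last_page(paths, out_path):
    """Append each PDF to the output, skipping its final page."""
    # Imported here so the helper above can be tried without PyPDF2 installed.
    from PyPDF2 import PdfMerger, PdfReader

    merger = PdfMerger()
    try:
        for path in paths:
            # Count the pages first, then append everything but the last one.
            page_count = len(PdfReader(path).pages)
            merger.append(path, pages=all_but_last(page_count))
        merger.write(out_path)
    finally:
        merger.close()
```

With this range, a file that has only one page contributes nothing to the output, which may or may not be what you want.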

Note also that, to avoid files being left open, the PdfFileMerger's close method should be called once the merged file has been written. This ensures all files (input and output) are closed in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, so that we could use the with keyword, avoid the explicit close call and get some easy exception safety.
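One way to get similar exception safety without relying on the merger being a context manager is contextlib.closing from the standard library, which calls close() when the with block exits. This is only a sketch: merge_with_closing and its merger_factory parameter are hypothetical names introduced here, and the default assumes a PyPDF2 release that provides PdfMerger.

```python
from contextlib import closing


def merge_with_closing(paths, out_path, merger_factory=None):
    """Merge paths into out_path, closing the merger even if an error occurs."""
    if merger_factory is None:
        # Imported lazily; any object with append/write/close would work.
        from PyPDF2 import PdfMerger as merger_factory

    # closing() guarantees merger.close() runs when the block is left.
    with closing(merger_factory()) as merger:
        for path in paths:
            merger.append(path)
        merger.write(out_path)
```

Calling merge_with_closing(['file1.pdf', 'file2.pdf'], 'result.pdf') would then behave like the concatenation example, with the close call handled automatically.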

You might also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.

The PyPDF2 GitHub repository also includes some example code demonstrating merging.

PyMuPdf

Another library perhaps worth a look is PyMuPdf. Merging is equally simple.

From the command line:

python -m fitz join -o result.pdf file1.pdf file2.pdf file3.pdf

and from code:

import fitz

result = fitz.open()

for pdf in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    with fitz.open(pdf) as mfile:
        result.insertPDF(mfile)
    
result.save("result.pdf")

With plenty of options, detailed in the project's wiki.

Use pyPdf or its successor PyPDF2:

A pure-Python library built as a PDF toolkit. It is capable of:

  • splitting documents page by page,
  • merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3, binary data has to be written to sys.stdout.buffer.
    pdf_cat(sys.argv[1:], getattr(sys.stdout, "buffer", sys.stdout))

Merge all PDF files that are present in a directory

Put the PDF files in a directory. Launch the program. You get one PDF with all the PDFs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)
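os.listdir returns names in arbitrary order, so the page order of the merged file is not guaranteed. A small sketch of one way to make the order predictable: sort the names "naturally" so that file10.pdf comes after file2.pdf, and skip the output file so a second run does not merge result.pdf into itself. The helper names natural_key and sorted_pdfs are assumptions made for this example.

```python
import os
import re


def natural_key(name):
    """Split a name into text and number chunks so 'file10' sorts after 'file2'."""
    return [int(chunk) if chunk.isdecimal() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def sorted_pdfs(directory=".", exclude=("result.pdf",)):
    """List the PDFs in directory in natural order, skipping the output file."""
    return sorted(
        (name for name in os.listdir(directory)
         if name.lower().endswith(".pdf") and name not in exclude),
        key=natural_key,
    )
```

The list returned by sorted_pdfs() could replace x in the loop above.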

How I would write the same code today

from glob import glob
from PyPDF2 import PdfFileMerger



def pdf_merge():
    ''' Merges all the pdf files in current directory '''
    merger = PdfFileMerger()
    # Append every PDF found in the current directory.
    for pdf in glob("*.pdf"):
        merger.append(pdf)
    with open("Merged_pdfs.pdf", "wb") as new_file:
        merger.write(new_file)


if __name__ == "__main__":
    pdf_merge()

The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.

The relevant part of the concatenation script (assumes inputs is a list of input filenames, and outfn is an output file name):

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])

Disclaimer: I am the primary pdfrw author.

Is it possible, using Python, to merge separate PDF files?

Yes.

The following example merges all files in one folder to a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

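    for pdffile in glob(os.path.join(path, '*.pdf')):
        # Compare absolute paths so the output file is skipped on re-runs.
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue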
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

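for pdf in pdfs:
    # list_files yields bare file names, so join them with the directory.
    merger.append(open(os.path.join(dir_path, pdf), 'rb'))

# Write the result next to the script, where the browser call below expects it.
with open(os.path.join(dir_path, 'result.pdf'), 'wb') as fout:
    merger.write(fout)

webbrowser.open_new('file://'+ dir_path + '/result.pdf')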

Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git

This post gives a solution: http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/

Similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # Copy every page of the input reader into the output writer.
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()

# open() replaces the Python 2-only file() builtin.
append_pdf(PdfFileReader(open("C:\\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample3.pdf", "rb")), output)

output.write(open("c:\\combined.pdf", "wb"))

You can use pikepdf too (source code, documentation).

Example code could be (taken from the documentation):

from glob import glob

from pikepdf import Pdf

pdf = Pdf.new()

for file in glob('*.pdf'):  # you can change this to browse directories recursively
    with Pdf.open(file) as src:
        pdf.pages.extend(src.pages)

pdf.save('merged.pdf')
pdf.close()

If you want to exclude pages, you might proceed another way, for instance by copying pages to a new PDF (you can then select which ones you do not copy, since the pdf.pages object behaves like a list).

It is still actively maintained, which, as of February 2022, does not seem to be the case for either PyPDF2 or pdfrw.

I haven't benchmarked it, so I don't know if it is quicker or slower than other solutions.

One advantage over PyMuPDF, in my case, is that an official Ubuntu package is available (python3-pikepdf), which is practical for packaging my own software that depends on it.
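To make the page-exclusion idea for pikepdf concrete, here is a rough sketch that copies every page except the last from each source file. It assumes the unwanted page is always the final one. The names drop_last and merge_dropping_last_page are invented for this example, and the source files are deliberately kept open until the merged file has been saved, which is a cautious choice rather than a documented requirement.

```python
from contextlib import ExitStack


def drop_last(pages):
    """Return all items except the final one."""
    return list(pages)[:-1]


def merge_dropping_last_page(paths, out_path):
    # Imported lazily so drop_last can be tried without pikepdf installed.
    from pikepdf import Pdf

    with ExitStack() as stack:
        merged = stack.enter_context(Pdf.new())
        for path in paths:
            # Keep each source open until the merged file is saved.
            src = stack.enter_context(Pdf.open(path))
            merged.pages.extend(drop_last(src.pages))
        merged.save(out_path)
```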

A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

You can use PdfFileMerger from the PyPDF2 module.

For example, to merge multiple PDF files from a list of paths you can use the following function:

from PyPDF2 import PdfFileMerger

# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list[str]):
    merger = PdfFileMerger()
    
    for pdf in extracted_files:
        merger.append(pdf)

    merger.write(out_path)
    merger.close()

merge_pdf('./final.pdf', extracted_files)

And this function gets all the files recursively from a parent folder:

import os

# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
    target_files = []
    for path, subdirs, files in os.walk(parent_folder):
        for name in files:
            target_files.append(os.path.join(path, name))
    return target_files 

# get a list of all the paths of the pdf
extracted_files = fetch_all_files('./parent_folder')
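Note that fetch_all_files returns every file it finds, not only PDFs, so a stray text file or image would be handed to the merger. A possible variant that keeps only PDF files and returns them in sorted order, sketched with pathlib; the name fetch_pdf_files is hypothetical:

```python
from pathlib import Path


def fetch_pdf_files(parent_folder):
    """Return the paths of all PDF files under parent_folder, sorted."""
    return sorted(
        str(path)
        for path in Path(parent_folder).rglob("*")
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```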

Finally, use the two functions together, declaring a parent_folder_path that can contain multiple documents and an output_pdf_path for the destination of the merged PDF:

# get a list of all the paths of the pdf
parent_folder_path = './parent_folder'
output_pdf_path    = './final.pdf'

extracted_files = fetch_all_files(parent_folder_path)
merge_pdf(output_pdf_path, extracted_files)

You can get the full code from here (Source): How to merge PDF documents using Python

I used pdfunite on the Linux terminal by leveraging subprocess (this assumes one.pdf and two.pdf exist in the directory); the aim is to merge them into three.pdf.

import subprocess
# Pass the arguments as a list so that no shell is needed.
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])

The answer from Giovanni G. PY in an easily usable way (at least for me):

import os
from PyPDF2 import PdfFileMerger

def merge_pdfs(export_dir, input_dir, folder):
    current_dir = os.path.join(input_dir, folder)
    pdfs = os.listdir(current_dir)
    
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(open(os.path.join(current_dir, pdf), 'rb'))

    with open(os.path.join(export_dir, folder + ".pdf"), "wb") as fout:
        merger.write(fout)

export_dir = r"E:\Output"
input_dir = r"E:\Input"
folders = os.listdir(input_dir)
for folder in folders:
    merge_pdfs(export_dir, input_dir, folder)
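The loop above treats every entry of input_dir as a folder and every file inside it as a PDF. A small sketch of a pre-filtering step that could guard against that: it collects only real subfolders and only their .pdf files, in sorted order. The function name pdfs_by_folder is an assumption for this illustration.

```python
import os


def pdfs_by_folder(input_dir):
    """Map each immediate subfolder name to a sorted list of its PDF paths."""
    groups = {}
    for entry in sorted(os.listdir(input_dir)):
        folder = os.path.join(input_dir, entry)
        if not os.path.isdir(folder):
            continue  # ignore loose files next to the subfolders
        pdfs = sorted(
            os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.lower().endswith(".pdf")
        )
        if pdfs:
            groups[entry] = pdfs
    return groups
```

Each list in the result can then be appended to its own merger, producing one output file per subfolder.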

Here's a time comparison of the most common answers for my specific use case: combining a list of 5 large single-page PDF files. I ran each test twice.

(Disclaimer: I ran this function within Flask; your mileage may vary.)

TL;DR

pdfrw is the fastest library for combining PDFs out of the 3 I tested.

PyPDF2

start = time.time()
merger = PdfFileMerger()
for pdf in all_pdf_obj:
    merger.append(os.path.join(os.getcwd(), pdf.filename))  # full path
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
merge_file = os.path.join(os.getcwd(), formatted_name)
merger.write(merge_file)
merger.close()
end = time.time()
print(end - start) #1 66.50084733963013 #2 68.2995400428772

PyMuPDF

start = time.time()
result = fitz.open()

for pdf in all_pdf_obj:
    with fitz.open(os.path.join(os.getcwd(), pdf.filename)) as mfile:
        result.insertPDF(mfile)
formatted_name = f'Summary_Invoice_{date.today()}.pdf'

result.save(formatted_name)
end = time.time()
print(end - start) #1 2.7166640758514404 #2 1.694727897644043

pdfrw

start = time.time()
result = fitz.open()

writer = PdfWriter()
for pdf in all_pdf_obj:
    writer.addpages(PdfReader(os.path.join(os.getcwd(), pdf.filename)).pages)

formatted_name = f'Summary_Invoice_{date.today()}.pdf'
writer.write(formatted_name)
end = time.time()
print(end - start) #1 0.6040127277374268 #2 0.9576816558837891
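If you want to repeat this kind of comparison on your own files, the start/end bookkeeping can be pulled into one helper. This is a sketch and not the code used for the numbers above. The name time_call is made up, and time.perf_counter is used here because it is intended for measuring short durations.

```python
import time


def time_call(func, *args, repeats=2, **kwargs):
    """Call func repeats times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings
```

Each merge variant would be wrapped in its own function and passed to time_call, so that all libraries are measured the same way.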

Use the right Python interpreter:

conda activate py_envs

pip install PyPDF2

Python code:

from PyPDF2 import PdfMerger

#set path files
import os
os.chdir('/ur/path/to/folder/')
cwd = os.path.abspath('')
files = os.listdir(cwd)

def merge_pdf_files():
    merger = PdfMerger()
    pdf_files = [x for x in files if x.endswith(".pdf")]
    [merger.append(pdf) for pdf in pdf_files]
    with open("merged_pdf_all.pdf", "wb") as new_file:
        merger.write(new_file)

if __name__ == "__main__":
    merge_pdf_files()

def pdf_merger(path):
    """Merge the pdfs into one pdf"""

    import logging
    logging.basicConfig(filename='output.log', level=logging.DEBUG, format='%(asctime)s %(levelname)s %(message)s')

    try:
        import glob, os
        import PyPDF2

        os.chdir(path)

        pdfs = []

        for file in glob.glob("*.pdf"):
            pdfs.append(file)

        if len(pdfs) == 0:
            logging.info("No pdf in the given directory")

        else:
            merger = PyPDF2.PdfFileMerger()

            for pdf in pdfs:
                merger.append(pdf)

            merger.write('result.pdf')
            merger.close()

    except Exception as e:
        logging.error('Error has happened')
        logging.exception('Exception occurred' + str(e))
