Python PyPDF2: joining pages

Question

I have a PDF with a large table split across several pages, so I need to join the per-page tables into one big table on a single large page.

Is this possible with PyPDF2 or another library?

Cheers

Answer 1

I am just working on something similar: it takes an input PDF, and via a config file you can set the final layout of the single pages. It is implemented with PyPDF2, but it still has issues with some PDF files (I have to dig deeper). https://github.com/Lageos/pdf-stitcher

In principle, adding a page to the right of another one works like this:

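import PyPDF2

# NOTE: PdfFileReader, getPage, mergeTranslatedPage, etc. belong to the
# older camelCase PyPDF2 API (releases before 3.0).

# zero-based indices of the two pages to join (example values)
page_number = 0
second_page_number = 1

with open('input.pdf', 'rb') as input_file:
    # load input pdf
    input_pdf = PyPDF2.PdfFileReader(input_file)

    # first page: the PageObject that the second page is merged onto
    output_pdf = input_pdf.getPage(page_number)

    # get second page PyPDF2 PageObject
    second_pdf = input_pdf.getPage(second_page_number)

    # offset for the second page: the right edge of the first page
    offset_x = output_pdf.mediaBox[2]
    offset_y = 0

    # add second page to the right of the first one, expanding the page to fit
    output_pdf.mergeTranslatedPage(second_pdf, offset_x, offset_y, expand=True)

    # write finished pdf (while the input file is still open)
    with open('output.pdf', 'wb') as out_file:
        write_pdf = PyPDF2.PdfFileWriter()
        write_pdf.addPage(output_pdf)
        write_pdf.write(out_file)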

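Stacking pages vertically needs an offset_y instead of an offset_x. The amount is the page height, offset_y = output_pdf.mediaBox[3]. Note that PDF coordinates start at the bottom-left corner, so a positive offset_y places the merged page above the base page; to get the second page below the first, merge the first page onto the second one, offset by the second page's height.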
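When more than two pages have to be stacked, it can help to work out all offsets up front. The following is only a minimal sketch in plain Python, not part of PyPDF2 or pdf-stitcher: the helper name stack_offsets and the example page sizes are assumptions made for illustration. It computes, for pages stacked top to bottom, the size of the combined page and the (x, y) translation of each page in PDF coordinates (origin at the bottom-left).

```python
def stack_offsets(page_sizes):
    """Hypothetical helper: layout for stacking pages top to bottom.

    page_sizes is a list of (width, height) tuples in reading order.
    Returns ((total_width, total_height), [(tx, ty), ...]) where each
    (tx, ty) is the translation of that page's lower-left corner.
    """
    total_width = max(width for width, _ in page_sizes)
    total_height = sum(height for _, height in page_sizes)

    offsets = []
    top = total_height  # y coordinate of the top edge of the next page
    for _, height in page_sizes:
        top -= height   # lower edge of this page
        offsets.append((0, top))
    return (total_width, total_height), offsets


# Example values (assumed): two full pages and one shorter last page.
size, offsets = stack_offsets([(595, 842), (595, 842), (595, 400)])
print(size)     # (595, 2084)
print(offsets)  # [(0, 1242), (0, 400), (0, 0)]
```

The last page ends up at offset (0, 0), so it could serve as the base page, with the earlier pages merged onto it using their computed offsets.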

Answer 2

My understanding is that this is quite hard. See here and here.

The problem seems to be that tables are not really represented as tables in PDFs; they are simply drawn from absolutely positioned lines (see the first link above).

Here are two possible workarounds (not sure if they will work for you):

1. You could print multiple pages on one page and scale the page to keep it readable.
2. Open the PDF with Inkscape or something similar. Once ungrouped, you should have access to the individual elements that make up the tables and be able to combine them in the way that suits you.
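For the first workaround, the scale factor needed to fit several stacked pages onto one sheet is simple arithmetic. This is a rough sketch only; the function name fit_scale and the page dimensions are assumptions for illustration, not something taken from a PDF library.

```python
def fit_scale(content_w, content_h, target_w, target_h):
    """Hypothetical helper: largest uniform scale that fits the content
    (content_w x content_h) inside the target page (target_w x target_h)."""
    return min(target_w / content_w, target_h / content_h)


# Example values (assumed): two 595 x 842 pages stacked vertically,
# shrunk to fit back onto a single 595 x 842 page.
content_w, content_h = 595, 2 * 842
scale = fit_scale(content_w, content_h, 595, 842)

# Horizontal offset that centres the scaled content on the target page.
offset_x = (595 - content_w * scale) / 2

print(scale)     # 0.5
print(offset_x)  # 148.75
```

A scale of 0.5 halves the text size, which illustrates why this workaround quickly becomes hard to read for tables spanning many pages.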

EDIT

Have a look at LibreOffice Draw, another vector package. I just opened a PDF in it, and it seems to preserve some of the PDF structure and allow editing of individual elements.

EDIT 2

Have a look at pdftables, which might help.

PDFTables helps with extracting tables from PDF files.

I haven't tried it though... I might have some time a bit later to see if I can get it to work.


Answer 1
1 2016-01-07 02:58:19

Answer 2
0 2014-07-08 04:01:38
