How to split a PDF in Google Cloud Storage using Python

Question

I have a single PDF and would like to create a separate PDF for each of its pages. How can I do this without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify the different documents, and that would be ideal), but it is not publicly available.

I am currently using PyPDF2 to do this:

    list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
    print(len(list_of_blobs))
    list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
    
    inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))

    individual_files = []
    stream = io.StringIO()
    
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        individual_files.append(output)
        with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
            outputStream.write(stream.getvalue())
            #print(outputStream.read())
            with open(outputStream.name, 'rb') as f:
                data = f.seek(85)
                data = f.read()
                individual_files.append(data)
                bucket.blob('processed/' +  "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')

In the output, I see different PyPDF2 objects such as <PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>, but I have no idea how to proceed from there. I am also open to using other libraries if they work better.

Answer 1

There were two reasons why my program was not working:

I was trying to read a file that was opened in append mode (I fixed this by moving the second with open(...) block outside of the first one).
I should have been writing bytes (I fixed this by changing the open mode from 'a' to 'wb').

Below is the corrected code:

    # Uses the legacy PyPDF2 (< 3.0) API, as in the question.
    # `inputpdf`, `bucket`, and `prefix` are assumed to be defined earlier in the program.
    if inputpdf.numPages > 2:
        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            # Write the single-page PDF in binary mode.
            with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
                output.write(outputStream)
            # Reopen the finished file for reading, outside the write block.
            with open(outputStream.name, 'rb') as f:
                data = f.read()
            bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
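
One way to see why moving the read outside of the write block matters is the small experiment below. It is only a sketch using the standard library; the function name and the scratch file are hypothetical and not part of the original program. While the writer is still open, its buffer may not have been flushed, so a second handle can see an empty or incomplete file.

```python
import os
import tempfile


def read_inside_and_after(payload):
    """Write `payload` in 'wb' mode; read it back inside and after the with block."""
    # Hypothetical scratch file, used only for this demonstration.
    path = os.path.join(tempfile.mkdtemp(), "page.pdf")
    with open(path, "wb") as out:
        out.write(payload)
        # The writer is still open here, so its buffer may not be flushed yet
        # and this read can come back empty or incomplete.
        with open(path, "rb") as f:
            inside = f.read()
    # The writer is closed now, so everything has reached the file.
    with open(path, "rb") as f:
        after = f.read()
    return inside, after


inside, after = read_inside_and_after(b"%PDF-1.4 example bytes")
print(len(inside), len(after))
```

Only the second read is guaranteed to return the complete payload, which matches the fix described in this answer.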

Answer 2

To split a PDF file into several smaller files (one per page), you need to download its data. You can either materialize the data in a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.

In both cases:

The data will reside in memory.
You need to fetch the data in order to perform the PDF split.
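
The "keep it in memory" option could look roughly like the sketch below. It assumes the google-cloud-storage methods `download_as_bytes` and `upload_from_string`, and the newer PyPDF2 names (`PdfReader`, `PdfWriter`, `add_page`) rather than the legacy ones used in the question. The function names, the `splitter` parameter, and the object naming scheme are hypothetical.

```python
import io


def split_pdf_bytes(pdf_bytes):
    """Return a list of bytes objects, one single-page PDF per input page."""
    # Imported lazily; assumes a PyPDF2 release that provides these names.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    pages = []
    for page in reader.pages:
        writer = PdfWriter()
        writer.add_page(page)
        buf = io.BytesIO()  # in-memory stand-in for a file in /tmp
        writer.write(buf)
        pages.append(buf.getvalue())
    return pages


def split_blob_in_memory(bucket, source_name, dest_prefix, splitter=split_pdf_bytes):
    """Download one object, split it, and upload each page; return the new names."""
    data = bucket.blob(source_name).download_as_bytes()
    names = []
    for i, page_bytes in enumerate(splitter(data), start=1):
        name = "%s/page-%d.pdf" % (dest_prefix, i)  # hypothetical naming scheme
        bucket.blob(name).upload_from_string(page_bytes, content_type="application/pdf")
        names.append(name)
    return names
```

A call such as `split_blob_in_memory(bucket, "tmp/input.pdf", "processed")` (with a hypothetical object name) would then avoid touching the local file system, although the whole PDF still sits in memory while it is processed.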

If you absolutely want to read the data as a stream (I don't know whether that is possible with the PDF format!), you can use the streaming feature of GCS. However, because there is no CRC check on the downloaded data, I would not recommend this solution unless you are ready to handle corrupted data, retries, and all the related issues.
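
For illustration only, a streaming read might be sketched as follows. It assumes that `Blob.open("rb")` in google-cloud-storage returns a seekable file-like object and that the PDF library accepts such a stream. The function name and the `reader_cls` / `writer_cls` parameters are hypothetical, and the integrity caveat above still applies.

```python
import io


def iter_pages_from_stream(blob, reader_cls=None, writer_cls=None):
    """Yield each page of the PDF behind `blob` as a standalone PDF (bytes)."""
    if reader_cls is None or writer_cls is None:
        # Imported lazily; assumes a PyPDF2 release that provides these names.
        from PyPDF2 import PdfReader, PdfWriter
        reader_cls, writer_cls = PdfReader, PdfWriter

    # The object is read through a file-like stream instead of one full download.
    with blob.open("rb") as stream:
        reader = reader_cls(stream)
        for page in reader.pages:
            writer = writer_cls()
            writer.add_page(page)
            buf = io.BytesIO()
            writer.write(buf)
            yield buf.getvalue()
```

Each yielded value could then be uploaded with `upload_from_string`, as in the first answer. Whether this saves memory in practice depends on how much of the file the PDF parser has to read, so treat it as an experiment rather than a recommendation.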

Answer 1: score 2, accepted, 2021-05-18 16:55:14

Answer 2: score 1, 2021-05-14 13:42:42
