How do I split a PDF in Google Cloud Storage using Python?

I have a single PDF, and I would like to create a separate PDF for each of its pages. How can I do so without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify the different files; that would be ideal), but it is not publicly available.

I am currently using PyPDF2 to do this:

    list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
    print(len(list_of_blobs))
    list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
    
    inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))

    individual_files = []
    stream = io.StringIO()
    
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        individual_files.append(output)
        with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
            outputStream.write(stream.getvalue())
            #print(outputStream.read())
            with open(outputStream.name, 'rb') as f:
                data = f.seek(85)
                data = f.read()
                individual_files.append(data)
                bucket.blob('processed/' +  "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')

In the output, I see different PyPDF2 objects such as <PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>, but I have no idea how to proceed from there. I am also open to using other libraries if they work better.

There were two reasons why my program was not working:

  1. I was trying to read a file that was opened in append mode (I fixed this by moving the second with open(...) block outside of the first one).
  2. I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a').

Below is the corrected code:

    # inputpdf, bucket and prefix are defined earlier, as in the question.
    if inputpdf.numPages > 2:
        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            # Write the single-page PDF as bytes ('wb', not 'a').
            with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
                output.write(outputStream)
            # Re-open for reading only after the write block has closed the file.
            with open(outputStream.name, "rb") as f:
                data = f.read()
                bucket.blob(prefix + "/processed/" + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type="application/pdf")

To split a PDF file into several smaller files (one per page), you need to download the data. You can materialize the data in a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.

In both cases:

  • The data will reside in memory.
  • You need to get the data to perform the PDF split.
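
As a rough illustration of the "keep it in memory in a Python variable" option, here is a minimal sketch that avoids temporary files altogether. It is not the code from the question: it assumes the newer PyPDF2 names (PdfReader, PdfWriter, add_page, available in PyPDF2 2.x and later) rather than the older PdfFileReader API used above, and the helper names split_pdf_bytes, page_blob_name and split_blob_in_memory are hypothetical. The bucket argument is assumed to be a google-cloud-storage Bucket.

```python
import io


def split_pdf_bytes(pdf_bytes):
    """Return a list of bytes objects, one single-page PDF per input page."""
    # Third-party import kept local so the other helpers work without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    pages = []
    for page in reader.pages:
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()  # binary buffer, not io.StringIO
        writer.write(buffer)
        pages.append(buffer.getvalue())
    return pages


def page_blob_name(prefix, page_number):
    """Build the destination object name (hypothetical naming scheme)."""
    return "%s/processed/page-%s.pdf" % (prefix, page_number)


def split_blob_in_memory(bucket, source_name, prefix, splitter=split_pdf_bytes):
    """Download one PDF blob, split it in memory, upload one blob per page."""
    data = bucket.blob(source_name).download_as_bytes()
    uploaded = []
    for number, page_bytes in enumerate(splitter(data), start=1):
        name = page_blob_name(prefix, number)
        bucket.blob(name).upload_from_string(
            page_bytes, content_type="application/pdf"
        )
        uploaded.append(name)
    return uploaded
```

Usage would look something like split_blob_in_memory(bucket, "tmp/input.pdf", "tmp"), where "tmp/input.pdf" is a placeholder object name. The whole source PDF and each page still pass through memory, as noted above; only the local file is avoided.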

If you absolutely want to read the data as a stream (I don't know whether that is possible with the PDF format!), you can use the streaming feature of GCS. However, because there is no CRC check on the downloaded data, I would not recommend this solution unless you are ready to handle corrupted data, retries, and all the related issues.
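
For completeness, a streaming read might be sketched as follows. This is an assumption-laden illustration, not a recommendation: it assumes a google-cloud-storage version that provides Blob.open("rb"), that the returned file-like object is seekable (PDF parsers generally need to seek, since the cross-reference table sits at the end of the file), and the newer PyPDF2 names (PdfReader, PdfWriter, add_page). The helper names split_stream and split_blob_streaming are hypothetical, and the sketch does nothing about the integrity and retry concerns mentioned above.

```python
import io


def split_stream(pdf_stream):
    """Yield (page_number, pdf_bytes) for each page of a seekable binary stream."""
    # Third-party import kept local so the upload helper works without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_stream)
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()
        writer.write(buffer)
        yield number, buffer.getvalue()


def split_blob_streaming(bucket, source_name, prefix, splitter=split_stream):
    """Read the source blob through a file-like object and upload each page."""
    uploaded = []
    # Pages are read lazily, so the stream must stay open while splitting.
    with bucket.blob(source_name).open("rb") as source:
        for number, page_bytes in splitter(source):
            name = "%s/processed/page-%s.pdf" % (prefix, number)
            bucket.blob(name).upload_from_string(
                page_bytes, content_type="application/pdf"
            )
            uploaded.append(name)
    return uploaded
```

Each single-page PDF is still built in memory before upload; only the read side is streamed, so the memory saving applies mainly to large source files.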

Note: technical posts on this site are shared under the CC BY-SA 4.0 license; if you republish this post, please credit this site's URL or the original address.
