How do I split a PDF in Google Cloud Storage using Python?
I have a single PDF, and I would like to create a separate PDF for each of its pages. How can I do this without downloading anything locally? I know that Document AI has a file splitting module (which would actually identify the different files, and that would be ideal), but it is not publicly available.
I am currently using PyPDF2 to do this:
list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))
individual_files = []
stream = io.StringIO()
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' + "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
In the output, I see PyPDF2 objects such as <PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>, but I have no idea how to proceed from there. I am also open to using other libraries if those work better.
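There were two reasons why my program was not working:
1. The second with open(...) block was nested inside the first, so the file was read back before it had been written and closed. I fixed this by moving the second with open(...) block outside of the first one.
2. The file was opened in text append mode ("a") and only the empty StringIO buffer was written to it. The corrected code opens the file in binary write mode ("wb") and lets the PdfFileWriter write into it directly.
Below is the corrected code: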
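# Continues from the setup in the question: bucket and inputpdf are already
# defined, and prefix is assumed to hold the destination folder name.
if inputpdf.numPages > 2:
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        # Write the single-page PDF in binary mode; the file is closed on exit.
        with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
            output.write(outputStream)
        # Re-open the finished file and read its bytes for the upload.
        with open(outputStream.name, 'rb') as f:
            data = f.read()
        bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')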
To split a PDF file into several smaller files (one per page), you need to download its data. You can either materialize the data in a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.
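In both cases, the PDF's bytes have to be fetched from the bucket before the split, and the resulting pages uploaded afterwards.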
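The in-memory variant can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names split_pdf_bytes, upload_pages and split_blob are made up for illustration, the PyPDF2 calls are the pre-3.0 names already used in this thread (later releases renamed them to PdfReader and PdfWriter), and the bucket argument is assumed to be a google-cloud-storage Bucket.

```python
import io


def split_pdf_bytes(pdf_bytes):
    """Yield one bytes object per page of the given PDF."""
    # Imported here so upload_pages can be used without PyPDF2 installed.
    # Assumption: PyPDF2 < 3.0, the API used elsewhere in this thread.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(io.BytesIO(pdf_bytes))
    for index in range(reader.numPages):
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(index))
        buffer = io.BytesIO()  # in-memory stand-in for a file under /tmp
        writer.write(buffer)
        yield buffer.getvalue()


def upload_pages(bucket, pages, dest_prefix):
    """Upload each page's bytes and return the object names written."""
    names = []
    for number, page_bytes in enumerate(pages, start=1):
        # Hypothetical naming scheme: <dest_prefix>/page-<n>.pdf
        name = "%s/page-%s.pdf" % (dest_prefix.rstrip("/"), number)
        bucket.blob(name).upload_from_string(
            page_bytes, content_type="application/pdf"
        )
        names.append(name)
    return names


def split_blob(bucket, source_name, dest_prefix):
    """Download one object into memory, split it, and upload the pages."""
    pdf_bytes = bucket.blob(source_name).download_as_bytes()
    return upload_pages(bucket, split_pdf_bytes(pdf_bytes), dest_prefix)
```

With a bucket obtained from storage.Client().bucket(...), a call such as split_blob(bucket, "tmp/input.pdf", "tmp/processed") would write one object per page; both paths here are placeholders. Nothing touches the local disk, but the whole source PDF and each page are held in memory, so this suits files that comfortably fit in RAM.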
If you absolutely want to read the data as a stream (I don't know whether that is possible with the PDF format!), you can use the streaming feature of GCS. But because there is no CRC check on the downloaded data, I would not recommend this solution unless you are ready to handle corrupted data, retries, and everything that goes with them.
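To give an idea of what handling corrupted data and retries might involve, here is a rough sketch. It assumes the streaming feature is used through the file-like reader returned by blob.open("rb") in the google-cloud-storage client, and it compares the bytes read against the base64-encoded MD5 stored in the object's metadata. The helper names md5_matches, fetch_verified and stream_blob are invented for this example, and the retry policy is only illustrative.

```python
import base64
import hashlib
import time


def md5_matches(data, expected_b64):
    """Compare data against a base64-encoded MD5 digest (the form GCS stores)."""
    actual = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return actual == expected_b64


def fetch_verified(read_bytes, expected_b64, attempts=3, delay_seconds=1.0):
    """Call read_bytes() until its result matches the digest or attempts run out."""
    for attempt in range(1, attempts + 1):
        data = read_bytes()
        if md5_matches(data, expected_b64):
            return data
        if attempt < attempts:
            time.sleep(delay_seconds)  # fixed pause; a real policy might back off
    raise IOError(
        "content did not match the expected MD5 after %d attempts" % attempts
    )


def stream_blob(bucket, name):
    """Read an object through the streaming reader and verify what was read."""
    blob = bucket.get_blob(name)  # get_blob also loads metadata such as md5_hash
    if blob is None:
        raise FileNotFoundError(name)

    def read_bytes():
        with blob.open("rb") as stream:
            return stream.read()

    # Assumption: md5_hash is populated; it can be missing (e.g. for composite
    # objects), in which case this check can never succeed.
    return fetch_verified(read_bytes, blob.md5_hash)
```

Note that this reads the whole object before checking it, so it only restores the integrity check and does not keep memory use low. Verifying data while processing it chunk by chunk would need more work than is shown here.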