Python converting PDF

I have the following code to create multiple JPGs from a single multi-page PDF. However, I get the following error: wand.exceptions.BlobError: unable to open image '{uuid}.jpg': No such file or directory @ error/blob.c/OpenBlob/2841, even though the image has been created. I initially thought it might be a race condition, so I put in a time.sleep(), but that didn't work either, so I don't believe that's it. Has anyone seen this before?

import subprocess
import time
import uuid

import PyPDF2
from wand.image import Image

def split_pdf(pdf_obj, step_functions_client, task_token):
    print(time.time())

    read_pdf = PyPDF2.PdfFileReader(pdf_obj)
    images = []

    for page_num in range(read_pdf.numPages):
        output = PyPDF2.PdfFileWriter()
        output.addPage(read_pdf.getPage(page_num))

        generateduuid = str(uuid.uuid4())
        filename = generateduuid + ".pdf"
        outputfilename = generateduuid + ".jpg"
        with open(filename, "wb") as out_pdf:
            output.write(out_pdf) # write to local instead

        image = {"page": str(page_num + 1)}  # Start at 1 rather than 0

        create_image_process = subprocess.Popen(["gs", "-o " + outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename], stdout=subprocess.PIPE)
        create_image_process.wait()

        time.sleep(10)
        with(Image(filename=outputfilename)) as img:
            image["image_data"] = img.make_blob('jpeg')
            image["height"] = img.height
            image["width"] = img.width
            images.append(image)

            if hasattr(step_functions_client, 'send_task_heartbeat'):
                step_functions_client.send_task_heartbeat(taskToken=task_token)

    return images

It looks like you aren't passing in a value when you try to open the PDF in the first place, hence the error you are receiving.

Make sure you format the string with the full file path as well, e.g. f'/path/to/file/{uuid}.jpg' or '/path/to/file/{}.jpg'.format(uuid).
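As a minimal sketch of the full-path advice, the helper below builds the Ghostscript argument list with an absolute output path. The name build_gs_command and its out_dir parameter are hypothetical and not part of the question's code. It also passes "-o" and the path as two separate list items. The question's code joins them into one item containing a space; whether that contributes to the error is an assumption that this thread does not confirm.

```python
import os
import uuid


def build_gs_command(pdf_path, out_dir):
    """Return (argv, output_path) for rendering pdf_path to JPEG with gs.

    Hypothetical helper: it only builds the argument list, it does not
    run Ghostscript.
    """
    # Absolute path, so the file is found regardless of the working directory.
    output_path = os.path.join(
        os.path.abspath(out_dir), str(uuid.uuid4()) + ".jpg"
    )
    argv = [
        "gs",
        "-o", output_path,  # flag and value as separate argv items
        "-sDEVICE=jpeg",
        "-r300",
        "-dJPEGQ=100",
        os.path.abspath(pdf_path),
    ]
    return argv, output_path
```

The returned output_path is the same string that would later be given to Image(filename=...), so the writer and the reader cannot disagree about where the file lives.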

I don't really understand why you're using PyPDF2, Ghostscript, and Wand. You're not parsing or manipulating any PostScript, and Wand sits on top of ImageMagick, which sits on top of Ghostscript. You might be able to reduce the function down to one PDF utility.

from wand.image import Image

def split_pdf(pdf_obj, step_functions_client, task_token):
    images = []
    with Image(file=pdf_obj, resolution=300) as document:
        for index, page in enumerate(document.sequence):
            image = {
                "page": index + 1,
                "height": page.height,
                "width": page.width,
            }
            with Image(page) as frame:
                image["image_data"] = frame.make_blob("JPEG")
            images.append(image)
            if hasattr(step_functions_client, 'send_task_heartbeat'):
                step_functions_client.send_task_heartbeat(taskToken=task_token)
    return images

"I initially thought it may be a race condition so I put in a time.sleep() but that didn't work either so I don't believe that's it. Has anyone seen this before?"

The example code doesn't have any error handling. PDFs can be generated by software from many vendors, and a lot of them do a sloppy job. It's more than possible that PyPDF2 or Ghostscript failed, and you never got a chance to handle it.

For example, when I use Ghostscript on PDFs generated by a random website, I often see the following message on stderr ...

ignoring zlib error: incorrect data check

... which results in incomplete documents or blank pages.

Another common example is that system resources have been exhausted and no additional memory can be allocated. This happens all the time with web servers, and the solution is usually to migrate the task over to a queue worker that can cleanly shut down at the end of each task.
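To make the missing error handling concrete, here is a rough sketch of a wrapper around the external converter call. The function name run_converter and its expected_output parameter are hypothetical. It assumes the converter reports failure through a non-zero exit code, which may not cover every case, so it also checks that the output file exists and hands back stderr so that warnings like the zlib message above can be logged.

```python
import os
import subprocess


def run_converter(argv, expected_output):
    """Run an external converter; raise if it fails or writes no output.

    Returns the converter's stderr text so the caller can log warnings.
    Hypothetical helper, not taken from the question's code.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "converter exited with %d: %s"
            % (result.returncode, result.stderr.strip())
        )
    # A zero exit code does not prove that a file was written.
    if not os.path.isfile(expected_output):
        raise FileNotFoundError(
            "converter wrote no file at " + expected_output
        )
    return result.stderr
```

In the question's loop this would replace the Popen/wait pair, with the Ghostscript argument list as argv and the JPEG path as expected_output. A failure then surfaces at the conversion step instead of later as a BlobError from Wand.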
