
Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python in Manjaro) Issues

I was following the answer to this question: Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python), in order to export a PDF document to a DOCX one.

I can successfully get the exported document data and save it to a docx file. The problem is that when I try to open it with LibreOffice, it shows this message:

The file 'test.docx' is corrupt and therefore cannot be opened. LibreOffice can try to repair the file. The corruption could be the result of document manipulation or of structural document damage due to data transmission. We recommend that you do not trust the content of the repaired document. Execution of macros is disabled for this document.

Should LibreOffice repair the file?

When I click "Yes", it complains:

The file 'test.docx' could not be repaired and therefore cannot be opened.

and then LibreOffice just closes. I also tried opening the test.docx file with Google Docs and with OneDrive, but neither is able to open it. OneDrive shows a message saying something like:

This document cannot be opened for editing.

Here is the whole Python script I'm using (I replaced the sensitive info with placeholders):

import requests
import json
import time

url = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%7B%22reltype%22%3A%20%22http%3A%2F%2Fns.adobe.com%2Frel%2Fprimary%22%7D"

payload = {
    "cpf:engine": {
        "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
    },
    "cpf:inputs": {
        "params": {
            "cpf:inline": {
                "targetFormat": "docx"
            }
        },
        "documentIn": {
            "dc:format": "application/pdf",
            "cpf:location": "InputFile0"
        }
    },
    "cpf:outputs": {
        "documentOut": {
            "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "cpf:location": "/path_to_the_output_file/output.docx"
        }
    }
}

pdf_file = {"InputFile0": open('/path_to_the_pdf_file/mypdf.pdf','rb')}

headers = {
  'Authorization': 'Bearer My_bearer_tok',
  'Accept': 'application/json, text/plain, */*',
  'x-api-key': 'eb78dc9d04e54f10be1fd2189d91f8c9',
  'Prefer': 'respond-async,wait=0'
}

body = {"contentAnalyzerRequests": json.dumps(payload)}

response = requests.post(url=url, headers=headers, data=body, files=pdf_file)
print(response.text)
print(response.headers)
print(response.status_code)

time.sleep(5)

poll = True
while poll:
    new_request = requests.get(response.headers['location'], headers=headers)
    if new_request.status_code == 200:
        with open('test.docx', 'wb') as f:
            f.write(bytes(new_request.content))
        poll = False
    else:
        time.sleep(5)

I have also tried both suggestions on how to save the file, in the last part of the code:

with open('test.docx', 'wb') as f:
    f.write(new_request.content)
with open('test.docx', 'wb') as f:
    f.write(bytes(new_request.content))

But neither seems to work. I should also mention that I was able to manually convert the PDF to DOCX with Adobe from their web page and open the result with LibreOffice, but I need to be able to do it automatically.

Update 1: I tried opening the docx file on a Windows laptop with Microsoft Word. It also complained that the document contained unrecognized information, but it was finally able to show the docx document.

Update 2: It's also possible to open the generated docx with the python pdf2docx package.

So, any idea about what might be happening is welcome (maybe that's the expected behavior). Thank you!

From what I see, you are taking the response "as is" and saving it, but the response is actually a multipart form response. You have to parse that first; the actual bytes of your document are inside it. The multipart response includes JSON info plus the binary, which is why you need to parse it.
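As a rough illustration of the parsing step described above, here is a minimal sketch that uses Python's standard `email` parser to split a multipart body and pull out one part. It assumes the response's Content-Type header is a `multipart/*` type carrying a `boundary` parameter, and that the document part is labeled with the DOCX MIME type used in the question's payload. The function name `extract_part` and the constant `DOCX_MIME` are hypothetical helpers, not part of Adobe's API, and the real response may label its parts differently.

```python
from email.parser import BytesParser
from typing import Optional

# MIME type taken from the "dc:format" of documentOut in the question's payload.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def extract_part(content_type: str, body: bytes,
                 wanted_type: str = DOCX_MIME) -> Optional[bytes]:
    """Return the bytes of the first part whose Content-Type matches wanted_type.

    Hypothetical helper: content_type is the value of the HTTP Content-Type
    header (it carries the multipart boundary), body is the raw response body.
    Returns None if the body is not multipart or no part matches.
    """
    # The email parser expects headers followed by a blank line and the body,
    # so prepend the HTTP Content-Type header to the raw bytes.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = BytesParser().parsebytes(raw)
    if not message.is_multipart():
        return None
    for part in message.walk():
        if part.is_multipart():
            continue  # skip container parts, only inspect leaf parts
        if part.get_content_type() == wanted_type.lower():
            # decode=True returns the payload as bytes
            return part.get_payload(decode=True)
    return None
```

In the polling loop of the question, this could be used in place of writing `new_request.content` directly, for example `docx = extract_part(new_request.headers["Content-Type"], new_request.content)`, writing `docx` to `test.docx` only when it is not `None`. If no part matches, printing each part's Content-Type is a reasonable way to see what the service actually returned.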
