Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python)

I am trying to query the Adobe PDF Services API to generate (export) DOCX files from PDF documents.

I just wrote Python code to generate a Bearer token so that Adobe PDF Services can identify me (see the question here: https://stackoverflow.com/questions/68351955/tunning-a-post-request-to-reach-adobe-pdf-services-using-python-and-a-rest-api ). Then I wrote the following piece of code, where I tried to follow the instructions on this page concerning the EXPORT option of Adobe PDF Services (here: https://documentcloud.adobe.com/document-services/index.html#post-exportPDF ).

Here is the piece of code:

import requests
import json
from requests.structures import CaseInsensitiveDict
# N.B.: the code that generates the token and handles authentication is omitted.
# This part is a POST request that uploads my PDF file via form parameters.
URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

headers = CaseInsensitiveDict()
headers["x-api-key"] = "client_id"
headers["Authorization"] = "Bearer MYREALLYLONGTOKENIGOT"
headers["Content-Type"] = "application/json"

myfile = {"file":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j="""
{
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/trs_pdf_file_copy.pdf"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}"""

resp = requests.post(url=URL, headers=headers, json=json.dumps(j), files=myfile)
   

print(resp.text)
print(resp.status_code)

The status code is 400. I am properly authenticated by the server, but print(resp.text) gives the following:

{"requestId":"the_request_id","type":"Bad Request","title":"Not a multipart request. Aborting.","status":400,"report":"{\"error_code\":\"INVALID_MULTIPART_REQUEST\"}"}

I think I am having trouble understanding the "form parameters" in the Adobe guide for the POST method of the API's EXPORT job ( https://documentcloud.adobe.com/document-services/index.html ).

Would you have any ideas for improvement? Thank you!

Make your variable j a Python dict first, then create a JSON string from it. What's also not super clear from Adobe's documentation is that the value of documentIn.cpf:location needs to be the same as the key used for your file. I've corrected this to InputFile0 in your script. I'm also guessing you want to save your file, so I've added that too.

import requests
import json
import time

URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

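# token and client_id come from the authentication step (not shown here).
# No Content-Type is set: when files= is used, requests generates the
# multipart/form-data header (including the boundary) itself.
headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/json, text/plain, */*',
    'x-api-key': client_id,
    'Prefer': "respond-async,wait=0",
}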

myfile = {"InputFile0":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j={
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "InputFile0"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}

body = {"contentAnalyzerRequests": json.dumps(j)}

resp = requests.post(url=URL, headers=headers, data=body, files=myfile)
   

print(resp.text)
print(resp.status_code)

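# Poll the URL from the Location header until the result is ready.
poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        # Write the converted document in binary mode and close the file.
        with open('test.docx', 'wb') as f:
            f.write(new_request.content)
        poll = False
    else:
        time.sleep(5)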
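
The pairing between the multipart file key and documentIn.cpf:location can be made explicit with a small helper. This is only a sketch: build_export_form and its parameters are hypothetical names, not part of any Adobe SDK. The engine asset ID and the contentAnalyzerRequests field name are copied from the script above, and the output location is left as a parameter because the script does not establish what value the service expects there.

```python
import json

# Engine ID and MIME type copied from the script above.
EXPORT_ENGINE = "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_export_form(file_key, output_location):
    """Return the form fields for a PDF-to-DOCX job (hypothetical helper).

    file_key must also be the key of the `files` dict given to requests.post,
    so that cpf:location refers to the uploaded part.
    """
    job = {
        "cpf:engine": {"repo:assetId": EXPORT_ENGINE},
        "cpf:inputs": {
            "params": {"cpf:inline": {"targetFormat": "docx"}},
            "documentIn": {
                "dc:format": "application/pdf",
                "cpf:location": file_key,
            },
        },
        "cpf:outputs": {
            "documentOut": {
                "dc:format": DOCX_MIME,
                "cpf:location": output_location,
            }
        },
    }
    # The job description travels as a JSON string inside one form field.
    return {"contentAnalyzerRequests": json.dumps(job)}


form = build_export_form("InputFile0", "output.docx")
job = json.loads(form["contentAnalyzerRequests"])
print(job["cpf:inputs"]["documentIn"]["cpf:location"])
```

The returned dict would be passed as data= and the PDF as files={"InputFile0": file_handle}, as in the script above, so both sides use the same key.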

I don't know why the docx file (which is created fine, by the way) doesn't open; a popup says that the content is not readable. Maybe it's due to the 'wb' write mode.

I had the same issue. Casting the response content to bytes solved it.

poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        with open('test.docx', 'wb') as f:
            f.write(bytes(new_request.content))
        poll = False
    else:
        time.sleep(5)
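
The loop above keeps polling if the job never completes, and it writes whatever bytes come back. A minimal sketch of a bounded variant with a sanity check follows. poll_until_ready and looks_like_docx are hypothetical helper names, and the check relies only on the fact that a DOCX file is a ZIP archive. It does not confirm the cause of the unreadable file reported above, but it can show whether the downloaded body is a plain DOCX at all.

```python
import io
import time
import zipfile


def poll_until_ready(fetch, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() until it reports status 200, at most max_attempts times.

    fetch is any zero-argument callable returning (status_code, content).
    In the script above it would wrap
    requests.get(resp.headers['location'], headers=headers).
    """
    for attempt in range(max_attempts):
        status, content = fetch()
        if status == 200:
            return content
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"result not ready after {max_attempts} attempts")


def looks_like_docx(content):
    """Return True if the bytes start with a ZIP local-file header and
    parse as a ZIP archive, which every DOCX file does."""
    if content[:4] != b"PK\x03\x04":
        return False
    return zipfile.is_zipfile(io.BytesIO(content))


# Demonstration with a fake fetch, so no network access is needed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
responses = iter([(202, b""), (200, buffer.getvalue())])

data = poll_until_ready(lambda: next(responses), sleep=lambda seconds: None)
print(looks_like_docx(data))
```

If looks_like_docx returns False for a real response, the body is probably wrapped in something else (for example JSON or a multipart envelope) and needs to be unpacked before it is written to disk.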
