Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python in Manjaro) Issues

I was following the answer to the question "Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python)" in order to export a PDF document to DOCX.

I am able to successfully get the exported document data and save it into a docx file. The problem is that when I try to open it with LibreOffice, it shows a message saying:

The file 'test.docx' is corrupt and therefore cannot be opened. LibreOffice can try to repair the file. The corruption could be the result of document manipulation or of structural document damage due to data transmission. We recommend that you do not trust the content of the repaired document. Execution of macros is disabled for this document.

Should LibreOffice repair the file?

When I click "Yes", it complains saying:

The file 'test.docx' could not be repaired and therefore cannot be opened.

and then LibreOffice just closes. I also tried opening the test.docx file with Google Docs and with OneDrive, but neither is able to open it. OneDrive shows a message saying something like:

This document cannot be opened for editing.

Here is the whole Python script I'm using (I replaced the sensitive info with placeholders):

import requests
import json
import time

url = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%7B%22reltype%22%3A%20%22http%3A%2F%2Fns.adobe.com%2Frel%2Fprimary%22%7D"

payload = {
    "cpf:engine": {
        "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
    },
    "cpf:inputs": {
        "params": {
            "cpf:inline": {
                "targetFormat": "docx"
            }
        },
        "documentIn": {
            "dc:format": "application/pdf",
            "cpf:location": "InputFile0"
        }
    },
    "cpf:outputs": {
        "documentOut": {
            "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "cpf:location": "/path_to_the_output_file/output.docx"
        }
    }
}

pdf_file = {"InputFile0": open('/path_to_the_pdf_file/mypdf.pdf','rb')}

headers = {
  'Authorization': 'Bearer My_bearer_tok',
  'Accept': 'application/json, text/plain, */*',
  'x-api-key': 'eb78dc9d04e54f10be1fd2189d91f8c9',
  'Prefer': 'respond-async,wait=0'
}

body = {"contentAnalyzerRequests": json.dumps(payload)}

response = requests.post(url=url, headers=headers, data=body, files=pdf_file)
print(response.text)
print(response.headers)
print(response.status_code)

time.sleep(5)

poll = True
while poll:
    new_request = requests.get(response.headers['location'], headers=headers)
    if new_request.status_code == 200:
        with open('test.docx', 'wb') as f:
            f.write(bytes(new_request.content))
        poll = False
    else:
        time.sleep(5)

I have also tried both suggested ways of saving the file in the last part of the code:

with open('test.docx', 'wb') as f:
    f.write(new_request.content)

and

with open('test.docx', 'wb') as f:
    f.write(bytes(new_request.content))

But neither seems to work. I should also mention that I was able to manually convert the PDF to DOCX with Adobe from their web page and open the result with LibreOffice, but I need to be able to do it automatically.

Update 1: I tried opening the docx file on a Windows laptop with Microsoft Word. It also complained that the document contained unrecognized content, but it was eventually able to display the document.

Update 2: It's also possible to open the generated docx with the Python pdf2docx package.

So, any idea about what might be happening is welcome (maybe that's the expected behavior), thank you!

From what I can see, you are taking the response "as is" and saving it, but the response is actually a multipart response. It contains JSON metadata plus the binary document, so you have to parse it first and write only the binary part to the .docx file. Saving the whole body puts the boundary lines and the JSON in front of the real document data, which is why most applications report the file as corrupt.
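As a rough illustration of that parsing step, here is a minimal sketch that uses only Python's standard email package. The helper name extract_part is hypothetical, and the sketch assumes that the DOCX part of the response is labelled with the same MIME type that was requested under "dc:format" in "cpf:outputs". I have not checked that against the service, so if no part matches, print each part's content type and adjust.

```python
import email

# Assumption: the service labels the document part with this MIME type.
DOCX_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def extract_part(content_type_header, body, wanted_type=DOCX_TYPE):
    """Return the raw bytes of the first part whose Content-Type is wanted_type.

    content_type_header: value of the HTTP response's Content-Type header
                         (it carries the multipart boundary).
    body:                the raw response body as bytes.
    """
    # The stdlib parser reads the boundary from a Content-Type header, so
    # prepend the HTTP header to the raw body before parsing.
    raw = (
        b"Content-Type: " + content_type_header.encode("ascii")
        + b"\r\n\r\n" + body
    )
    message = email.message_from_bytes(raw)
    if not message.is_multipart():
        raise ValueError("response is not multipart")
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the outer container itself
        if part.get_content_type() == wanted_type.lower():
            # decode=True returns bytes and undoes any transfer encoding
            return part.get_payload(decode=True)
    raise LookupError("no part with content type " + wanted_type)
```

In the polling loop from the question, the write would then become something like f.write(extract_part(new_request.headers['Content-Type'], new_request.content)) instead of writing new_request.content directly. The third-party requests-toolbelt package also ships a multipart decoder that could be used in place of this helper.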
