Python - urllib3 get text from docx using tika server

I am using Python 3, urllib3 and tika-server-1.13 to extract text from different types of files. This is my Python code:

def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True

This code works for HTML files, but when I try to extract text from docx files it doesn't work.

The server responds with HTTP error code 422: Unprocessable Entity

Using the tika-server documentation I've tried using curl to check if it works with it:

curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

and it worked.

The tika-server docs say:

422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc

This is the correct mime-type (I also checked it with Tika's detect endpoint), it's supported, and the file is not encrypted.

I believe this is related to how I upload the file to the Tika server. What am I doing wrong?

You're not uploading the data in the same way. curl's --data-binary simply uploads the binary data as it is, with no encoding. In urllib3, passing fields causes urllib3 to generate a multipart/form-data message. On top of that, by setting headers['Content-Type'] yourself you prevent urllib3 from setting the multipart Content-Type header (which carries the boundary) on the request, so Tika cannot make sense of the body. Either stop updating headers['Content-Type'] or simply pass body=input_file.read() instead of fields.
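As a rough illustration of the second option, here is a minimal sketch that sends the raw file bytes as the request body, mirroring what the curl command does. The function name extract_text, the DOCX_MIME constant and the retries parameter are made up for this example and are not part of the question's code; the pool is assumed to be a urllib3 connection pool pointing at the Tika server.

```python
import urllib3

# MIME type taken from the curl command in the question.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def extract_text(pool, input_file_path, mime_type, retries=3):
    """PUT the raw file bytes to /tika and return the response text.

    Hypothetical helper: returns None if no attempt gets a 200 response.
    """
    headers = {}
    if mime_type:
        # Safe to set by hand here, because the body is NOT multipart.
        headers["Content-Type"] = mime_type

    # Read the whole file while it is still open.
    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()

    for _ in range(retries):
        # body= sends the bytes unchanged; fields= would build a
        # multipart/form-data message instead.
        response = pool.request("PUT", "/tika", body=payload, headers=headers)
        if response.status == 200:
            return response.data.decode("utf-8")
    return None


# Example usage (assumes a Tika server listening on localhost:9998):
#   pool = urllib3.HTTPConnectionPool("localhost", port=9998)
#   text = extract_text(pool, "test.docx", DOCX_MIME)
```

The bracket-stripping and whitespace clean-up from the question can then be applied to the returned string before writing it to the output file.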

I believe you can make this much easier by using the tika-python module with Client Only Mode.

If you still insist on rolling your own client, there may be some clues in the source code of this module showing how it handles all these different mime types. If you're having a problem with *.docx, you will probably have issues with other formats too.
