
Python - urllib3 get text from docx using tika server

I am using python3, urllib3 and tika-server-1.13 to get text from different types of files. This is my Python code:

def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True

This code works for HTML files, but when I try to parse text from docx files it doesn't work.

The server responds with HTTP error code 422: Unprocessable Entity.

Following the tika-server documentation, I tried the same upload with curl to check whether it works:

curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

and it worked.

From the tika-server docs:

422 Unprocessable Entity - Unsupported mime-type, encrypted document, etc.

This is the correct mime-type (I also checked it with Tika's detect endpoint), it is supported, and the file is not encrypted.

I believe this is related to how I upload the file to the tika server. What am I doing wrong?

You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by overriding headers['Content-Type'] you're preventing urllib3 from properly setting the multipart header on the request, so Tika cannot understand the body it receives. Either stop updating headers['Content-Type'] or simply pass body=input_file.read() instead of fields.
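As a rough illustration of the second option, here is a minimal sketch that sends the file bytes unmodified, which is roughly what curl --data-binary does. The helper name put_raw_file and the DOCX_MIME constant are made up for this example, and pool is assumed to be any object with a urllib3-style request() method, such as the self.pool from the question.

```python
import os

# MIME type taken from the curl command in the question.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def put_raw_file(pool, input_file_path, mime_type):
    """Hypothetical helper: PUT the file to /tika as a raw request body.

    `pool` is assumed to expose a urllib3-style request() method,
    e.g. a urllib3.HTTPConnectionPool.
    """
    if not os.path.isfile(input_file_path):
        raise FileNotFoundError(input_file_path)

    headers = {}
    if mime_type:
        # Safe to set here: the body really is a bare document of this type.
        headers["Content-Type"] = mime_type

    with open(input_file_path, "rb") as input_file:
        body = input_file.read()

    # body= sends the bytes as-is; fields= would wrap them in multipart/form-data.
    return pool.request("PUT", "/tika", headers=headers, body=body)
```

With urllib3 this could be called as put_raw_file(urllib3.HTTPConnectionPool("localhost", 9998), "test.docx", DOCX_MIME), assuming the Tika server from the question is listening on port 9998. The retry loop and regex cleanup from the original get_text can stay as they are around this call.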

I believe you can make this much easier by using the tika-python module in Client Only Mode.

If you still insist on rolling your own client, there may be some clues in the source code of that module showing how it handles all these different mime types. If you're having a problem with *.docx, you will probably have issues with other formats too.
