Python - urllib3: getting text from docx using a Tika server

Question

I am using python3, urllib3 and tika-server-1.13 to get text from different types of files. This is my python code:

def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True

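This code works for html files, but when I try to parse text from docx files it does not work.

The server responds with HTTP error code 422: Unprocessable Entity.

Following the tika-server documentation, I tried curl to check whether the server can handle the file: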

curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

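And it works.

From the tika-server documentation:

422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.

This is the correct mime-type (I also checked it with Tika's detection system), it is supported, and the file is not encrypted.

I believe this is related to how I upload the file to the tika server. What am I doing wrong?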

Answer 1

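You are not uploading the data the same way. In curl, --data-binary simply uploads the binary data as-is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by overriding Content-Type you are preventing urllib3 from setting that header correctly on the request (with the multipart boundary), so Tika cannot understand the body. Either stop updating headers['Content-Type'] or simply pass body=input_file.read().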
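As a rough sketch of the second suggestion, the code below first shows what the multipart wrapping does to the payload and then sends the raw bytes with body= instead. The helper names multipart_wrapping and put_raw_to_tika and the DOCX_MIME constant are made up for illustration; pool is assumed to be a urllib3 connection pool pointing at the Tika server, like self.pool in the question.

```python
import os

from urllib3.filepost import encode_multipart_formdata

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def multipart_wrapping(input_file_path, mime_type):
    """Return (body, content_type) for the question's fields dict.

    This mirrors the multipart encoding that is applied when fields=... is
    passed: the file bytes get wrapped in boundary lines and part headers,
    so the body is no longer a plain docx file.
    """
    with open(input_file_path, "rb") as input_file:
        fields = {
            "file": (os.path.basename(input_file_path), input_file.read(), mime_type)
        }
    return encode_multipart_formdata(fields)


def put_raw_to_tika(pool, input_file_path, mime_type):
    """PUT the file bytes unmodified, like curl --data-binary does.

    Hypothetical helper: `pool` is assumed to be a urllib3 connection pool
    for the Tika server.
    """
    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()
    # With body=..., no multipart encoding happens, so setting Content-Type
    # to the document's real mime-type is now accurate.
    headers = {"Content-Type": mime_type}
    return pool.request("PUT", "/tika", headers=headers, body=payload)


# Example usage (assumes a Tika server on localhost:9998, as in the curl command):
#   import urllib3
#   pool = urllib3.HTTPConnectionPool("localhost", port=9998)
#   response = put_raw_to_tika(pool, "test.docx", DOCX_MIME)
#   print(response.status)
```

In the question's function, this would amount to dropping the fields dict and passing the bytes read from input_file as body= in the existing self.pool.request call.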

Answer 2

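I believe you can make this much easier by using the tika-python module with Client Only Mode.

If you still insist on rolling your own client, the source code of that module may contain some clues about how it handles all these different mime-types. If you are having problems with *.docx, you will probably run into problems with other formats as well.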


Answer 1: 2 votes, accepted, 2016-08-02 11:37:48

Answer 2: 0 votes, 2018-01-25 18:39:37
