Python - urllib3 get text from docx using tika server
I am using python3, urllib3 and tika-server-1.13 to get text from different types of files. This is my Python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config
    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type
    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }
    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True
This code works for HTML files, but when I try to parse text from docx files it doesn't work.
The server returns HTTP error code 422: Unprocessable Entity
Following the tika-server documentation, I tried using curl to check whether it works:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
The tika server docs say:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct mime-type (I also checked it with Tika's detect system), it's supported, and the file is not encrypted.
I believe this is related to how I upload the file to the tika server. What am I doing wrong?
You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is. No encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by setting headers['Content-Type'] yourself you're preventing urllib3 from properly setting that header on the request (the multipart content type with its boundary) so Tika can understand it. Either stop updating headers['Content-Type'] or simply pass body=input_file.read() instead of fields.
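As a rough illustration of the second option, here is a minimal sketch that sends the file bytes as the raw request body, which is what curl's --data-binary does. The function name put_raw, the retries parameter and the DOCX_MIME constant are made up for this example and are not from the original code; pool is assumed to be a urllib3 connection pool such as self.pool in the question.

```python
# Hypothetical helper, not the asker's code: PUT the raw bytes to /tika.
DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)


def put_raw(pool, input_file_path, mime_type, retries=3):
    """Upload a file unencoded and return Tika's response text.

    `pool` is assumed to behave like a urllib3 connection pool, i.e. it has
    a request(method, url, body=..., headers=...) method whose response
    exposes .status and .data. Returns None if no attempt got HTTP 200.
    """
    headers = {}
    if mime_type:
        # Safe to set here: with body= there is no multipart header to clobber.
        headers["Content-Type"] = mime_type

    with open(input_file_path, "rb") as input_file:
        body = input_file.read()

    for _ in range(retries):
        # body= sends the bytes as-is; fields= would wrap them in multipart.
        response = pool.request("PUT", "/tika", body=body, headers=headers)
        if response.status == 200:
            return response.data.decode("utf-8")
    return None
```

With a pool created as urllib3.HTTPConnectionPool('localhost', port=9998), calling put_raw(pool, 'test.docx', DOCX_MIME) should produce a request equivalent to the curl command above. The regex cleanup and file writing from the original get_text can then be applied to the returned string.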
I believe you can make this much easier by using the tika-python module with Client Only Mode.
If you still insist on rolling your own client, there may be some clues in that module's source code showing how it handles all these different mime types. If you're having a problem with *.docx, you will probably have issues with other formats too.