
Using olefile to extract text from Word .doc

I only care about getting the text out of .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.

My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try:

from olefile import OleFileIO as ofio
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
>>>UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
>>>b'\xfa'
data[:9884].decode('utf-16')

Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece by piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?

(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs.)

I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". So I would guess that your extracted data contains more than plain text, such as control characters of some kind, and that this is why you can't decode it as UTF-16.
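To make that guess more concrete: in the Word 97-2003 binary format, as described in Microsoft's [MS-DOC] specification, the WordDocument stream begins with a header (the FIB), and the text is split into "pieces" that may be stored either as 16-bit or as 8-bit characters. The index of those pieces (the piece table) lives in a separate stream named 0Table or 1Table. The following is a minimal sketch of walking that structure, not a tested solution for the file in the question. The function names extract_text and read_doc_text are my own, the offsets are taken from my reading of the specification, and decoding 8-bit pieces as cp1252 is a simplifying assumption.

```python
import struct


def extract_text(word_stream: bytes, table_stream: bytes) -> str:
    """Rebuild the text of a Word 97-2003 file from its piece table (sketch)."""
    # fcClx / lcbClx in the FIB: offset and size of the Clx in the table stream.
    fc_clx, lcb_clx = struct.unpack_from('<II', word_stream, 0x01A2)
    clx = table_stream[fc_clx:fc_clx + lcb_clx]

    pos = 0
    # Skip optional Prc entries (type byte 0x01) that precede the piece table.
    while pos < len(clx) and clx[pos] == 0x01:
        (cb_grpprl,) = struct.unpack_from('<H', clx, pos + 1)
        pos += 3 + cb_grpprl
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError('piece table (Pcdt) not found')

    (lcb,) = struct.unpack_from('<I', clx, pos + 1)
    plc = clx[pos + 5:pos + 5 + lcb]

    # PlcPcd layout: n + 1 character positions (4 bytes each), then n Pcd (8 bytes each).
    n = (len(plc) - 4) // 12
    cps = struct.unpack_from('<%dI' % (n + 1), plc, 0)

    parts = []
    for i in range(n):
        # The file offset field sits 2 bytes into each 8-byte Pcd.
        (fc_raw,) = struct.unpack_from('<I', plc, 4 * (n + 1) + 8 * i + 2)
        n_chars = cps[i + 1] - cps[i]
        offset = fc_raw & 0x3FFFFFFF
        if fc_raw & 0x40000000:
            # Compressed piece: one byte per character, stored at offset / 2.
            # Assumption: cp1252 is close enough for these bytes.
            start = offset // 2
            raw = word_stream[start:start + n_chars]
            parts.append(raw.decode('cp1252', errors='replace'))
        else:
            # Uncompressed piece: two bytes per character, UTF-16 little endian.
            raw = word_stream[offset:offset + 2 * n_chars]
            parts.append(raw.decode('utf-16-le', errors='replace'))
    return ''.join(parts)


def read_doc_text(path: str) -> str:
    """Open a .doc with olefile and return its text (hypothetical helper)."""
    import olefile  # imported here so extract_text works without olefile installed

    ole = olefile.OleFileIO(path)
    try:
        word = ole.openstream('WordDocument').read()
        (flags,) = struct.unpack_from('<H', word, 0x000A)
        if flags & 0x0100:
            raise ValueError('document is encrypted')
        # Bit 9 of the flags (fWhichTblStm) selects the table stream.
        table_name = '1Table' if flags & 0x0200 else '0Table'
        table = ole.openstream(table_name).read()
    finally:
        ole.close()
    return extract_text(word, table)
```

Calling read_doc_text('d.doc') would then return one string. Expect it to contain more than the visible body: paragraphs end with '\r', and field codes, footnotes and header text show up as well, so some cleanup is likely still needed.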

There are Python modules to convert from doc files, but they tend to work only on Linux, where they make use of the command-line utilities antiword or catdoc.

I tried other solutions. If the issue is that you have no license for Word but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file containing Chinese characters from doc format to HTML:

"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc

LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows, but I couldn't get it to handle the Chinese characters.
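Since the question mentions having a lot of files, here is a rough sketch of how the conversion and the "process further" step could be scripted from Python. It calls the same swriter.exe as the command above and then strips the tags with the standard library's html.parser. The helper names html_to_text and doc_to_text are made up for this example, and I am assuming that the --headless and --outdir options behave as documented for LibreOffice and that the generated HTML is UTF-8 encoded. Adjust the path and encoding if your installation differs.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path

# Path taken from the command above; change it to match your installation.
SWRITER = r"c:\Program Files\LibreOffice\program\swriter.exe"

SKIP_TAGS = {'style', 'script', 'title'}
BLOCK_TAGS = {'p', 'div', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class _TextCollector(HTMLParser):
    """Collect visible text and put a line break after block-level elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == 'br':
            self.chunks.append('\n')

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append('\n')

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    lines = (' '.join(line.split()) for line in ''.join(parser.chunks).splitlines())
    return '\n'.join(line for line in lines if line)


def doc_to_text(doc_path, out_dir='.'):
    """Convert one .doc to HTML with LibreOffice, then extract its text."""
    subprocess.run(
        [SWRITER, '--headless', '--convert-to', 'html',
         '--outdir', str(out_dir), str(doc_path)],
        check=True,
    )
    html_path = Path(out_dir) / (Path(doc_path).stem + '.html')
    # Assumption: the exported file is UTF-8; undecodable bytes are replaced.
    return html_to_text(html_path.read_text(encoding='utf-8', errors='replace'))
```

For a whole folder, something like `for p in Path('docs').glob('*.doc'): print(doc_to_text(p))` would loop over the files, where 'docs' is a placeholder directory name.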


Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:

import win32com.client

app = win32com.client.Dispatch("Word.Application")
try:
    app.visible = False
    wb = app.Documents.Open('c:/temp/d.doc')
    doc = app.ActiveDocument
    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)
except Exception as e:
    print(e)
finally:
    app.Quit()

