Read from a Word file in Python
How can I read from a Word (.docx) file in Python? I can read from a .txt file, but I cannot do the same for an MS Office Word document. Any suggestions?
There are a couple of packages that let you do this. Check:
docx2txt (note that it does not seem to work with .doc files). As per this, it seems to get more info than python-docx. From the original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents. This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files. In this case, you would likely have to convert doc -> docx first, or extract the text from the .doc file directly. antiword, which extracts plain text from .doc files, is an option.
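To make the .zip point concrete, here is a minimal sketch that uses only the standard library. It assumes the usual layout of a .docx archive, in which the main body is stored as word/document.xml and text runs are held in w:t elements. The function name read_docx_text is made up for this illustration and is not part of any package mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return a list with the text of each paragraph in the main body.

    Hypothetical helper for illustration; `path` may be a filename or a
    binary file-like object, since zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of a paragraph is split across one or more <w:t> elements
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling read_docx_text("file.docx") would give one string per paragraph. This ignores headers, footers, footnotes, and images, and paragraphs nested inside text boxes may be counted twice, so the packages above are likely a better choice for anything beyond a quick look.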
See this library, which allows reading docx files: https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
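import docx  # provided by the python-docx package

doc = docx.Document('myfile.docx')
# Collect the text of every paragraph in the document body
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)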