Difficulty extracting XML from a Word document using Python

Question

I'm trying to extract the XML from a Word document with Python, using the code found on this webpage.

I began by creating a test document named test.docx. I then ran the following code:

import zipfile
from lxml import etree

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename))
    xmlContent = zip.read("word/document.xml")
    return xmlContent

def getXmlTree(xmlContent):
    return etree.fromstring(xmlContent)

testXml = getXml("test.docx")
print(getXmlTree(testXml))

Running this code produced the error message "File is not a zip file". What did I do wrong?

Answer 1

You need to pass the path of the file as the argument, not the .docx file itself. Compress the file and pass the path in zip format.

Example: "D:/Users/John/docs/data.zip"
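
As a side note, a .docx file is already a ZIP archive, so recompressing it should not normally be needed. The "File is not a zip file" error more likely comes from how the file is opened: open(docxFilename) opens it in text mode, whereas zipfile needs binary data or simply the path. Another possible cause is that test.docx was not actually saved by Word (for example, an empty file that was renamed). Below is a minimal sketch of the path-based approach. The names get_xml and get_xml_tree are illustrative, and the standard library's xml.etree.ElementTree is swapped in for lxml so the sketch has no third-party dependency; lxml's etree.fromstring can be used the same way.

```python
import zipfile
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def get_xml(docx_filename):
    """Return the raw bytes of word/document.xml from a .docx file."""
    # Check up front so a bad input gives a clear message.
    if not zipfile.is_zipfile(docx_filename):
        raise ValueError(f"{docx_filename!r} is not a valid ZIP/.docx file")
    # Pass the path itself; ZipFile opens the file in binary mode internally.
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read("word/document.xml")


def get_xml_tree(xml_content):
    """Parse the document XML bytes into an element tree."""
    return ET.fromstring(xml_content)
```

With a real Word file in the working directory, get_xml_tree(get_xml("test.docx")) should return the root element of the document. If the ValueError is raised instead, the file itself is probably not a genuine .docx.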

Answer 1: 2021-03-02 13:13:08
