
Extracting tables from a DOCX Word document in Python

I'm trying to extract the contents of the tables in a DOCX Word document, and I'm new to XML/XPath.

from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')

This triggers an "XPathEvalError: Undefined namespace prefix" error. I'm sure it's just the first of many to expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.

Could you kindly provide an example of table extraction?

After some back and forth, we found out that a namespace was needed for this to work correctly. The xpath method is the appropriate solution; it just needs to have the document namespace passed in first.

The lxml documentation for the xpath method has the details on namespaces. Look further down the linked page for how to pass a namespaces dictionary, among other details.

As explained by mgierdal in his comment above:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand prefix that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.
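To illustrate what the prefix-to-namespace mapping does, here is a minimal sketch. It is not the answer's exact method: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages, and a hand-written NAMESPACES dictionary stands in for document.nsmap. The names SAMPLE_XML, load_document_root and extract_tables are hypothetical helpers made up for this example, and the sample XML is a stripped-down stand-in for a real word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# The WordprocessingML namespace that the "w" prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NAMESPACES = {"w": W_NS}  # plays the role of document.nsmap

# Hypothetical, heavily simplified stand-in for word/document.xml.
SAMPLE_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B2</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>"""


def load_document_root(docx_file):
    """Parse word/document.xml out of a .docx file (which is a zip archive)."""
    with zipfile.ZipFile(docx_file) as archive:
        return ET.fromstring(archive.read("word/document.xml"))


def extract_tables(root):
    """Return every table as a list of rows, each row a list of cell strings."""
    tables = []
    # Without the namespaces mapping, the "w:" prefix cannot be resolved.
    for tbl in root.iterfind(".//w:tbl", NAMESPACES):
        rows = []
        for tr in tbl.iterfind("w:tr", NAMESPACES):
            cells = []
            for tc in tr.iterfind("w:tc", NAMESPACES):
                # A cell's text is spread over one or more w:t elements.
                texts = tc.iterfind(".//w:t", NAMESPACES)
                cells.append("".join(t.text or "" for t in texts))
            rows.append(cells)
        tables.append(rows)
    return tables


root = ET.fromstring(SAMPLE_XML)
tables = extract_tables(root)
print(tables)
```

For a real file, the root would come from load_document_root('someFile.docx') instead of the sample string. With lxml, the equivalent lookup is the xpath call quoted above, with the same kind of dictionary passed as namespaces.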

You can extract the tables from a docx file using python-docx. Check the following code:

from docx import Document

# file_path is the path to your .docx file
document = Document(file_path)

# list of the tables in the document
tables = document.tables

First install python-docx, as mentioned by @abdulsaboor:

pip install python-docx

Then this code should do:

from docx import Document


document = Document('myfile.docx')

for table in document.tables:
    print()
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
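If you want the table contents as data rather than printed output, the same loop can collect them into nested lists. This is only a sketch: document_tables_to_rows is a hypothetical helper name, and it relies solely on the .tables, .rows, .cells and .text attributes used in the code above. So that it runs without a .docx file at hand, it is exercised here on stand-in objects built with SimpleNamespace rather than on a real python-docx Document.

```python
from types import SimpleNamespace


def document_tables_to_rows(document):
    """Return all tables as nested lists: tables -> rows -> cell strings."""
    return [
        [[cell.text.strip() for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]


# Stand-in objects (an assumption for this demo, not python-docx classes)
# that mimic the attributes used above: .tables, .rows, .cells, .text
def _cell(text):
    return SimpleNamespace(text=text)


fake_document = SimpleNamespace(
    tables=[
        SimpleNamespace(
            rows=[
                SimpleNamespace(cells=[_cell("Name"), _cell("Qty ")]),
                SimpleNamespace(cells=[_cell("Apples"), _cell("3")]),
            ]
        )
    ]
)

result = document_tables_to_rows(fake_document)
print(result)
```

With python-docx installed, you would pass Document('myfile.docx') in place of fake_document.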

