Extracting tables from a DOCX Word document in Python

Question

I'm trying to extract the contents of the tables in a DOCX Word document, and boy, I'm new to XML/XPath.

from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')

This triggers an "XPathEvalError: Undefined namespace prefix" error. I'm sure it's just the first of many to expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.

Could you kindly provide an example of table extraction?

Answer 1

After some back and forth, we found out that a namespace was needed for this to work correctly. The xpath method is the appropriate solution; it just needs to have the document namespace passed in first.

The lxml documentation for the xpath method has the details on namespace handling. Look further down that page for how to pass a namespaces dictionary, and other details.

As explained by mgierdal in his comment above:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.
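
To make the prefix-to-namespace mapping concrete, here is a minimal sketch of the same idea. It is not the code from this answer: it swaps lxml's xpath for the standard library's zipfile and xml.etree.ElementTree, and the helper name extract_tables and the NS dictionary are made up for illustration. It assumes the main document part sits at word/document.xml inside the archive, which is the usual layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# The namespace URI that the "w:" prefix stands for in WordprocessingML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}  # prefix map, playing the role of document.nsmap


def extract_tables(docx_path):
    """Return every table as a list of rows, each row a list of cell strings.

    Hypothetical helper; docx_path may be a file path or a file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    # ".//w:tbl" only resolves because the prefix map is passed in.
    # Note: this also matches tables nested inside other tables.
    for tbl in root.findall(".//w:tbl", NS):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            cells = []
            for tc in tr.findall("w:tc", NS):
                # Cell text is split across w:t elements; join them with
                # no separator (paragraph breaks are not preserved).
                text = "".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t"))
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling extract_tables('someFile.docx') should give one nested list per table. Leaving out the NS argument is one way to run into an undefined-prefix error similar to the one in the question.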

Answer 2

You can extract the tables from a docx file using python-docx. Check the following code:

from docx import Document

# file_path is the path to the .docx file
document = Document(file_path)

# list of the Table objects found in the document
tables = document.tables

Answer 3

First install python-docx, as mentioned by @abdulsaboor:

pip install python-docx

Then this code should do:

from docx import Document


document = Document('myfile.docx')

for table in document.tables:
    print()
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
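
If the cell text needs to be kept rather than printed, the same loop can collect it into nested lists. The following is a rough sketch under that assumption; table_to_rows and read_tables are hypothetical helper names, not part of python-docx, and only the Document, tables, rows, cells and text attributes already used above are relied on.

```python
def table_to_rows(table):
    """Collect the text of every cell, row by row, into a list of lists.

    Works on any object exposing .rows, row.cells and cell.text,
    which is what python-docx Table objects provide.
    """
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def read_tables(path):
    """Open a .docx file and return all of its tables as nested lists."""
    # Imported lazily so table_to_rows can be used without python-docx.
    from docx import Document  # third-party: pip install python-docx

    return [table_to_rows(table) for table in Document(path).tables]
```

For example, read_tables('myfile.docx')[0] would be the first table, as a list of rows of strings.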

3 answers

Answer 1: score 3, accepted, 2011-08-18 19:18:26
Answer 2: score 0, 2019-08-19 12:33:29
Answer 3: score 0, 2021-02-18 11:50:43
