How to extract Word tables from multiple files using python-docx

Question

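I'm working on a project where I need to analyze more than a thousand MS Word files, each of which contains the same table. I only need to extract a few cells from each table and turn them into a row, then concatenate those rows into a data frame for further analysis.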

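I tested Python's docx library on a single file and it managed to read the table. However, after putting the same code inside a for loop (which first builds a variable holding all the file names and then passes each one to the Document function), the only output I get is a single table: the first table from the first file in the list.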

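I have a feeling I'm not looking at this the right way, and I would appreciate any guidance, because I'm completely stuck right now.

Below is the code I used. It consists mostly of code I stumbled upon on Stack Overflow: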

import os
import pandas as pd
from docx import Document  # provided by the python-docx package

file = [f for f in os.listdir() if f.endswith(".docx")]

for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []

    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)

        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue

        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)

Thanks.

Answer 1

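You are re-initializing the data list to [] (empty) for every document. So you carefully collect the row data from a document and then throw it away on the next iteration.

If you move data = [] outside the loop, it will contain all of the extracted rows once the loop over the documents has finished.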

data = []

for name in filenames:
    ...
    data.append(row_data)

print(data)
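
For completeness, here is a fuller sketch of the same idea, assuming every file's first table has a header row like the one in the question. The function names table_to_records and collect_records and the extra source_file column are hypothetical, introduced only for illustration. python-docx is imported inside collect_records so that the table-parsing helper can be tried out without it.

```python
import os


def table_to_records(table):
    """Turn one table into a list of dicts keyed by its header row."""
    records = []
    keys = None
    for i, row in enumerate(table.rows):
        text = tuple(cell.text for cell in row.cells)
        if i == 0:
            keys = text  # the first row supplies the column names
            continue
        records.append(dict(zip(keys, text)))
    return records


def collect_records(folder="."):
    """Gather the rows of the first table of every .docx file in folder."""
    # python-docx is a third-party package (pip install python-docx).
    from docx import Document

    data = []  # created once, before the loop, so nothing is discarded
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".docx"):
            continue
        document = Document(os.path.join(folder, name))
        for record in table_to_records(document.tables[0]):
            record["source_file"] = name  # hypothetical column: row origin
            data.append(record)
    return data
```

Since the result is a list of dictionaries, passing it to pandas.DataFrame(collect_records()) should give a single data frame with one row per table row across all files.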
