Python: get multiple docx file names and extract a specific word from the files to build a DataFrame or table

Question

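I want to read multiple Word documents (docx files) in a folder, search each one for a specific word, for example "laptop", and produce a table or a DataFrame. For example, my folder contains file_1.docx, file_2.docx ... file_n.docx, and each file may or may not contain the word "laptop". In the end I would like a table like this: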

FileName          Keyword
file_1.docx       "laptop"
file_2.docx       "laptop"
...

Answer 1

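If you are using Python 3.x, you will need to install python-docx:

pip install python-docx

Do not confuse it with the docx package, which gave me some problems when I tried to use it.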

import os
from docx import Document
import pandas as pd

match_word = "laptop"
match_items = []
folder = 'C:\\Dev\\Docs'

# Keep only the .docx files and build their full paths
file_names = os.listdir(folder)
file_names = [file for file in file_names if file.endswith('.docx')]
file_names = [os.path.join(folder, file) for file in file_names]

for file in file_names:
    document = Document(file)
    for paragraph in document.paragraphs:
        if match_word in paragraph.text:
            match_items.append([file, match_word])
            break  # one row per file is enough; stop at the first match

# One row per matching file, indexed by the file path
the_df = pd.DataFrame(
    match_items,
    columns=['file_name', 'word_match'],
    index=[i[0] for i in match_items]
)

print(the_df)

Output:

file_name              word_match
C:\Dev\Docs\c.docx     laptop
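
A possible variation, offered only as a sketch: document.paragraphs returns body paragraphs only, so a keyword that appears solely inside a table would be missed by the loop above. The version below also reads table cells, matches case-insensitively, and skips Word's "~$" lock files. The helper names read_docx_text and find_keyword_files and the read_text parameter are hypothetical, introduced here for illustration; the case-insensitive match is an assumption about what is wanted.

```python
import os


def read_docx_text(path):
    """Return body-paragraph and table-cell text of a .docx file as one string."""
    # Imported lazily so the rest of the module loads without python-docx.
    from docx import Document

    document = Document(path)
    parts = [paragraph.text for paragraph in document.paragraphs]
    # document.paragraphs does not include text inside tables,
    # so walk the top-level tables cell by cell as well.
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                parts.append(cell.text)
    return "\n".join(parts)


def find_keyword_files(folder, keyword, read_text=read_docx_text):
    """Return one {'file_name', 'word_match'} row per .docx file containing keyword.

    read_text is a hypothetical hook: any callable mapping a path to its text.
    """
    rows = []
    for name in sorted(os.listdir(folder)):
        # Skip non-docx files and the "~$..." lock files Word creates
        # while a document is open.
        if not name.lower().endswith(".docx") or name.startswith("~$"):
            continue
        text = read_text(os.path.join(folder, name))
        # Case-insensitive match (an assumption): "Laptop" counts as "laptop".
        if keyword.lower() in text.lower():
            rows.append({"file_name": name, "word_match": keyword})
    return rows
```

Calling find_keyword_files('C:\\Dev\\Docs', 'laptop') and passing the result to pd.DataFrame should give the same two columns, with bare file names as in the table from the question.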
