
Python get multiple docx file names and extract specific words from the files to generate a dataframe or table

I want to read multiple Word documents (docx files) in a folder and search each one for a specific word, e.g. "laptop", to generate a table or a dataframe. For instance, my folder contains file_1.docx, file_2.docx, ... file_n.docx, and each file may or may not contain the word "Laptop". In the end I hope to generate a table like:

FileName          Keyword
file_1.docx       "laptop"
file_2.docx       "laptop"
...

If you are using Python 3.x, you will need to install python-docx:

pip install python-docx

This is not to be confused with the docx package, which I had some issues using.
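
As far as I know, both python-docx and the older docx distribution are imported under the module name `docx`, so it can help to check which one is actually installed. A minimal sketch using the standard-library `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for this example:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as used with pip, not the import name.
    for name in ("python-docx", "docx"):
        print(name, "->", installed_version(name))
```

If the second line prints a version, uninstalling that distribution before installing python-docx is probably the simplest way to avoid a clash.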

import os
from docx import Document
import pandas as pd

match_word = "laptop"
match_items = []
folder = 'C:\\Dev\\Docs'
file_names = os.listdir(folder)
file_names = [file for file in file_names if file.endswith('.docx')]
file_names = [os.path.join(folder, file) for file in file_names]

for file in file_names:
    document = Document(file)
    for paragraph in document.paragraphs:
        if match_word in paragraph.text:
            match_items.append([file, match_word])
            break  # one row per file is enough; stop at the first match

# The default integer index is enough; the file name is already a column
the_df = pd.DataFrame(match_items, columns=['file_name', 'word_match'])

# Print without the index column
print(the_df.to_string(index=False))

Output:

file_name              word_match
C:\Dev\Docs\c.docx     laptop
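
The `match_word in paragraph.text` test is a plain substring check, so it is case-sensitive ("Laptop" is missed) and also fires on longer words such as "laptops". The question also allows for files that do not contain the word at all. Below is a minimal sketch of a whole-word, case-insensitive check that reports every file. The helper names `contains_word` and `build_rows` and the sample texts are assumptions for illustration; with python-docx, each list of paragraph texts would come from `[p.text for p in Document(path).paragraphs]`.

```python
import re


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def build_rows(texts_by_file, word):
    """Return one [file_name, found] row per file, including non-matches."""
    rows = []
    for file_name, paragraphs in texts_by_file.items():
        found = any(contains_word(p, word) for p in paragraphs)
        rows.append([file_name, found])
    return rows


# Made-up sample data standing in for the text of real .docx files.
sample = {
    "file_1.docx": ["I bought a new Laptop.", "It works well."],
    "file_2.docx": ["Two laptops were returned."],
    "file_3.docx": [],
}
rows = build_rows(sample, "laptop")
print(rows)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=['file_name', 'found'])` in the same way as `match_items`. Whether "laptops" should count as a match depends on the use case; dropping the `\b` anchors restores substring behaviour while keeping the case-insensitive comparison.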
