How to extract text from docx files contained in different folders

Question

I am writing code to extract text from Word documents with the .docx extension. I have a large folder named "EXTRACTION" that contains several subfolders (for example, folders 1, 2, 3, etc.), and each subfolder contains 2 to 10 docx documents. I want to extract the text from each of those files and put it in a new txt file.

I started writing this code, but it is not working (second version of the code):

import os
import glob
import docx



print(os.getcwd())

dirs = dirs = glob.glob('fi*')
path = os.getcwd()

for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)

This code does not seem to work. Could you help me improve it?

Answer 1

In your code, directory is just a string, so for filename in directory simply loops over its characters: f, i, c, h, i, e, r, etc.
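A minimal sketch of the difference, assuming a folder named fichier1 (a hypothetical name; the question only shows that the folders match 'fi*'). Iterating over the string yields characters, while a second glob call on the folder yields its entries:

```python
import glob
import os
import tempfile

directory = "fichier1"  # hypothetical folder name matched by 'fi*'

# Iterating over a string yields its characters, not the files in the folder.
chars = [ch for ch in directory]
print(chars)

# To get the folder's contents, ask the filesystem instead.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, directory)
    os.mkdir(folder)
    # Create an empty placeholder file so the folder has one entry.
    open(os.path.join(folder, "a.docx"), "w").close()
    entries = [os.path.basename(p) for p in glob.glob(os.path.join(folder, "*"))]
    print(entries)
```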

Also, you were overwriting your_file.txt on each iteration. You want to open it once, then loop over the documents you extract from.
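The overwriting can be reproduced without any Word documents. In this small sketch the two strings are stand-ins for extracted text, and the output file lives in a temporary directory:

```python
import os
import tempfile

texts = ["first document", "second document"]  # stand-ins for extracted text

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "your_file.txt")

    # Reopening with 'w' inside the loop truncates the file each time,
    # so only the last text survives.
    for text in texts:
        with open(out_path, "w") as f:
            f.write(text + "\n")
    with open(out_path) as f:
        overwritten = f.read()

    # Opening once, outside the loop, keeps everything.
    with open(out_path, "w") as f:
        for text in texts:
            f.write(text + "\n")
    with open(out_path) as f:
        kept = f.read()

print(repr(overwritten))
print(repr(kept))
```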

import glob
import os

import docx

with open('your_file.txt', 'w') as f:
    for directory in glob.glob('fi*'):
        for filename in glob.glob(os.path.join(directory, "*")):
            if filename.endswith((".docx", ".doc")):
                document = docx.Document(filename)    
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % item)

You are using item without declaring it, so there is still a bug here; I can't guess what you hoped this variable would contain, so I just left it the way it was in your original code.
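If the intent was to write each paragraph's text, as the second version of the question does with paragraph.text, the loop could be sketched as follows. The function names find_docx_files and write_paragraphs are made up for illustration. Only *.docx is matched because, as far as I know, python-docx cannot open legacy .doc files.

```python
import glob
import os


def find_docx_files(dir_pattern):
    """Return sorted .docx paths inside every directory matching dir_pattern."""
    paths = []
    for directory in glob.glob(dir_pattern):
        paths.extend(glob.glob(os.path.join(directory, "*.docx")))
    return sorted(paths)


def write_paragraphs(paths, out_path):
    """Write the non-empty paragraphs of each document to one text file."""
    # python-docx is imported here so that find_docx_files works without it.
    import docx

    # The output file is opened once, before looping over the documents.
    with open(out_path, "w", encoding="utf-8") as f:
        for path in paths:
            document = docx.Document(path)
            for paragraph in document.paragraphs:
                if paragraph.text:
                    f.write("%s\n" % paragraph.text)
```

Run from inside the EXTRACTION folder, write_paragraphs(find_docx_files('fi*'), 'your_file.txt') would then collect everything into a single file.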

Answer 2

You can use glob.glob with recursive=True to get a list of all the files in the subdirectories:

import glob

import docx

# Recursively collect every .docx file under the main folder.
# python-docx cannot read legacy .doc files, so only *.docx is matched.
files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)

with open('your_file.txt', 'w') as f:
    for file in files:
        document = docx.Document(file)
        # Write each non-empty paragraph on its own line.
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
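If "a new txt file" in the question means one text file per document rather than one combined file, a variation could look like the sketch below. This reading of the question is an assumption, and txt_path_for and convert_tree are hypothetical helper names, not part of python-docx.

```python
from pathlib import Path


def txt_path_for(docx_path):
    """Return the path of the .txt file that sits next to docx_path."""
    return Path(docx_path).with_suffix(".txt")


def convert_tree(root):
    """Write one .txt file beside every .docx file found under root."""
    import docx  # python-docx, assumed installed; imported lazily

    written = []
    for docx_path in sorted(Path(root).rglob("*.docx")):
        document = docx.Document(str(docx_path))
        # Keep only the non-empty paragraphs, one per line.
        lines = [p.text for p in document.paragraphs if p.text]
        target = txt_path_for(docx_path)
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_tree('EXTRACTION') would then leave, for example, a report.txt next to each report.docx.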

Answer 1: score 2, posted 2020-01-27 10:09:39

Answer 2: score 1, accepted, posted 2020-01-27 10:24:42
