
How to extract text from docx files contained in different folders

I am writing code to extract text from Word documents with the .docx extension. I have a big folder named "EXTRACTION" that contains several sub-folders (for example: folder 1, 2, 3, etc.), and each sub-folder contains 2 to 10 docx documents. I want to extract the text from each of those files and put it in a new txt file.

I started writing this code, but it is not working (second version of the code):

import os
import glob
import docx



print(os.getcwd())

dirs = dirs = glob.glob('fi*')
path = os.getcwd()

for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)

This code does not seem to work. Could you help me improve it?


In your code, directory is just a string, so for filename in directory simply loops over its characters: f, i, c, h, i, e, r, etc.
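To make the difference concrete, here is a small sketch. The helper names chars_of and entries_of are made up for illustration, and the folder name fichier1 is only an assumption based on the letters listed above.

```python
import glob
import os


def chars_of(directory):
    """What `for filename in directory` yields: the string's characters."""
    return [ch for ch in directory]


def entries_of(directory):
    """What was intended: the paths of the entries inside the directory."""
    return sorted(glob.glob(os.path.join(directory, "*")))


# Iterating over the string never touches the file system.
print(chars_of("fichier1"))
```

Calling entries_of on a real folder returns full paths such as the folder name joined with each file name, which is what docx.Document needs.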

Also, you were overwriting your_file.txt on each iteration. You want to open it once, then loop over the documents you extract from.
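The overwriting problem can be reproduced without any Word files. This is only an illustrative sketch; write_reopening, write_once and read_text are hypothetical helper names, and the strings stand in for extracted paragraphs.

```python
import os
import tempfile


def write_reopening(path, chunks):
    """Reopen with mode 'w' for every chunk: each open truncates the file."""
    for chunk in chunks:
        with open(path, "w") as f:
            f.write(chunk + "\n")


def write_once(path, chunks):
    """Open once, then write every chunk through the same handle."""
    with open(path, "w") as f:
        for chunk in chunks:
            f.write(chunk + "\n")


def read_text(path):
    """Return the whole file as a string."""
    with open(path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.txt")
    write_reopening(target, ["first", "second"])
    print(repr(read_text(target)))  # only the last chunk survives
    write_once(target, ["first", "second"])
    print(repr(read_text(target)))  # both chunks are kept
```

Opening with mode 'a' inside the loop would also keep everything, but it appends to whatever a previous run left in the file, so opening once with 'w' is usually the safer choice.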

import glob
import os

import docx

with open('your_file.txt', 'w') as f:
    for directory in glob.glob('fi*'):
        for filename in glob.glob(os.path.join(directory, "*")):
            # python-docx reads only .docx; legacy .doc files are skipped
            if filename.endswith(".docx"):
                document = docx.Document(filename)
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        f.write("%s\n" % paragraph.text)

Your original code wrote item here without ever defining it. I can't guess what you hoped that variable would contain, so this version writes paragraph.text, as the second version of your code does.

You can use glob.glob with recursive=True to get a list of all the files in the subdirectories:

import glob

import docx

# python-docx reads only .docx, so legacy .doc files are not collected
files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)

with open('your_file.txt', 'w') as f:
    for file in files:
        document = docx.Document(file)
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
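If some of the documents may be open in Word while the script runs, the recursive search can also pick up Word's temporary lock files, whose names start with "~$" and which are not valid documents. A possible sketch of the file-collection step that filters them out is below; find_docx is a hypothetical helper name, not part of glob or python-docx.

```python
import glob
import os


def find_docx(root):
    """Return every .docx path under root, at any depth, in sorted order."""
    pattern = os.path.join(root, "**", "*.docx")
    paths = glob.glob(pattern, recursive=True)
    # Skip Word's temporary lock files such as "~$report.docx".
    return sorted(
        p for p in paths if not os.path.basename(p).startswith("~$")
    )
```

The returned list can be used in place of files in the loop above. Sorting is optional, but it makes the order of the text in the output file reproducible between runs.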
