How to extract text from docx files contained in different folders
I am writing code to extract text from Word documents with the .docx extension. I have a large folder named "EXTRACTION" that contains several sub-folders (for example, folders 1, 2, 3, etc.), and each sub-folder contains between 2 and 10 docx documents. I want to extract the text from each of those files and put it in a new txt file.
I started writing this code, but it is not working (second version of the code):
import os
import glob
import docx
print(os.getcwd())
dirs = glob.glob('fi*')
path = os.getcwd()
for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)
This code does not seem to work. Could you help me improve it?
In your code, directory is just a string, so for filename in directory simply loops over its characters: f, i, c, h, i, e, r, etc.
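A quick way to see the difference is the small check below. The folder name fichier1 is made up for illustration; the point is only that iterating over a string gives characters, while asking the file system gives the entries inside the folder.

```python
import os
import tempfile

directory = "fichier1"  # hypothetical folder name, for illustration only

# Iterating over a string yields its characters, not the files in the folder.
chars = [ch for ch in directory]
print(chars)  # ['f', 'i', 'c', 'h', 'i', 'e', 'r', '1']

# To get the folder's contents, ask the file system instead.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, directory)
    os.mkdir(folder)
    open(os.path.join(folder, "a.docx"), "w").close()
    entries = os.listdir(folder)
    print(entries)  # ['a.docx']
```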
Also, you were overwriting your_file.txt on each iteration. You want to open it once, then loop over the documents you extract from.
import glob
import os
import docx
# Open the output file once, outside the loops, so nothing gets overwritten.
with open('your_file.txt', 'w') as f:
    for directory in glob.glob('fi*'):
        for filename in glob.glob(os.path.join(directory, "*")):
            if filename.endswith((".docx", ".doc")):
                document = docx.Document(filename)
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % item)  # item is never defined; see the note that follows
You are using item without declaring it, so there is still a bug here. I can't guess what you hoped this variable would contain, so I just left it the way it was in your original code. (The second version shown in the question writes paragraph.text at this point.)
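For reference, here is a minimal sketch of the same loop with the written value filled in as paragraph.text, on the assumption that this is what was intended. The names collect_text and read_docx_paragraphs, and the read_paragraphs parameter, are introduced here for illustration and are not part of the original code. The sketch matches only .docx, since python-docx does not read legacy .doc files.

```python
import glob
import os


def read_docx_paragraphs(filename):
    """Return the non-empty paragraph texts of a .docx file (needs python-docx)."""
    import docx  # imported lazily so the file-walking logic can be tried without it

    return [p.text for p in docx.Document(filename).paragraphs if p.text]


def collect_text(pattern, out_path, read_paragraphs=read_docx_paragraphs):
    """Write the paragraphs of every .docx in folders matching `pattern` to one file.

    Returns the number of documents processed.
    """
    count = 0
    # Open the output once so later documents do not overwrite earlier ones.
    with open(out_path, "w", encoding="utf-8") as f:
        for directory in sorted(glob.glob(pattern)):
            for filename in sorted(glob.glob(os.path.join(directory, "*.docx"))):
                for text in read_paragraphs(filename):
                    f.write("%s\n" % text)
                count += 1
    return count
```

With python-docx installed, a call such as collect_text('fi*', 'your_file.txt') would mirror the loop above. Passing a different read_paragraphs callable makes it possible to try the folder traversal on its own.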
You can use glob.glob to get a list of all files from subdirectories:
import glob
import docx
# Collect every matching file under the main folder, at any depth.
files = [file for pattern in ('*.doc', '*.docx')
         for file in glob.glob('/path/to/mainfolder/**/{}'.format(pattern), recursive=True)]
with open('your_file.txt', 'w') as f:
    for file in files:
        document = docx.Document(file)  # use the loop variable, not an undefined filename
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
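To see what the ** pattern matches when recursive=True is set, here is a small self-contained check on a throwaway directory tree. The folder and file names are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: root/1/a.docx, root/2/deep/b.docx, root/2/notes.txt
    os.makedirs(os.path.join(root, "1"))
    os.makedirs(os.path.join(root, "2", "deep"))
    for rel in ("1/a.docx", "2/deep/b.docx", "2/notes.txt"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    # '**' matches zero or more directory levels when recursive=True.
    matches = glob.glob(os.path.join(root, "**", "*.docx"), recursive=True)
    found = sorted(os.path.basename(m) for m in matches)
    print(found)  # ['a.docx', 'b.docx']
```

Both files are found even though they sit at different depths, and notes.txt is skipped because it does not match the *.docx pattern.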