简体   繁体   中英

how to make list of files in directory and process them one by one? - Python

i want to make a list of all text files in a directory. then i want to create separate list of the contents in each file. eg document1=[] and then document2=[] so on. and then by using document 1 and document 2 keywords i want to calculate term frequency and other processes. code is running but list cant be assigned different names as document1 and so on.

import glob
import math
import re

a=0
flist=glob.glob(r'D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt') #get all the files from the d`#open each file >> tokenize the content >> and store it in a set
for fname in flist:         
    tfile=open(fname,"r")
    line=tfile.read()
    a+=1
    line = line.lower() # lowercase
    line = re.sub("</?.*?>"," <> ",line) #remove tags
    line = re.sub("(\\d|\\W)+"," ",line)  # remove special characters and digits
    l_ist = line.split("\n")
    print 'document'
    print(l_ist)
tfile.close() # close the file
print"Number of documents:"
print(a)

You can assign the list you create in each iteration to a dict indexed by the file name:

import glob
import math
import re

flist=glob.glob(r'D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt') #get all the files from the d`#open each file >> tokenize the content >> and store it in a set
content = {}
for fname in flist:         
    tfile=open(fname,"r")
    line=tfile.read()
    line = line.lower() # lowercase
    line = re.sub("</?.*?>"," <> ",line) #remove tags
    line = re.sub("(\\d|\\W)+"," ",line)  # remove special characters and digits
    l_ist = line.split("\n")
    print 'document'
    print(l_ist)
    content[fname] = l_lst
tfile.close() # close the file
print("Number of documents:")
print(len(content))
print(content) # to verify the content of the entire dict

转到这里 ,我相信不要仅仅给出文本文件名,而要给出目录路径以及名称结构,对于“ document1,document2 ...”,请使用循环,或者如果设置了文件文件数,请使用它们。

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM