I am new to Python and am trying to read a dataset of .txt files stored in multiple folder hierarchies. The structure of the folders is
-Folder1
    -Category1_Folder
        -file1.txt
    -Category2_Folder
        -file1.txt
        -file2.txt and so on...
The categories are significant: I need to be able to identify which category each file comes from. I then need to remove stop words and perform feature extraction with TF-IDF. What is the easiest way to do something like this?
I recommend os.walk.
If you have directories like:
project/
- folder1/
    - file1.png
    - file2.jpg
- folder2/
    - file3.zip
Then example code, run from inside project/, is:
    import os
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):  # getcwd() returns the current working directory
        print(dirpath, dirnames, filenames)
The output is:
/project ['folder1', 'folder2'] []
/project/folder1 [] ['file1.png', 'file2.jpg']
/project/folder2 [] ['file3.zip']
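If you need both the folder name and the file names, loop over filenames inside the walk. The files listed in filenames live in dirpath, not in the entries of dirnames:
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        category = os.path.basename(dirpath)  # name of the folder that holds these files
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # read the file at path, record it under category
            # and so on..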
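Applied to the layout in the question, the same idea can be sketched as follows. This is only a minimal sketch under some assumptions: the function names load_corpus and tfidf_features are made up for illustration, files are assumed to be UTF-8 text, each file's category is taken to be the name of the folder that directly contains it, and the TF-IDF step assumes scikit-learn is installed and that its built-in English stop word list is acceptable.

```python
import os


def load_corpus(root):
    """Walk root and return (texts, labels) for every .txt file found.

    Assumes the layout from the question: root/<category folder>/<file>.txt,
    so the name of the folder holding a file is used as its category label.
    """
    texts, labels = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue  # skip anything that is not a .txt file
            path = os.path.join(dirpath, filename)
            # errors="replace" is an assumption: undecodable bytes are
            # replaced instead of raising an exception
            with open(path, encoding="utf-8", errors="replace") as handle:
                texts.append(handle.read())
            labels.append(os.path.basename(dirpath))
    return texts, labels


def tfidf_features(texts):
    """Return (matrix, vectorizer); requires scikit-learn to be installed."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    # stop_words="english" removes scikit-learn's built-in English stop words
    vectorizer = TfidfVectorizer(stop_words="english")
    return vectorizer.fit_transform(texts), vectorizer
```

Calling load_corpus("Folder1") would give two parallel lists, so labels[i] is the category of texts[i]. Passing texts to tfidf_features then yields a sparse matrix with one row per file, in the same order as labels.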