How can I use Python to walk through files in directories and output a pandas data frame given certain constraints?
So I'm using Python, and I have a parent directory with two child directories, which in turn contain many directories, each with three files. I want to take the third file (which is a .CSV file) from each of these directories and parse them together into a pandas dataframe.
This is the code I have so far:
import os

rootdir = 'C:\\Dir\\Dir\\Dir\\root(parent)dir'
# os.listdir(rootdir)
# os.getcwd()
filelist = os.listdir(rootdir)
# file_count = len(filelist)

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        # if files.startswith('C74'):
        for name in files:
            r.append(os.path.join(root, name))
    return r

filelist = list_files(rootdir)
Now with "filelist" I get all file paths contained in all directories as strings. Now I need to:
1. Find the file names that begin with three specific letters (for example funtest, where the first three letters are fun).
2. Take every third file and construct a pandas dataframe from those, so that I can proceed with data analysis.
IIUC, we can do this much more easily with the recursive glob from pathlib:
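from pathlib import Path
import pandas as pd

# recursively collect every CSV under parent_dir whose name contains 'C74'
csv = list(Path(r'parent_dir').rglob('*C74*.csv'))
# read each file and stack them into a single dataframe
df = pd.concat([pd.read_csv(f) for f in csv])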
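As a variation on the snippet above, here is a minimal sketch that wraps the same idea in a function. The function name `load_csvs`, the `source_file` column, and the default pattern are assumptions made for illustration; they are not part of the original answer. It resets the index while concatenating and records which file each row came from.

```python
from pathlib import Path

import pandas as pd


def load_csvs(parent_dir, pattern="*C74*.csv"):
    """Read every CSV under parent_dir matching pattern into one DataFrame.

    Hypothetical helper: the name, the default pattern and the
    'source_file' column are illustrative choices.
    """
    # rglob searches parent_dir and all of its subdirectories
    paths = sorted(Path(parent_dir).rglob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} under {parent_dir}")
    # tag each frame with its origin so rows stay traceable after concat
    frames = [pd.read_csv(p).assign(source_file=str(p)) for p in paths]
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_csvs(r'parent_dir', 'fun*.csv')` would, under these assumptions, pick up only files whose names begin with fun.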
If you want to subset your list again, you could do:
subset_list = [x for x in csv if 'abc' in x.stem]
[x.name for x in csv if 'abc' in x.stem]
out : ['C74_abc.csv', 'abc_C74.csv']
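The question also asks for the third file of each directory. A rough sketch of one way to do that follows, assuming "third" means third when the files in a directory are sorted by name, and that the wanted file is a CSV whose name starts with a given prefix. The function name `third_csv_per_dir` and the default prefix `fun` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def third_csv_per_dir(parent_dir, prefix="fun"):
    """Collect the third file (by sorted name) of each subdirectory.

    Assumption: a file is kept only if it is a CSV whose name starts
    with prefix. Directories with fewer than three files are skipped.
    """
    picked = []
    # visit every directory below parent_dir, at any depth
    for folder in sorted(p for p in Path(parent_dir).rglob("*") if p.is_dir()):
        files = sorted(f for f in folder.iterdir() if f.is_file())
        if len(files) < 3:
            continue
        third = files[2]
        if third.suffix.lower() == ".csv" and third.name.startswith(prefix):
            picked.append(third)
    return picked


def load_third_csvs(parent_dir, prefix="fun"):
    """Concatenate the picked files into a single DataFrame."""
    paths = third_csv_per_dir(parent_dir, prefix)
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

If the three files in each directory do not sort so that the CSV comes last, matching on the name pattern alone (as in the rglob approach) is probably the safer choice.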