

What's a faster and more memory-efficient way to read_csv a subset of files from a directory based upon a date pattern in their filename?

The code I currently have:

import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
path = r'/Desktop/files'
all_files = glob.glob(path + "/file*")
d_list = pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime("%Y-%m-%d").tolist()

list1 = []

# For every date, rescan the entire file list for a matching name.
for i in d_list:
    for filename in all_files:
        if i in filename:
            df = pd.read_csv(filename, sep='|', usecols=cols)
            list1.append(df)

data = pd.concat(list1, axis=0, ignore_index=True)

This code takes a very long time to run, and I assume I don't have enough memory. Is there another way to make it faster? If anyone knows how I could use dask.dataframe for this, and whether it would help while also preserving the columns' original data types, please let me know.

Thanks!

Try the following with dask:

import glob
import dask.dataframe as dd

cols = ['X', 'Y', 'Z', 'W', 'A']  # columns to keep, as in the question

# Narrow the glob pattern itself so the files are matched once,
# rather than looping through a list of dates 10x for every file.
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')

df = dd.concat([dd.read_csv(f, sep='|', usecols=cols) for f in all_files])
# df1 = df.compute()  # returns a pandas DataFrame from the dask DataFrame

The syntax is basically the same in pandas:

import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
df = pd.concat([pd.read_csv(f, sep='|', usecols=cols) for f in all_files])
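If the date range is irregular and cannot be expressed as a single glob pattern, the nested loop in the question can still be avoided: build the set of wanted dates once, then test each filename once, turning O(dates × files) into O(files). A runnable sketch under assumed sample data (the temp directory, filenames, and columns are illustrative, not from the original post):

```python
import glob
import os
import re
import tempfile

import pandas as pd

cols = ['X', 'Y']

# Sample files: two inside the date range, one outside it.
tmp = tempfile.mkdtemp()
for day in ('2019-09-01', '2020-09-09', '2021-01-01'):
    pd.DataFrame({'X': [1], 'Y': [2]}).to_csv(
        os.path.join(tmp, f'file_{day}.csv'), sep='|', index=False)

# Build the date set once; set membership checks are O(1) per filename.
wanted = set(pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime('%Y-%m-%d'))
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

selected = [f for f in glob.glob(os.path.join(tmp, 'file*'))
            if (m := pattern.search(os.path.basename(f))) and m.group() in wanted]

data = pd.concat((pd.read_csv(f, sep='|', usecols=cols) for f in selected),
                 ignore_index=True)
```

This reads each matching file exactly once, whereas the original double loop rescans the full file list for every date in the range.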


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 