

What's a faster and more memory-efficient way to read_csv a subset of files from a directory based upon a date pattern in their filename?

The code I currently have:

import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
path = r'/Desktop/files'
all_files = glob.glob(path + "/file*")
d_list = pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime("%Y-%m-%d").tolist()

list1 = []

# For every date, rescan the entire file list for a matching name.
for i in d_list:
    for filename in all_files:
        if i in filename:
            df = pd.read_csv(filename, sep='|', usecols=cols)
            list1.append(df)

data = pd.concat(list1, axis=0, ignore_index=True)

This code takes a very long time to run, and I assume I don't have enough memory. Is there another way to make it faster? If anyone knows how I could use dask.dataframe for this, and whether it would help while also preserving the columns' original data types, please let me know.

Thanks!

Try the following with dask:

import glob
import dask.dataframe as dd

cols = ['X', 'Y', 'Z', 'W', 'A']  # columns to keep, as in the question

# Narrow the glob pattern itself so the files are matched once,
# rather than looping through a list of dates 10x for every file.
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')

df = dd.concat([dd.read_csv(f, sep='|', usecols=cols) for f in all_files])
# df1 = df.compute()  # returns a pandas DataFrame from the dask DataFrame

The syntax is basically the same in pandas:

import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
df = pd.concat([pd.read_csv(f, sep='|', usecols=cols) for f in all_files])
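If the date range is irregular and cannot be expressed as a single glob pattern, the nested loop in the question can still be avoided: build the set of wanted dates once, then test each filename once, turning O(dates × files) into O(files). A runnable sketch under assumed sample data (the temp directory, filenames, and columns are illustrative, not from the original post):

```python
import glob
import os
import re
import tempfile

import pandas as pd

cols = ['X', 'Y']

# Sample files: two inside the date range, one outside it.
tmp = tempfile.mkdtemp()
for day in ('2019-09-01', '2020-09-09', '2021-01-01'):
    pd.DataFrame({'X': [1], 'Y': [2]}).to_csv(
        os.path.join(tmp, f'file_{day}.csv'), sep='|', index=False)

# Build the date set once; set membership checks are O(1) per filename.
wanted = set(pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime('%Y-%m-%d'))
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

selected = [f for f in glob.glob(os.path.join(tmp, 'file*'))
            if (m := pattern.search(os.path.basename(f))) and m.group() in wanted]

data = pd.concat((pd.read_csv(f, sep='|', usecols=cols) for f in selected),
                 ignore_index=True)
```

This reads each matching file exactly once, whereas the original double loop rescans the full file list for every date in the range.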


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 