What is a more efficient way to load 1 column with 1,000,000+ rows than pandas read_csv()?
What's a faster and more memory-efficient way to read_csv a subset of files from a directory, based on a date pattern in their filenames?
The code I currently have:
import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
path = r'/Desktop/files'
all_files = glob.glob(path + "/file*")

# Build the list of date strings to look for in each filename
d_list = pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime("%Y-%m-%d").tolist()

list1 = []
for i in d_list:
    for filename in all_files:
        if i in filename:
            df = pd.read_csv(filename, sep='|', usecols=cols)
            list1.append(df)
data = pd.concat(list1, axis=0, ignore_index=True)
This code takes a very long time to run, and I assume I don't have enough memory. Is there another way to make it faster? If anyone knows how I could use dask.dataframe here, and whether it would help, while also preserving the variables' original data types, please let me know.
Thanks!
Try the following with dask:
import glob
import dask.dataframe as dd

cols = ['X', 'Y', 'Z', 'W', 'A']

# Put the date pattern in the glob itself so the files are matched in one
# pass, rather than looping through a list of dates once per file.
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
df = dd.concat([dd.read_csv(f, sep='|', usecols=cols) for f in all_files])
# df1 = df.compute()  # materializes the dask dataframe as a pandas dataframe
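The question also asks about preserving the original data types. Both pandas and dask's read_csv accept a dtype mapping, which stops dask from inferring types separately per partition. A minimal sketch; the column types below are hypothetical placeholders, not taken from the original post:

import glob
import dask.dataframe as dd

# Hypothetical dtype mapping -- replace with your columns' actual types.
dtypes = {'X': 'int64', 'Y': 'float64', 'Z': 'object', 'W': 'float64', 'A': 'object'}

all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
# Passing dtype= avoids per-partition type inference, which can otherwise
# raise mismatched-dtype errors when you later call compute().
df = dd.concat([dd.read_csv(f, sep='|', usecols=list(dtypes), dtype=dtypes)
                for f in all_files])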
The syntax is essentially the same with pandas:
import glob
import pandas as pd

all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
df = pd.concat([pd.read_csv(f, sep='|', usecols=cols) for f in all_files])
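The single glob above only covers part of the 2019-09-01 to 2020-09-09 range in the question. One way to cover the full range without checking every date against every file is a set lookup over one pass of the directory listing; a sketch, assuming each filename contains the date as YYYY-MM-DD (that pattern is an assumption -- adjust it to your naming scheme):

import re
import glob
import pandas as pd

cols = ['X', 'Y', 'Z', 'W', 'A']
wanted = set(pd.date_range('2019-09-01', '2020-09-09', freq='D').strftime('%Y-%m-%d'))

date_re = re.compile(r'\d{4}-\d{2}-\d{2}')  # assumes YYYY-MM-DD appears in the name
selected = []
for f in glob.glob(r'/Desktop/files/file*'):
    m = date_re.search(f)
    if m and m.group() in wanted:
        selected.append(f)

df = pd.concat((pd.read_csv(f, sep='|', usecols=cols) for f in selected),
               ignore_index=True)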