I have a 64-bit, 4-core, 2.50 GHz system with 64 GB of RAM, of which about 13 GB is free. I am trying to read 24 CSV files with around 40 million rows using the code below:
import os
import pandas as pd

def test():
    test = pd.DataFrame()
    rootdir = '/XYZ/A'
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            df = pd.read_csv(os.path.join(subdir, file), low_memory=False)
            test = pd.concat([test, df])
    return test
How can I optimize this to run faster without the kernel dying? Should I implement this in PySpark instead? Please let me know if I have missed any detail.
Have a go at this. I used the pathlib module, since it offers more succinct and clearer code IMHO, and because you can take advantage of iterators and generator expressions:
from pathlib import Path
import pandas as pd

def test():
    rootdir = '/XYZ/A'
    # assumption is that they are all CSVs;
    # if not, you could just use rglob('*.*')
    # this will recursively search through the directory
    # and pull all files with the extension .csv
    # (or all files if you use '*.*',
    # which might be a bit more computationally intensive)
    all_files = Path(rootdir).rglob('*.csv')
    # kindly test this part:
    # stem gets you the file name without the '.csv' suffix
    # and returns a string;
    # rsplit('_')[-1] takes the piece after the last '_'
    all_dfs = (pd.read_csv(f)
               .assign(Date=f.stem.rsplit('_')[-1])
               for f in all_files)
    # trying to delay the intense computation till it gets here,
    # hence the use of generator expressions
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df
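If memory is still the bottleneck, most of the gain usually comes from reading less per file. A minimal sketch, assuming hypothetical column names and dtypes that you would replace with the ones actually in your files:

from pathlib import Path
import pandas as pd

rootdir = '/XYZ/A'

# usecols keeps only the columns you actually need;
# dtype stops pandas from upcasting everything to 64-bit / object.
# 'col_a', 'col_b', 'col_c' and their dtypes are placeholders --
# substitute the real column names and types from your data.
read_kwargs = dict(
    usecols=['col_a', 'col_b', 'col_c'],
    dtype={'col_a': 'int32', 'col_b': 'float32', 'col_c': 'category'},
)

dfs = (pd.read_csv(f, **read_kwargs) for f in Path(rootdir).rglob('*.csv'))
final_df = pd.concat(dfs, ignore_index=True)

The idea is the same as above (one concat at the end instead of concatenating inside a loop), just with smaller frames going into it.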
Let us know how it goes; if it fails, I'll take it off so as not to confuse others.
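On the PySpark part of the question: you only really need Spark if the concatenated result won't fit in memory even after trimming columns and dtypes. If you do want to try it, a rough sketch (assuming a local PySpark installation; the path is the same placeholder as above) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_read").getOrCreate()

# Spark reads the matching CSVs lazily and in parallel;
# header/inferSchema depend on your files.
# This glob assumes the CSVs sit directly under /XYZ/A --
# adjust the pattern if they are nested in subdirectories.
sdf = spark.read.csv('/XYZ/A/*.csv', header=True, inferSchema=True)

# keep filtering/aggregation inside Spark, and only convert a
# reduced result back to pandas with sdf.toPandas() if it fits.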