
Optimizing os.walk() iterations on a 64-bit, 4-core, 64 GB, 2.50 GHz system

I have a 64-bit, 4-core, 2.50 GHz, 64 GB system with 13 GB of free memory. I am trying to read 24 CSV files with around 40 million rows in total, using the code below:

import os
import pandas as pd

def test():
    test = pd.DataFrame()
    rootdir = '/XYZ/A'
    # walk the whole tree and append every file to one big DataFrame
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            df = pd.read_csv(os.path.join(subdir, file), low_memory=False)
            test = pd.concat([test, df])
    return test

How can I optimize this to run faster without the kernel dying? Should I be implementing this in PySpark instead? Please let me know if I missed any detail.
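(For context on why the kernel dies: the pd.concat inside the loop copies everything read so far on every iteration, so runtime and peak memory grow with the square of the number of files. A minimal sketch of the usual fix, collecting the frames in a list and concatenating once over the same rootdir, would look something like this:

import os
import pandas as pd

def read_all():
    rootdir = '/XYZ/A'
    frames = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # read each file once and keep it in a plain Python list
            frames.append(pd.read_csv(os.path.join(subdir, file),
                                      low_memory=False))
    # a single concat at the end avoids re-copying the growing
    # DataFrame on every iteration
    return pd.concat(frames, ignore_index=True)

The answer below takes the same single-concat approach, using pathlib and a generator expression.)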

Have a go at this. I used the pathlib module, since it offers more succinct and clearer code IMHO, and because you can take advantage of iterators and generator expressions:

import pandas as pd
from pathlib import Path

def test():
    rootdir = '/XYZ/A'
    # assumption is that they are all CSVs;
    # if not, you could just use rglob('*.*')
    # this will recursively search through the directory
    # and pull all files with the .csv extension
    # (or all files if you use '*.*',
    # which might be a bit more computationally intensive)
    all_files = Path(rootdir).rglob('*.csv')
    all_dfs = (pd.read_csv(f)
               # kindly test this aspect and see:
               # stem gets you the filename before '.csv'
               # and returns a string;
               # rsplit('_')[-1] takes the part after the last '_'
               .assign(Date=f.stem.rsplit('_')[-1])
               for f in all_files)
    # trying to delay the intense computation till it gets here,
    # hence the use of generator expressions
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df

Let us know how it goes; if it fails, I'll take it off so as not to confuse others.
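(On the PySpark part of the question: a rough sketch of the equivalent read is below. It is not a drop-in replacement; it assumes a local Spark installation (3.0+ for recursiveFileLookup) and that every file under /XYZ/A is a CSV with a header row.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_merge").getOrCreate()

df = (spark.read
      .option("header", True)               # first row is the header
      .option("recursiveFileLookup", True)  # walk subdirectories (Spark 3.0+)
      .csv("/XYZ/A"))

print(df.count())  # Spark is lazy; this triggers the actual read

Spark reads the files in parallel and does not need the whole dataset in memory at once, but for 40 million rows that fit in RAM after a single pd.concat, plain pandas as shown above is usually sufficient.)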
