Concatenating a large number of CSV files (30,000) in Python Pandas

I'm using the following function to concatenate a large number of CSV files:

import pandas as pd

def concatenate():
    files = sort() # sort() is my own helper; it returns an array of filenames
    merged = pd.DataFrame()
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'): # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:,'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:,'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:,'Timelimit'] = "1"
            else:
                f.loc[:,'Timelimit'] = "0"
            merged = pd.concat([merged, f], axis=0)
    merged.to_csv('merged.csv')

The problem with this function is that it doesn't handle large numbers of files (30,000) well. A sample of 100 files finishes properly, but with all 30,000 files the script slows down and eventually crashes.

How can I handle large numbers of files better in Python Pandas?

Make a list of DataFrames first, then concatenate them in a single call:

import pandas as pd

def concatenate():
    files = sort() # sort() is the asker's helper; it returns an array of filenames
    df_list = [] # collect the individual DataFrames here instead of growing one
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'): # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:,'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:,'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:,'Timelimit'] = "1"
            else:
                f.loc[:,'Timelimit'] = "0"
            df_list.append(f)
    merged = pd.concat(df_list, axis=0)
    merged.to_csv('merged.csv')

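What you're doing is growing your DataFrame incrementally by concatenating inside the loop. Each pd.concat call builds a new DataFrame and copies every row accumulated so far, so the total work grows roughly quadratically with the number of files. It is more efficient to collect the DataFrames in a list and concatenate them all in one go.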
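As a rough, self-contained illustration of the collect-then-concatenate approach, here is a minimal sketch. The helper name load_one, the use of os.path.basename, and the demo files written to a temporary directory are assumptions made for the example, not part of the original code; the filename convention (veh…_year…_…FulltimeSimpleOpt.csv) is taken from the question.

```python
import os
import tempfile

import pandas as pd


def load_one(path):
    """Read one CSV and tag it with metadata parsed from its filename.

    Hypothetical helper: the original code does this inline in the loop.
    """
    name = os.path.basename(path)  # assumption: parse the bare filename only
    parts = name.split("_")
    frame = pd.read_csv(path, index_col=0)
    return frame.assign(
        Vehicle=parts[0].replace("veh", ""),
        Year=parts[1].replace("year", ""),
        Timelimit="1" if "timelimit" in name else "0",
    )


def concatenate(paths):
    """Collect every matching frame in a list, then call pd.concat once."""
    frames = [load_one(p) for p in paths if p.endswith("FulltimeSimpleOpt.csv")]
    return pd.concat(frames, axis=0)


# Tiny demo with made-up files in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_names = [
    "veh1_year2010_FulltimeSimpleOpt.csv",
    "veh2_year2011_timelimit_FulltimeSimpleOpt.csv",
    "veh3_year2012_Other.csv",  # skipped: filename does not match the suffix
]
for demo_name in demo_names:
    pd.DataFrame({"cost": [1.0, 2.0]}).to_csv(os.path.join(demo_dir, demo_name))

merged = concatenate(sorted(os.path.join(demo_dir, n) for n in demo_names))
print(merged)
```

Note that pd.concat raises a ValueError when given an empty list, so if no filename matches the suffix you would need to handle that case before concatenating.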
