
Using Pandas to concatenate CSV files in directory, recursively

Here is a link from a previous post. I am citing PR's response below.

    import pandas as pd
    import glob
    interesting_files = glob.glob("*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)

    full_df.to_csv('output.csv')

I am wondering how to modify the above using pandas. Specifically, I am trying to move recursively through a directory, concatenate all of the CSV headers and their respective row values, and write the result out to one file. Using PR's approach results in all of the headers and their corresponding values being stacked on top of each other. My constraints are:

  • Writing out the headers and their corresponding values (without "stacking") - essentially concatenated one after the other

  • If the column headers in one file match those in another file, there should be no repetition. Only the values should be appended as they are written to the single CSV file.

  • Since each file has different column headers and a different number of them, all of the headers should be included. Nothing should be deleted.

I have tried the following as well:

import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]

for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')

Here are two sample CSV files:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors
12821767,Query,,,,,,,,,,,

and

Type,ID,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,ISO,MID,Pass,TID,CID,Errors
UMember,12822909,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,,,,,,

Based on the two examples above, the output should be something along the lines of:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
12821767,Query ,,,,,,,,,,,,,,,,,,,,,,,,, etc.

(in the row that comes from the first sample, every column that exists only in the second sample should be left empty, i.e. filled in with just the ',' delimiter)

As one can see, the second sample has more column headers. Moreover, some of the headers are the same (but in a different order). I am trying to combine all of these, along with their values, following the above requirements. Is the best method to merge, or to apply a custom function on top of a built-in pandas method?

A non-pandas approach that uses an OrderedDict and the csv module:

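from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
header = OrderedDict()  # union of all column names, in first-seen order
data = []
for filename in files:
    # Python 3: open CSV files in text mode with newline=''
    with open(filename, newline='') as fin:
        csvin = csv.DictReader(fin)
        try:
            header.update(OrderedDict.fromkeys(csvin.fieldnames))
            data.append(next(csvin))  # keeps only the first data row of each file
        except TypeError:
            # fieldnames is None when the file has no header line
            print(filename, 'was empty')
        except StopIteration:
            print(filename, "didn't contain a row")

with open('output_filename.csv', 'w', newline='') as fout:
    # columns missing from a row are written as empty fields
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)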

Given your example input, this gives you:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
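
The csv-module approach above reads only the first data row of each file and only looks in the current directory. The following is a minimal sketch of how it might be extended to walk subdirectories and keep every row, using only the standard library. The function name merge_csv_tree and its parameters are hypothetical, and it assumes every row has no more fields than its file's header.

```python
import csv
import os


def merge_csv_tree(root, output_path):
    """Merge every CSV under root into output_path; return the merged header."""
    header = {}  # plain dicts keep insertion order (Python 3.7+)
    rows = []
    out_abs = os.path.abspath(output_path)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # skip non-CSV files and a previously written output file
            if not name.lower().endswith('.csv') or os.path.abspath(full) == out_abs:
                continue
            with open(full, newline='') as fin:
                reader = csv.DictReader(fin)
                # fieldnames is None for a completely empty file
                header.update(dict.fromkeys(reader.fieldnames or []))
                rows.extend(reader)
    with open(output_path, 'w', newline='') as fout:
        # columns missing from a row are written as empty fields
        writer = csv.DictWriter(fout, fieldnames=list(header))
        writer.writeheader()
        writer.writerows(rows)
    return list(header)
```

Called as merge_csv_tree('.', 'output_filename.csv'), this would write one header line containing each distinct column name once, followed by all rows from all files.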

In pandas, you can both append column names and reorder the data frame easily. See this article on merging frames.

To append frames and reorder them you could use the following. Re-indexing is as simple as using a list. There are more solutions here.

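import os
import pandas as pd

path = '.'  # directory containing the CSV files
dfList = []
for filename in [os.path.join(path, x) for x in os.listdir(path) if x.endswith('.csv')]:
    dfList.append(pd.read_csv(filename))
df = pd.concat(dfList)  # columns are the union of all input columns
df.to_csv('out.csv', mode='w')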

With list comprehension, this would be:

import os
import pandas as pd

path = '.'  # directory containing the CSV files
pd.concat([pd.read_csv(os.path.join(path, x)) for x in os.listdir(path) if x.endswith('.csv')]).to_csv('out.csv', mode='w')
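
Neither snippet above descends into subdirectories, which the question asks for. A minimal sketch of a recursive variant follows, assuming pandas is installed and every matched file is a non-empty CSV (pd.read_csv raises on an empty file, and pd.concat raises if no files are found). The function name concat_csv_tree is hypothetical. Reading with dtype=str is a choice made here so that values such as True or 12822909 are kept as text.

```python
import glob
import os

import pandas as pd


def concat_csv_tree(root):
    """Return one DataFrame with the union of columns from all CSVs under root."""
    # '**' with recursive=True matches root itself and any nested directory
    pattern = os.path.join(root, '**', '*.csv')
    paths = sorted(glob.glob(pattern, recursive=True))
    frames = [pd.read_csv(p, dtype=str) for p in paths]
    # sort=False keeps columns in first-seen order instead of sorting them;
    # ignore_index=True gives the result a fresh 0..n-1 row index
    return pd.concat(frames, ignore_index=True, sort=False)
```

The result could then be written with concat_csv_tree('.').to_csv('out.csv', index=False), where index=False avoids adding an extra index column. If the output file is written inside the same tree, a later run would pick it up as input, so writing it elsewhere may be safer.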

If you want to reorder the columns, just index the frame with a list of column names.

cols = sorted(df.columns)
df = df[cols]
# or
df = df[sorted(df.columns)]
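
Sorting alphabetically moves columns such as ID and Type away from the front. As a small sketch of the same list-indexing idea, the hypothetical helper order_columns below puts a chosen set of columns first and sorts the rest. The helper and its parameter names are illustrative assumptions, not part of pandas.

```python
import pandas as pd


def order_columns(df, first=()):
    """Return df with the columns in `first` leading and the rest sorted."""
    # keep only the requested leading columns that actually exist
    lead = [c for c in first if c in df.columns]
    rest = sorted(c for c in df.columns if c not in lead)
    return df[lead + rest]


df = pd.DataFrame({'Type': ['Query'], 'CC': [None], 'ID': [12821767]})
df = order_columns(df, first=['ID', 'Type'])
```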


 