
Appending CSV files, matching unordered columns

Problem: matching columns while appending CSV files

I have 50 .csv files where each column is a word, each row is a time of day and each file holds all words for one day. They look like this:

Date  Time Aword Bword Cword Dword
Date1 t1   0     1     0     12
Date1 t2   0     6     3     0

Date  Time Eword Fword Gword Hword Bword
Date2 t1   0     0     1     0     3
Date2 t2   2     0     0     19    0

I want to append the files so that any columns with the same word (like Bword in this example) are matched while new words are added in new columns:

Date  Time Aword Bword Cword Dword Eword Fword Gword Hword
Date1 t1   0     1     0     12                       
Date1 t2   0     6     3     0                        
Date2 t1         3                 0     0     1     0   
Date2 t2         0                 2     0     0     19

I'm opening the CSV files as dataframes to manipulate them, but using dataframe.append the new files are added like this:

Date  Time Aword Bword Cword Dword
Date1 t1   0     1     0     12
Date1 t2   0     6     3     0
Date  Time Eword Fword Gword Hword Bword
Date2 t1   0     0     1     0     3
Date2 t2   2     0     0     19    0

Is there a different approach that could align matching columns while appending, i.e. without iterating through each column and checking for matches?

Sincere apologies if this question is too vague; I'm new to Python and still struggling to know when I'm thinking un-Pythonically and when I'm using the wrong tools.

EDIT: more information
1) I'll need to perform this task multiple times, once for each of five batches of CSVs
2) The files all have 25 rows but have anything from 5 to 294 columns
3) The order of rows is important Day1(t1, t2...tn) then Day2(t1, t2...tn)
4) The order of columns is not important

I think for this kind of thing you might find using the pandas library a bit easier. Say filelist is a list of file names.

import pandas as pd

df = pd.concat([pd.read_csv(fl, index_col=[0,1]) for fl in filelist])

And you're done! As a side note, if you'd like to combine the Date and Time columns into a single datetime column (depending on their format) you can try

df = pd.concat([pd.read_csv(fl, parse_dates=[['Date', 'Time']]) for fl in filelist])

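Since the parse_dates combining behaviour has shifted across pandas versions, a version-proof alternative is to combine the two columns explicitly after reading. A minimal sketch, using made-up sample values rather than the asker's data:

```python
import pandas as pd

# Stand-in for one file's contents after pd.read_csv.
df = pd.DataFrame({"Date": ["2014-01-01", "2014-01-01"],
                   "Time": ["09:00", "17:30"],
                   "Aword": [1, 2]})

# Concatenate the two string columns and parse the result once.
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
df = df.drop(["Date", "Time"], axis=1)
```

This leaves a single Datetime column alongside the word columns, which can then serve as the index before concatenating.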
IIUC, you can simply use pd.concat, which will automatically align on columns:

>>> csvs = glob.glob("*.csv")
>>> dfs = [pd.read_csv(csv) for csv in csvs]
>>> df_merged = pd.concat(dfs).fillna("")
>>> df_merged
  Aword  Bword Cword   Date Dword Eword Fword Gword Hword Time
0     0      1     0  Date1    12                           t1
1     0      6     3  Date1     0                           t2
0            3        Date2           0     0     1     0   t1
1            0        Date2           2     0     0    19   t2

(Although I'd recommend either fillna(0) or leaving it as NaN; if you fill with an empty string to look like your desired output, the column has to have object dtype, and those are much slower than int or float.)

If you're really particular about the column order, you could cheat and use (re)set_index :

>>> df_merged.set_index(["Date", "Time"]).reset_index()
    Date Time Aword  Bword Cword Dword Eword Fword Gword Hword
0  Date1   t1     0      1     0    12                        
1  Date1   t2     0      6     3     0                        
2  Date2   t1            3                 0     0     1     0
3  Date2   t2            0                 2     0     0    19

If the order of rows and columns is not important (if it is, you need to edit your question to specify how to deal with it when the order differs among files!), there are no conflicts (different values in the same column for the same date and time), and the data fit in memory -- and you prefer to work in plain Python rather than Pandas (I notice you haven't tagged your question with pandas) -- one approach might be the following:

import collections
import csv

def merge_csvs(*filenames):
    result_dict = collections.defaultdict(dict)
    all_columns = set()
    for fn in filenames:
        with open(fn) as f:
            dr = csv.DictReader(f)
            update_cols = True
            for row in dr:
                date = row.pop('Date')
                time = row.pop('Time')
                result_dict[date, time].update(row)
                if update_cols:
                    # Every row in one file has the same columns, so
                    # only the first row of each file needs checking.
                    all_columns.update(row)
                    update_cols = False
    # Pad each row with '' for columns it never saw.
    for row_dict in result_dict.values():
        missing_cols = all_columns.difference(row_dict)
        row_dict.update(dict.fromkeys(missing_cols, ''))
    return result_dict

This produces a dictionary, keyed by (date, time) pairs, of dictionaries whose keys are all the columns found in any of the input CSVs, with either the corresponding value for that date and time, or an empty string if that column was never found for that date and time.

Now you can deal with this as you wish, e.g.

d = merge_csvs('a.csv', 'b.csv', 'c.csv')
for date, time in sorted(d):
    dd = d[date, time]
    outlist = [dd[c] for c in sorted(dd)]
    print(date, time, outlist)

or, of course, write it back to a different CSV, and so forth.
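For that write-back step, a sketch using the standard library's csv.DictWriter, assuming the {(date, time): {column: value}} shape returned by merge_csvs above (the function name write_merged is mine, not from the original):

```python
import csv

def write_merged(result_dict, out_filename):
    """Write a {(date, time): {column: value}} mapping to one CSV."""
    # merge_csvs pads every inner dict to the same key set,
    # so any one row supplies the full column list.
    word_cols = sorted(next(iter(result_dict.values())))
    with open(out_filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['Date', 'Time'] + word_cols)
        writer.writeheader()
        # Sorting the (date, time) keys restores the Day1(t1..tn),
        # Day2(t1..tn) row order the question asks for, provided the
        # date and time strings sort chronologically.
        for date, time in sorted(result_dict):
            row = {'Date': date, 'Time': time}
            row.update(result_dict[date, time])
            writer.writerow(row)
```

Note the caveat in the comment: lexical sorting only matches chronological order if the Date and Time strings are zero-padded consistently.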
