
How to consolidate multiple CSV files with similar and different columns into one using Python and Pandas?

I have 12 CSV files that I am trying to consolidate into one CSV file. One column, SendID, appears in every one of these files. SendID is unique and should not be duplicated in the final merged CSV file. For example, four of my 12 CSV files have these columns:

(File 1: A,B,C,D,E), (File 2: A,C,F,H,K), (File 3: A,B,D,H,L), (File 4: A,D,H,N,Q)

So column A is present in every single CSV file and acts as a unique identifying column, or primary key, that should not repeat itself in the final CSV file. The same column may also appear in multiple CSV files; these columns will carry the same value within each file when they are connected by the same SendID (column A in the example above).

The files may also have distinct columns that are present in only a single CSV file and in no other; again, such a column would be attached to the final aggregate row via the SendID primary key. Some columns may also lack a value for a given SendID record across the many CSV files, so one row, based on a unique SendID, may have a value for column K but not column Q. In that case the value of column Q would be NULL or empty for that record.

How can I use Python and Pandas to turn these 12 CSV files into one final CSV file that contains no duplicate SendID records, while also attaching all the various columns across the different files to the SendID primary key, forming one aggregated row per unique SendID record, and of course not creating duplicates of the same column that may appear in multiple CSV files? My apologies in advance, as I know this is a bit verbose, but I am still very new to Python and am trying to learn as much as I can.
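For reference, here is a minimal sketch of one way to do this end to end with DataFrame.combine_first, assuming all 12 files sit in one folder and that SendID is unique within each file as described (the csv_files/*.csv pattern and the output file name are placeholders):

import glob

import pandas as pd

# read each CSV and index it by the shared key so the frames align on SendID
frames = [pd.read_csv(path).set_index('SendID')
          for path in sorted(glob.glob('csv_files/*.csv'))]

# combine_first unions rows and columns, keeping existing values and filling
# gaps from the next frame; shared columns carry the same value per SendID,
# so the order of combination does not matter here
merged = frames[0]
for df in frames[1:]:
    merged = merged.combine_first(df)

merged.reset_index().to_csv('consolidated.csv', index=False)

The answers below build the same result step by step with explicit merges.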

Suppose you have these two data frames:

import pandas as pd

df1 = pd.DataFrame([{'A':'1', 'B':'2'}])
df2 = pd.DataFrame([{'A':'1', 'C':'3'}, {'A':'2', 'C':'4'}])

Now, if you want to merge these two on the basis of column A, i.e. SendID, you can do something like this:

df1.merge(df2, on='A', how='outer').drop_duplicates()

It will result in a merged frame like this:

   A    B  C
0  1    2  3
1  2  NaN  4

So it will not contain duplicate records, and it attaches the various columns to the same primary key, forming one unique record per ID.
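To apply this same pairwise merge to all 12 files, it can be folded over a list of frames, for example (reading the actual files is left as a placeholder). Note that if two files share a non-key column, merge will suffix the copies with _x/_y; the next answer deals with that case explicitly:

from functools import reduce

# dfs = [pd.read_csv(p) for p in paths_to_12_files]  # placeholder
dfs = [df1, df2]
merged = reduce(lambda left, right: left.merge(right, on='A', how='outer'), dfs)
merged = merged.drop_duplicates()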

df1 = pd.DataFrame(columns=list('ABCDE'))
df2 = pd.DataFrame(columns=list('ACFHK'))
df3 = pd.DataFrame(columns=list('ABDHL'))
df4 = pd.DataFrame(columns=list('ADHNQ'))


df_list = [df1, df2, df3, df4]
# rename every column with suffix _1, _2, _3, _4, except the unique-ID column 'A'
# (this assumes 'A' is the first column in each frame)
for i, df in enumerate(df_list):
    suffix = i + 1
    df.columns = ['A'] + (df.columns[1:] + '_%s' % suffix).tolist()
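
# sanity check: 'A' is untouched, every other column carries its file number
df1.columns.tolist()
# ['A', 'B_1', 'C_1', 'D_1', 'E_1']
df2.columns.tolist()
# ['A', 'C_2', 'F_2', 'H_2', 'K_2']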

# outer-merge every df on the unique-ID column 'A'
dfn = df_list[0]
for df in df_list[1:]:
    dfn = pd.merge(dfn, df, on='A', how='outer')
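
# dfn now holds one row per 'A' value with every suffixed column side by side
dfn.columns.tolist()
# ['A', 'B_1', 'C_1', 'D_1', 'E_1', 'C_2', 'F_2', 'H_2', 'K_2',
#  'B_3', 'D_3', 'H_3', 'L_3', 'D_4', 'H_4', 'N_4', 'Q_4']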


# map each original column name to its suffixed variants
obj_col = pd.Series(dfn.columns).to_frame()
obj_col['col'] = obj_col[0].str.rsplit('_', n=1).str[0]

# drop the unique-ID column 'A' from the mapping
cond = obj_col['col'] == 'A'
obj_col = obj_col[~cond]
obj_col = obj_col.groupby('col')[0].agg(list)
col_dict = obj_col.to_dict()
col_dict

# {'B': ['B_1', 'B_3'],
#  'C': ['C_1', 'C_2'],
#  'D': ['D_1', 'D_3', 'D_4'],
#  'E': ['E_1'],
#  'F': ['F_2'],
#  'H': ['H_2', 'H_3', 'H_4'],
#  'K': ['K_2'],
#  'L': ['L_3'],
#  'N': ['N_4'],
#  'Q': ['Q_4']}

# combine the content of same-named columns with combine_first
for col, cols_to_merge in col_dict.items():
    dfn[col] = dfn[cols_to_merge[0]]
    for i in cols_to_merge[1:]:
        dfn[col] = dfn[col].combine_first(dfn[i])
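
# combine_first keeps the existing value and only fills the gaps, e.g.
# pd.Series(['x', None]).combine_first(pd.Series([None, 'y'])) gives
# ['x', 'y'] -- so identical columns from different files collapse into one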

# result
cols = ['A'] + list(col_dict.keys())
result = dfn[cols].copy()
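
# finally, write the consolidated frame to one CSV file
# (the file name 'consolidated.csv' is a placeholder)
result.to_csv('consolidated.csv', index=False)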
