How to consolidate multiple CSV files with similar and different columns into 1 using Python and Pandas?

I have 12 CSV files that I am trying to consolidate into one CSV file. In these 12 files there is one column, SendID, that appears in every single one of them. SendID is unique and should not be duplicated in the final merged CSV file. For example, four of my 12 CSV files have these columns:

(File 1: A,B,C,D,E), (File 2: A,C,F,H,K), (File 3: A,B,D,H,L), (File 4: A,D,H,N,Q)

So column A is present in every single CSV file and acts as a unique identifying column or primary key that should not repeat itself in the final CSV file. There are also instances where the same column may appear in multiple CSV files; these columns will carry the same value within each file if they are connected by the same SendID (or column A, as in the example above).

The files may also have distinct columns that are only present in a single CSV file and in no other file; again, such a column would be attached to the final aggregated row keyed on the SendID primary key column. There may also be rows that do not carry a value for a given column for every single SendID record across the many CSV files. So one row, based on a unique SendID, may have a value for column K but not column Q, in which case the value of column Q would be NULL or empty for that record.

How can I use Python and Pandas to turn these 12 CSV files into one final CSV file that contains no duplicate SendID records, while also attaching all the various columns across the different files to the SendID primary key, forming one aggregated row per unique SendID record, and of course not creating duplicates of the same column that may appear in multiple CSV files? My apologies in advance, as I know this is a bit verbose, but I am still very new to Python and am trying to learn as much as I can.
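
(For reference, one way to load all of the files into a list of DataFrames before merging; a minimal sketch that assumes the 12 CSVs sit in one folder, with data/*.csv as a placeholder pattern.)

import glob
import pandas as pd

# read every CSV in the folder into its own DataFrame (placeholder path)
csv_paths = sorted(glob.glob('data/*.csv'))
frames = [pd.read_csv(path) for path in csv_paths]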

Suppose you have these two data frames:

import pandas as pd

df1 = pd.DataFrame([{'A':'1', 'B':'2'}])
df2 = pd.DataFrame([{'A':'1', 'C':'3'}, {'A':'2', 'C':'4'}])

Now, if you want to merge these two on the basis of column A, i.e. SendID, you can do something like this:

df1.merge(df2, on='A', how='outer').drop_duplicates()

It will result in a merged frame like:

   A    B  C
0  1    2  3
1  2  NaN  4

So it will not contain duplicate records, and it attaches the various columns to the same primary key, forming one unique record per key.
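
If there are more than two frames and their non-key columns are all distinct, the same outer merge can be chained across the whole list, for example with functools.reduce; a minimal sketch, assuming the frames have already been read into a list called frames and share the SendID column (files that do share non-key columns will get _x/_y suffixes from merge, which the suffix-and-combine_first approach below resolves):

from functools import reduce
import pandas as pd

# chain an outer merge on the shared key across every frame in the list
merged = reduce(lambda left, right: left.merge(right, on='SendID', how='outer'), frames)
merged = merged.drop_duplicates()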

df1 = pd.DataFrame(columns=list('ABCDE'))
df2 = pd.DataFrame(columns=list('ACFHK'))
df3 = pd.DataFrame(columns=list('ABDHL'))
df4 = pd.DataFrame(columns=list('ADHNQ'))


df_list = [df1, df2, df3, df4]
# rename every column with suffix _1, _2, _3, _4, except the uniqueID column 'A'
for i, df in enumerate(df_list):
    suffix = i + 1
    df.columns = ['A'] + (df.columns[1:] + '_%s' % suffix).tolist()

# outer merge every df on the uniqueID column 'A'
dfn = df_list[0]
for df in df_list[1:]:
    dfn = pd.merge(dfn, df, on='A', how='outer')


# map each base column name to all of its suffixed variants
obj_col = pd.Series(dfn.columns).to_frame()
obj_col['col'] = obj_col[0].str.rsplit('_', n=1).str[0]

# exclude the uniqueID column 'A' from the combine step
cond = obj_col['col'] == 'A'
obj_col = obj_col[~cond]
obj_col = obj_col.groupby('col')[0].agg(list)
col_dict = obj_col.to_dict()
col_dict

# {'B': ['B_1', 'B_3'],
#  'C': ['C_1', 'C_2'],
#  'D': ['D_1', 'D_3', 'D_4'],
#  'E': ['E_1'],
#  'F': ['F_2'],
#  'H': ['H_2', 'H_3', 'H_4'],
#  'K': ['K_2'],
#  'L': ['L_3'],
#  'N': ['N_4'],
#  'Q': ['Q_4']}

# combine the same column's content with combine_first
for col, columns in col_dict.items():
    dfn[col] = dfn[columns[0]]
    for i in columns[1:]:
        dfn[col] = dfn[col].combine_first(dfn[i])

# result
cols = ['A'] + list(col_dict.keys())
result = dfn[cols].copy()
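
To run the same pipeline on the real files rather than the empty example frames, df_list would be built with pd.read_csv and the merges done on 'SendID' instead of 'A'; a minimal sketch with placeholder file names, ending with the consolidated CSV written out:

# build df_list from the actual files (placeholder names) instead of the empty frames above,
# then rerun the steps above with on='SendID' in place of on='A'
# df_list = [pd.read_csv(name) for name in ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']]

# write the consolidated frame to a single CSV without the pandas index
result.to_csv('consolidated.csv', index=False)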
