Merging multiple dataframes with overlapping rows and different columns

I have multiple pandas data frames with some common columns and some overlapping rows. I would like to combine them in such a way that I have one final data frame with all of the columns and all of the unique rows (overlapping/duplicate rows dropped). The remaining gaps should be NaN.

[Image: visualization of the intended combined data frame]

I have come up with the function below. In essence, it goes through all columns one by one, appending all of the values from each data frame, dropping the duplicates (overlap), and building a new output data frame column by column.

import numpy as np
import pandas as pd

def combine_dfs(dataframes: list):
    
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        ## Series.append was removed in pandas 2.0, so concatenate the pieces instead
        pieces = [df[col] for df in dataframes if col in df.columns]
        column = pd.concat(pieces)
        
        ## Removing overlapping data (assuming consistent values)
        column = column[~column.index.duplicated()]
        
        ## Adding column to output data frame
        column = pd.DataFrame(column)
        output_df = pd.concat([output_df,column], axis=1)
    
    output_df.sort_index(inplace=True)
    return output_df

df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])

print(combine_dfs([df_1,df_2,df_3]))

The output, as intended in the visualization, looks like this:

      A     B   C     D   E     F
0  10.0  20.0  30   NaN  50  60.0
1  11.0  21.0  31   NaN  51  61.0
2  12.0  22.0  32   NaN  52  62.0
3  13.0  23.0  33  43.0  54  63.0
4   NaN   NaN  34  44.0  54  64.0
5   NaN   NaN  35  45.0  55   NaN
6   NaN   NaN  36  46.0  56   NaN

This method works well on small data sets. Is there a way to optimize this?

IIUC, you can chain combine_first:

print(df_1.combine_first(df_2).combine_first(df_3))

      A     B   C     D     E     F
0  10.0  20.0  30   NaN  50.0  60.0
1  11.0  21.0  31   NaN  51.0  61.0
2  12.0  22.0  32   NaN  52.0  62.0
3  13.0  23.0  33  43.0  54.0  63.0
4   NaN   NaN  34  44.0  54.0  64.0
5   NaN   NaN  35  45.0  55.0   NaN
6   NaN   NaN  36  46.0  56.0   NaN
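
With more than a handful of data frames, writing out the chain by hand gets tedious. A minimal sketch of folding combine_first over a list with functools.reduce is below. The helper name combine_all is hypothetical (it is not part of pandas), and the sketch assumes that where frames overlap, the value from the earlier frame in the list should win.

```python
from functools import reduce

import pandas as pd


def combine_all(dataframes):
    """Fold combine_first over a list of data frames, left to right.

    Hypothetical helper, not a pandas function. Where two frames hold a
    value for the same row and column, the earlier frame's value is kept.
    """
    return reduce(lambda left, right: left.combine_first(right), dataframes)


# Same sample frames as in the question
df_1 = pd.DataFrame([[10, 20, 30], [11, 21, 31], [12, 22, 32], [13, 23, 33]],
                    columns=["A", "B", "C"])
df_2 = pd.DataFrame([[33, 43, 54], [34, 44, 54], [35, 45, 55], [36, 46, 56]],
                    columns=["C", "D", "E"], index=[3, 4, 5, 6])
df_3 = pd.DataFrame([[50, 60], [51, 61], [52, 62], [53, 63], [54, 64]],
                    columns=["E", "F"])

combined = combine_all([df_1, df_2, df_3])
print(combined)
```

Note that the order of the list matters when overlapping values disagree: in the sample data, row 3 of column E is 54 in df_2 and 53 in df_3, and the result keeps 54 because df_2 comes first.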
