Merging and updating multiple pandas dataframes with overlapping columns
Merging multiple dataframes with overlapping rows and different columns
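I have multiple pandas dataframes with some common columns and some overlapping rows. I would like to combine them so that I end up with one final dataframe containing all columns and all unique rows (overlapping/duplicate rows dropped). The remaining gaps should be NaNs.
I came up with the function below. In essence, it goes through all columns one by one, appends the values from each dataframe, drops duplicates (the overlap), and builds a new output dataframe column by column.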
import numpy as np
import pandas as pd

def combine_dfs(dataframes: list):
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        # Series.append was removed in pandas 2.0, so gather the pieces and concat them
        pieces = [df[col] for df in dataframes if col in df.columns]
        column = pd.concat(pieces)
        ## Removing overlapping data (assuming consistent values); the first occurrence wins
        column = column[~column.index.duplicated()]
        ## Adding column to output data frame
        output_df = pd.concat([output_df, column], axis=1)
    output_df.sort_index(inplace=True)
    return output_df

df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])
print(combine_dfs([df_1,df_2,df_3]))
As expected from the visualization, the output looks like this:
      A     B   C     D   E     F
0  10.0  20.0  30   NaN  50  60.0
1  11.0  21.0  31   NaN  51  61.0
2  12.0  22.0  32   NaN  52  62.0
3  13.0  23.0  33  43.0  54  63.0
4   NaN   NaN  34  44.0  54  64.0
5   NaN   NaN  35  45.0  55   NaN
6   NaN   NaN  36  46.0  56   NaN
This approach works well on small datasets. Is there a way to optimize it?
IIUC you can chain combine_first:
print(df_1.combine_first(df_2).combine_first(df_3))
      A     B   C     D     E     F
0  10.0  20.0  30   NaN  50.0  60.0
1  11.0  21.0  31   NaN  51.0  61.0
2  12.0  22.0  32   NaN  52.0  62.0
3  13.0  23.0  33  43.0  54.0  63.0
4   NaN   NaN  34  44.0  54.0  64.0
5   NaN   NaN  35  45.0  55.0   NaN
6   NaN   NaN  36  46.0  56.0   NaN
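If the number of dataframes is not fixed, the same chaining can probably be written as a fold over a list. The sketch below is one way to do that with functools.reduce; the helper name combine_all is made up for illustration and is not part of pandas. As with the chained call, earlier dataframes take priority wherever cells overlap.

```python
from functools import reduce

import pandas as pd


def combine_all(dataframes):
    """Fold combine_first over the list (hypothetical helper, not a pandas API).

    Earlier frames win on overlapping cells; later frames only fill gaps.
    """
    return reduce(lambda left, right: left.combine_first(right), dataframes)


# Same sample data as in the question
df_1 = pd.DataFrame([[10, 20, 30], [11, 21, 31], [12, 22, 32], [13, 23, 33]],
                    columns=["A", "B", "C"])
df_2 = pd.DataFrame([[33, 43, 54], [34, 44, 54], [35, 45, 55], [36, 46, 56]],
                    columns=["C", "D", "E"], index=[3, 4, 5, 6])
df_3 = pd.DataFrame([[50, 60], [51, 61], [52, 62], [53, 63], [54, 64]],
                    columns=["E", "F"])

combined = combine_all([df_1, df_2, df_3])
print(combined)
```

Note that row 3 has E=54 in df_2 but E=53 in df_3; because df_2 comes first in the list, its value is the one kept, so the order of the list matters when the overlapping values are not consistent.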