A better way to merge (update\insert) pandas DataFrames

Question

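I have two pandas DataFrames: df_current_data and df_new_data.

My goal is to apply a merge (not the pandas merge function, but a merge in the "update\insert" sense). Whether two rows match is decided by the key columns.

My result needs to be built from three possible row types:

Rows that exist in df_current_data but not in df_new_data: inserted into the result "as is".
Rows that exist in df_new_data but not in df_current_data: inserted into the result "as is".
Rows that exist in both df_new_data and df_current_data: the result takes the row from df_new_data.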

This is the classic merge (update or insert) operation.
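As a point of reference, one minimal way to express these three rules is to concatenate both frames and keep the last occurrence of each key. This is only a sketch: the helper name `upsert` is hypothetical, the frames reuse the sample values from the example in this question, and it assumes each key combination is unique within each frame (duplicates inside one frame would also be collapsed).

```python
import pandas as pd

# Sample data copied from the example in this question
df_current = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})


def upsert(current, new, keys):
    """Hypothetical helper: rows of `new` replace rows of `current` with equal keys."""
    combined = pd.concat([current, new], ignore_index=True)
    # `new` comes last in the concat, so keep="last" keeps its copy of a shared key
    return combined.drop_duplicates(subset=keys, keep="last").reset_index(drop=True)


df_result = upsert(df_current, df_new, keys=["index1", "index2"])
print(df_result)
```

The row order differs from the expected result shown in the example, because the replaced row appears at the position it had in the new frame rather than in the current one.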

Example:

# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  3       3       6

# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:    A  index1  index2
         0  4       2       7
         1  5       3       6
         2  6       4       5

# the result has 2 rows that exist only in current (rows 0,1)
# the result has 2 rows that exist only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of column A comes from new and not from current (5 instead of 3)

In [43]: df_result
Out[43]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  5       3       6
         3  4       2       7
         4  6       4       5

This is what I did:

# left join from current to new, bringing in only the key columns of new
# (left_on must name the keys of the left frame, i.e. current)
df = df_current_source.merge(df_new_source[p_new_keys], how='left',
                             left_on=p_curr_keys, right_on=p_new_keys,
                             indicator=True)

# keep only the rows that exist in current and do not exist in new
df_only_current = df[df['_merge'] == 'left_only'].drop(columns='_merge')

# append the new data to the rows that exist only in current
df_result = pd.concat([df_only_current, df_new_source], ignore_index=True)

Another option is to use the isin function:

df_result = pd.concat([
    df_current_source[~df_current_source[p_key_col_name]
                      .isin(df_new_source[p_key_col_name])],
    df_new_source,
])

The problem is that if I have more than one key column, I can't use isin on a single column and need merge instead.
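For what it's worth, an isin-style filter can still be written for several key columns by turning the key columns into a `MultiIndex` and calling `MultiIndex.isin`. The following is a sketch under that assumption, reusing the sample values from the example above; the names `keys`, `current_keys`, `new_keys`, and `is_replaced` are illustrative only.

```python
import pandas as pd

# Sample data copied from the example in this question
df_current = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})
keys = ["index1", "index2"]

# One composite key per row, built from all key columns
current_keys = pd.MultiIndex.from_frame(df_current[keys])
new_keys = pd.MultiIndex.from_frame(df_new[keys])

# True where the full key combination of a current row also occurs in new
is_replaced = current_keys.isin(new_keys)

# Drop the replaced rows from current, then append every row of new
df_result = pd.concat([df_current[~is_replaced], df_new], ignore_index=True)
print(df_result)
```

Matching on the composite key matters here: index1 = 2 occurs in both frames, but with different index2 values, so that row of current is kept.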

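Assuming that current is much larger than new, I guess the best approach is to update the matching rows of current directly with the rows from new, and then append the rows of new that do not yet exist in current.

But I'm not sure how to do that.

Thanks a lot.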
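The "update the matching rows, then append the rest" idea described above could be sketched roughly as follows. This is an illustration rather than a benchmarked solution: it assumes the key combinations are unique within each frame, it reuses the sample values from the example, and the variable names (`cur`, `new`, `common`, `only_new`) are made up for this sketch.

```python
import pandas as pd

# Sample data copied from the example in this question
df_current = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})
keys = ["index1", "index2"]

# Index both frames by the key columns so rows can be matched by label
cur = df_current.set_index(keys)
new = df_new.set_index(keys)

# Step 1: overwrite the rows of current whose key also exists in new
common = new.index.intersection(cur.index)
cur.loc[common] = new.loc[common]

# Step 2: append the rows of new whose key does not exist in current
only_new = new.loc[new.index.difference(cur.index)]
df_result = pd.concat([cur, only_new]).reset_index()

# Restore the original column order
df_result = df_result[list(df_current.columns)]
print(df_result)
```

Because the matching rows are overwritten where they already sit, current keeps its row order and only the genuinely new rows are added at the end.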

Answer 1

Option 1: use `indicator=True` as part of `merge`:

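import numpy as np

# outer join on the key columns; '_merge' records which frame each row came from
df_out = df_current_source.merge(df_new_source,
                                 on=['index1', 'index2'],
                                 how='outer', indicator=True)

# for keys present in both frames take A from new (A_y); otherwise exactly one
# of A_x / A_y is non-null, so adding them with fill_value=0 picks that value
df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]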

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use `combine_first` together with `set_index`:

df_new_source.set_index(['index1', 'index2'])\
             .combine_first(df_current_source.set_index(['index1', 'index2']))\
             .reset_index()\
             .astype(int)

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6
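One behaviour of `combine_first` worth keeping in mind: it only uses the caller's non-null values, so a NaN in the new frame does not overwrite an existing value in the current frame. The tiny frames below are invented for this illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

keys = ["index1", "index2"]

# Hypothetical one-row frames sharing the same key (3, 6)
current = pd.DataFrame({"A": [3.0], "index1": [3], "index2": [6]}).set_index(keys)
new = pd.DataFrame({"A": [np.nan], "index1": [3], "index2": [6]}).set_index(keys)

# The NaN in `new` is filled from `current` instead of replacing its value
patched = new.combine_first(current)
print(patched.reset_index())
```

If a missing value in the new data is supposed to replace the current value, a strict overwrite such as the merge in Option 1 may be the safer choice.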

Answer 2

See the linked question "join or merge with overwrite in pandas". You can use `combine_first`:

combined_dataframe = df_new_source.set_index('A').combine_first(df_current_source.set_index('A'))
combined_dataframe.reset_index()

which yields:

    A  index1  index2
 0  1   1.0    4.0
 1  2   2.0    5.0
 2  3   2.0    7.0
 3  5   3.0    6.0
 4  6   4.0    5.0
