简体   繁体   中英

Better way for merge (update\insert) pandas dataframes

I have 2 pandas data frames - df_current_data, df_new_data.

my goal is to apply a merge (not a pandas merge function, merge like 'update\\insert'). The check for a match is by key columns.

my result need to built by 3 optional rows-types.

  1. rows which exists in df_current_data but not exists in df_new_data - will insert "as is" to the result.

  2. rows which exists in df_new_data but not exists in df_current_data - will insert "as is" to the result.

  3. rows which exists in df_new_data and exists in df_current_data - the result need to take the rows from df_new_data.

This is a classic merge- upsert action.

example:

# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  3       3       6

# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:    A  index1  index2
         0  4       2       7
         1  5       3       6
         2  6       4       5

# the result has 2 rows that only in current (rows 0,1)
# the result has 2 rows that only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of the column A is from the new and not from the current (5 instead of 2)

In [43]: df_result
Out[43]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  5       3       6
         3  4       2       7
         4  6       4       5

That's what I did:

# left join from source to new
df = df_current_source.merge(df_new_source, how='left', left_on=p_new_keys, 
right_on=p_curr_keys, indicator=True)

# take only the rows that exists in the current and not exists in the source
df_only_current = df[df['_merge'] == 'left_only']

# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])

another option is with isin function:

df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\

.isin(df_new_source[p_key_col_name])], df_new_source])

The problem is that if I have more than 1 key-column I can't use the isin, I need merge.

Assuming that the current is much bigger from the new, I guess the best way is to directly update the matching rows of current by the rows of new, and append the new rows of the "new" data frame into the current.

But I'm not sure how to do that..

Thanks a lot.

Option 1: use indicator=True as part of merge :

df_out = df_current_source.merge(df_new_source, 
                                 on=['index1', 'index2'], 
                                 how='outer', indicator=True)

df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use combined_first with set_index

df_new_source.set_index(['index1', 'index2'])\
             .combine_first(df_current_source.set_index(['index1', 'index2']))\
             .reset_index()\
             .astype(int)

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6

Check this link join or merge with overwrite in pandas . You can use combine_first:

combined_dataframe = df_new_source.set_index('A').combine_first(df_current_source.set_index('A'))
combined_dataframe.reset_index()

Output

    A  index1  index2
 0  1   1.0    4.0
 1  2   2.0    5.0
 2  3   2.0    7.0
 3  5   3.0    6.0
 4  6   4.0    5.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM