
Upsert function in Dataframe - Python

I am trying to update one dataframe with another dataframe, using the first column as the key. If there is an extra row in the second dataframe, it should be inserted into the first dataframe. If there is a row with the same value in the first column but different data in the other columns, that row should be updated. Also, any row that has no value in the first column should be dropped.

Code used -

    df = df_1.combine_first(df_2)\
          .reset_index()\
          .reindex(columns=df_1.columns)

    df = df.drop_duplicates(subset='A', keep= 'last', inplace=False)
    df.dropna(subset=['A'])
    print ("Final Data")
    print (df)

First Dataframe -

    A   B   C
0   45  a   b
1   98  c   d
2   67  bn  k

Second Dataframe -

    A   B   C
0   45  a   d
1   98  c   d
2   67  bn  k
3   90  x   z
4

Final should look like -

    A   B   C
0   45  a   d
1   98  c   d
2   67  bn  k
3   90  x   z

The final dataframe that I get -

      A      B  C
   0  45.0   a  b
   1  98.0   c  d
   2  67.0  bn  k
   3  90.0   x  z
   4

So the data is not being updated, and the row with null values is not being removed either. What am I missing?

Based on my understanding of your question, your second dataframe basically supersedes the first wherever there is a matching key. Where there isn't a match, the extra rows are added to the first dataframe. I am also assuming that there are no duplicate keys in the first column, A.
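As for why the original attempt left the data unchanged: combine_first keeps the calling dataframe's non-null values and only fills its gaps from the argument, and dropna returns a new dataframe rather than modifying df in place. A minimal sketch of both effects, assuming the blank row in the second dataframe is NaN (the question does not say whether it is NaN or an empty string):

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'A': [45, 98, 67], 'B': ['a', 'c', 'bn'], 'C': ['b', 'd', 'k']})
# Assumption: the empty last row of the second dataframe holds NaN values
df_2 = pd.DataFrame({'A': [45, 98, 67, 90, np.nan],
                     'B': ['a', 'c', 'bn', 'x', np.nan],
                     'C': ['d', 'd', 'k', 'z', np.nan]})

# combine_first aligns on the index and only fills missing values in df_1,
# so existing values in df_1 (such as C == 'b' in row 0) are never overwritten
combined = df_1.combine_first(df_2)

# dropna returns a new dataframe; without assigning the result,
# `combined` itself still contains the all-NaN row
cleaned = combined.dropna(subset=['A'])

print(combined)
print(cleaned)
```

Here combined still has five rows and keeps 'b' in row 0, while cleaned has four rows only because the result of dropna was assigned to a new name.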

Framing this requirement a little differently, the final output should contain all the rows in the second dataframe, as well as the values (since they are meant to overwrite the first dataframe if there's a match). Therefore, we will start off using the second dataframe as it is, and then add back the rows that exist in the first dataframe but not in the second. See the example below. (I'm also using a slightly different first dataframe to highlight the effects)

    import pandas as pd

    df1 = pd.DataFrame({'A': [45, 98, 67, 91], 'B': ['a', 'c', 'bn', 'y'], 'C': ['b', 'd', 'k', 'oo']})
    df2 = pd.DataFrame({'A': [45, 98, 67, 90, ''], 'B': ['a', 'c', 'bn', 'x', ''], 'C': ['d', 'd', 'k', 'z', '']})

    # Remove rows with an empty value in the first column. Adapt the condition to
    # your data, e.g. use df2.dropna(subset=['A']) if missing keys are NaN rather than ''.
    df2 = df2.loc[df2['A'] != '']

    # Use column A as the key (index) of both dataframes
    df1 = df1.set_index('A')
    df2 = df2.set_index('A')

    # Find keys in dataframe 1 that are not in dataframe 2
    idx_diff = df1.index.difference(df2.index)
    # Append these rows to dataframe 2
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_ins = df1.loc[idx_diff]
    df3 = pd.concat([df2, df_ins]).reset_index()

    >>> df3
        A   B   C
    0  45   a   d
    1  98   c   d
    2  67  bn   k
    3  90   x   z
    4  91   y  oo
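The same upsert can also be sketched by stacking both dataframes and keeping the last occurrence of each key. This is an alternative to the index-difference approach above, not part of it. The helper name upsert and its parameters are hypothetical, and the sketch assumes missing keys are NaN and that the update dataframe has no duplicate keys of its own:

```python
import numpy as np
import pandas as pd

def upsert(base, updates, key):
    """Hypothetical helper: rows in `updates` replace rows in `base` with the same key."""
    # Stack base first and updates second, so updates come last for each key
    stacked = pd.concat([base, updates], ignore_index=True)
    # Drop rows whose key is missing (assumes missing keys are NaN)
    stacked = stacked.dropna(subset=[key])
    # keep='last' makes the row from `updates` win when a key appears in both
    return stacked.drop_duplicates(subset=key, keep='last').reset_index(drop=True)

# The two dataframes from the question, with the empty row represented as NaN
df1 = pd.DataFrame({'A': [45, 98, 67], 'B': ['a', 'c', 'bn'], 'C': ['b', 'd', 'k']})
df2 = pd.DataFrame({'A': [45, 98, 67, 90, np.nan],
                    'B': ['a', 'c', 'bn', 'x', np.nan],
                    'C': ['d', 'd', 'k', 'z', np.nan]})

result = upsert(df1, df2, 'A')
print(result)
```

Because the NaN key forces column A to float while the dataframes are stacked, the keys come back as floats; result['A'].astype(int) restores integers if that matters.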
