
Append 2 pandas dataframes with subset of rows and columns

I have two dataframes built like this:

df = pd.DataFrame({"date":["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"],
                   "A": [1., 2., 3., 4.],
                   "B": ["a", "b", "c", "d"]})
df["date"] = pd.to_datetime(df["date"])

df_new = pd.DataFrame({"date":["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05", "2019-01-06"],
                       "A": [2, 3.5, 4, 5., 6.],
                       "B": ["b", "c1", "d", "e", "f"]})
df_new["date"] = pd.to_datetime(df_new["date"])

So, my dataframes look like this

df
-----------------------
date            A    B
2019-01-01      1    a
2019-01-02      2    b
2019-01-03      3    c
2019-01-04      4    d

df_new
----------------------
date            A    B
2019-01-02      2    b
2019-01-03      3.5  c1
2019-01-04      4    d
2019-01-05      5    e
2019-01-06      6    f

From these dataframes, I would like to append df to df_new subject to the following conditions:

  1. For any date present in both dataframes, take the row from df_new

  2. For any date present in df but not in df_new, take the row from df

So my expected output looks like this:

Expected output
----------------------
date            A    B
2019-01-01      1    a      (take from df)
2019-01-02      2    b      (take from df_new)
2019-01-03      3.5  c1     (take from df_new)
2019-01-04      4    d      (take from df_new)
2019-01-05      5    e      (take from df_new)
2019-01-06      6    f      (take from df_new)

I can think of finding the row difference between the two dataframes, but that does not work once I take the date column into account. May I have your suggestions? Thank you.

Use concat, remove duplicates by the date column with DataFrame.drop_duplicates, and finally create default unique index values with DataFrame.reset_index:

df = pd.concat([df, df_new]).drop_duplicates('date', keep='last').reset_index(drop=True)
print (df)
        date    A   B
0 2019-01-01  1.0   a
1 2019-01-02  2.0   b
2 2019-01-03  3.5  c1
3 2019-01-04  4.0   d
4 2019-01-05  5.0   e
5 2019-01-06  6.0   f
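Note that keep='last' gives df_new precedence only because df_new is passed second to concat. A minimal sketch (not part of the original answer) showing this, along with the equivalent variant that passes df_new first and keeps the first occurrence:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2019-01-01", "2019-01-02",
                                           "2019-01-03", "2019-01-04"]),
                   "A": [1., 2., 3., 4.],
                   "B": ["a", "b", "c", "d"]})
df_new = pd.DataFrame({"date": pd.to_datetime(["2019-01-02", "2019-01-03",
                                               "2019-01-04", "2019-01-05",
                                               "2019-01-06"]),
                       "A": [2, 3.5, 4, 5., 6.],
                       "B": ["b", "c1", "d", "e", "f"]})

# df_new last + keep='last': its rows win on duplicated dates
out = (pd.concat([df, df_new])
         .drop_duplicates('date', keep='last')
         .reset_index(drop=True))

# Equivalent: df_new first + keep='first'; a sort restores date order,
# since concat would otherwise leave df's 2019-01-01 row at the end
out2 = (pd.concat([df_new, df])
          .drop_duplicates('date', keep='first')
          .sort_values('date')
          .reset_index(drop=True))
```

Both produce the expected six rows, with the 2019-01-03 row taken from df_new (A=3.5, B='c1').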

You can use pandas.DataFrame.merge with how='outer' and then drop_duplicates to remove the duplicate rows. This works because the outer merge lists the left frame's (df_new's) rows first and sort_values is stable, so drop_duplicates('date') keeps the df_new version of each duplicated date.

df1 = df_new.merge(df, how='outer', on=['date','A','B']).sort_values(by='date').drop_duplicates('date').reset_index(drop=True)

print(df1)

Output:

     A   B       date
0  1.0   a 2019-01-01
1  2.0   b 2019-01-02
2  3.5  c1 2019-01-03
3  4.0   d 2019-01-04
4  5.0   e 2019-01-05
5  6.0   f 2019-01-06
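Another option, not shown in the answers above, is DataFrame.combine_first, which aligns on the index, so date has to be set as the index first. Values from the caller (df_new) take precedence, and gaps are filled from df; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2019-01-01", "2019-01-02",
                                           "2019-01-03", "2019-01-04"]),
                   "A": [1., 2., 3., 4.],
                   "B": ["a", "b", "c", "d"]})
df_new = pd.DataFrame({"date": pd.to_datetime(["2019-01-02", "2019-01-03",
                                               "2019-01-04", "2019-01-05",
                                               "2019-01-06"]),
                       "A": [2, 3.5, 4, 5., 6.],
                       "B": ["b", "c1", "d", "e", "f"]})

# combine_first keeps df_new's values wherever its index has a row,
# and falls back to df for dates only present in df (2019-01-01);
# the result index is the sorted union of both date indexes
out = (df_new.set_index('date')
             .combine_first(df.set_index('date'))
             .reset_index())
```

One caveat: combine_first fills per cell, not per row, so a NaN inside a df_new row would be replaced by df's value for that date, whereas the concat/drop_duplicates approach always keeps whole rows.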
