I have two DataFrames, df_old and df_new, with the same columns: x (the identifier or index) and y (the data). I want to overwrite the data in df_old with whatever is available in df_new and add the rows from df_new that are not in df_old, so basically an outer merge on x with overwrite of y.
I tried pandas.DataFrame.merge and pandas.DataFrame.update, but I could not achieve the desired result in one line or without row-wise computations.
Example:
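import numpy as np
import pandas as pd

# Existing data: x = 0..9
x = np.array(range(0, 10))
y = np.array(range(0, 10))
df_old = pd.DataFrame(data={'x':x,'y':y})
# New data: x = 5..14, overlapping df_old on x = 5..9
x = np.array(range(5, 15))
y = np.array(range(0, 10))
df_new = pd.DataFrame(data={'x':x,'y':y})
# Desired result: y from df_old for x = 0..4, y from df_new for x = 5..14
x = np.array(range(0, 15))
y = np.append(np.array(range(0, 5)), np.array(range(0, 10)))
df_desired = pd.DataFrame(data={'x':x,'y':y})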
EDIT: The focus of the solution should be on execution time and memory efficiency. A simple solution, e.g. a one-liner, would be nice to have.
I would use pd.merge, as you suggested, and then post-process the result to overwrite the old data where possible.
Using your example:
df_merged = pd.merge(df_old, df_new, on="x", how="outer")
>>> df_merged
     x  y_x  y_y
0    0  0.0  NaN
1    1  1.0  NaN
2    2  2.0  NaN
3    3  3.0  NaN
4    4  4.0  NaN
5    5  5.0  0.0
6    6  6.0  1.0
7    7  7.0  2.0
8    8  8.0  3.0
9    9  9.0  4.0
10  10  NaN  5.0
11  11  NaN  6.0
12  12  NaN  7.0
13  13  NaN  8.0
14  14  NaN  9.0
Then create a new y column that takes the new value (y_y) where one exists and falls back to the old value (y_x) otherwise:
df_merged["y"] = [y if pd.notna(y) else x for x,y in zip(df_merged["y_x"], df_merged["y_y"])]
>>> df_merged
     x  y_x  y_y    y
0    0  0.0  NaN  0.0
1    1  1.0  NaN  1.0
2    2  2.0  NaN  2.0
3    3  3.0  NaN  3.0
4    4  4.0  NaN  4.0
5    5  5.0  0.0  0.0
6    6  6.0  1.0  1.0
7    7  7.0  2.0  2.0
8    8  8.0  3.0  3.0
9    9  9.0  4.0  4.0
10  10  NaN  5.0  5.0
11  11  NaN  6.0  6.0
12  12  NaN  7.0  7.0
13  13  NaN  8.0  8.0
14  14  NaN  9.0  9.0
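Since the question asks about execution time, note that the list comprehension above loops over the rows in Python. A vectorised sketch of the same step, assuming the default _x/_y merge suffixes, could use Series.fillna. Like the list comprehension, it keeps the old value wherever the new one is NaN, so a genuine NaN in df_new would not overwrite anything. I have not benchmarked it here.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df_old = pd.DataFrame({"x": np.arange(0, 10), "y": np.arange(0, 10)})
df_new = pd.DataFrame({"x": np.arange(5, 15), "y": np.arange(0, 10)})

df_merged = pd.merge(df_old, df_new, on="x", how="outer")

# Prefer the new value (y_y); fall back to the old value (y_x) where it is NaN.
df_merged["y"] = df_merged["y_y"].fillna(df_merged["y_x"])
```

The result is still float, because the outer merge introduces NaN into both y columns.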
Then just select the correct columns:
df_merged = df_merged[["x","y"]]
>>> df_merged
     x    y
0    0  0.0
1    1  1.0
2    2  2.0
3    3  3.0
4    4  4.0
5    5  0.0
6    6  1.0
7    7  2.0
8    8  3.0
9    9  4.0
10  10  5.0
11  11  6.0
12  12  7.0
13  13  8.0
14  14  9.0
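If a shorter form is preferred, the three steps above might be collapsed by moving x into the index and using DataFrame.combine_first, which takes values from the calling frame and fills the gaps from the other one. This is only a sketch under the assumption that x is unique within each frame. The name result is mine, not from the question.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df_old = pd.DataFrame({"x": np.arange(0, 10), "y": np.arange(0, 10)})
df_new = pd.DataFrame({"x": np.arange(5, 15), "y": np.arange(0, 10)})

# Values from df_new win; rows that exist only in df_old are filled in from it.
result = (
    df_new.set_index("x")
    .combine_first(df_old.set_index("x"))
    .reset_index()
)
```

As with the merge approach, a NaN in df_new would be filled from df_old rather than overwrite it, and the dtype of y in the result may depend on the pandas version.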
I found a possibly efficient solution, since it only uses boolean masks, but it is still ugly, and I would have expected pandas to have a more convenient way to handle such tasks:
df_desired = pd.merge(df_old, df_new, how='outer', on='x')
# Series.append was removed in pandas 2.0; pd.concat does the same job here.
df_desired['y'] = pd.concat([df_desired.loc[df_desired['y_y'].isna(), 'y_x'], df_desired.loc[df_desired['y_y'].notna(), 'y_y']])
df_desired = df_desired[['x','y']]
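An alternative that avoids the merge altogether, sketched under the assumption that x is unique within each frame: stack the two frames and keep the last occurrence of every x, so rows from df_new win. I have not timed it against the merge version, but it never introduces NaN, so y stays an integer column.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df_old = pd.DataFrame({"x": np.arange(0, 10), "y": np.arange(0, 10)})
df_new = pd.DataFrame({"x": np.arange(5, 15), "y": np.arange(0, 10)})

df_desired = (
    pd.concat([df_old, df_new], ignore_index=True)  # df_new rows come last
    .drop_duplicates(subset="x", keep="last")       # so they win on duplicate x
    .sort_values("x")
    .reset_index(drop=True)
)
```

One behavioural difference from the merge approach: here a NaN in df_new does overwrite the old value, because whole rows are replaced rather than individual non-null cells.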