I have been searching a long time for an answer but could not find it. I have two dataframes, one is target
, the other backup
which both have the same columns. What I want to do is to look at a given column and add all the rows from backup
to target
which are not in target
. The most straightforward solution for this is:
import pandas as pd
import numpy as np
target = pd.DataFrame({
"key1": ["K1", "K2", "K3", "K5"],
"A": ["A1", "A2", "A3", np.nan],
"B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
"key1": ["K1", "K2", "K3", "K4", "K5"],
"A": ["A1", "A", "A3", "A4", "A5"],
"B": ["B1", "B2", "B3", "B4", "B5"],
})
merged = target.copy()
for item in backup.key1.unique():
if item not in target.key1.unique():
merged = pd.concat([merged, backup.loc[backup.key1 == item]])
merged.reset_index(drop=True, inplace=True)
giving
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
4 K4 A4 B4
Now I have tried several things using just pandas where none of them works.
# Does not work because it creates duplicate lines and if dropped, the updated rows which are different will not be dropped -- compare the line with A or NaN
pd.concat([target, backup]).drop_duplicates()
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
1 K2 A B2
3 K4 A4 B4
4 K5 A5 B5
# Does not work because the backup would overwrite data in the target -- NaN
pd.merge(target, backup, how="right")
key1 A B
0 K1 A1 B1
1 K2 A B2
2 K3 A3 B3
3 K4 A4 B4
4 K5 A5 B5
Importantly, it is not a duplicate of this post since I do not want to have a new column and more importantly, the values are not NaN
in target
, they are simply not there. Furthermore, if then I would use what is proposed for merging the columns, the NaN
in the target
would be replaced by the value in backup
which is unwanted.
It is not a duplicate of this post which uses the combine_first pandas because in that case the NaN
is filled by the value from the backup
which is wrong:
target.combine_first(backup)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 A4 B5
4 K5 A5 B5
target.join(backup, on=["key1"])
gives me an annoying
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
which I really do not get since both are pure strings and the proposed solution does not work.
So I would like to ask, what am I missing? How can I do it using some pandas
methods? Thanks a lot.
Use concat
with filtered backup
rows with not exist in target.key1
filtered by Series.isin
in boolean indexing
:
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print (merged)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
Maybe you can try this with a 'subset' parameter in df.drop_duplicates()
?
pd.concat([target, backup]).drop_duplicates(subset = "key1")
which gives output:
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.