I am trying to merge two dataframes in pandas with large sets of data, however it is causing me some problems. I will try to illustrate with a smaller example.
df1 has a list of equipment and several columns relating to the equipment:
Item ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement Mixer Tim Active New York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong Kong
5 Winch Kojo Inactive Sydney
6 Circular Saw Alex Active Moscow
df2 has a list of instances where equipment has been used. This has some similar columns to df1, however some of the fields are NaN values and also instances of equipment not in df1 have also been recorded:
Item ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement Mixer NaN 29/02/2020 New York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt
The resulting dataframe I was hoping to end up with was this:
Item ID Equipment Owner Status Date Location
1 Jackhammer James Active 08/09/2020 London
1 Jackhammer James Active 08/10/2020 London
2 Cement Mixer Tim Active 29/02/2020 New York
3 Drill Sarah Active 11/02/2020 Paris
3 Drill Sarah Active 30/11/2020 Paris
3 Drill Sarah Active 21/12/2020 Paris
4 Ladder Luke Inactive NaN Hong Kong
5 Winch Kojo Inactive NaN Sydney
6 Circular Saw Alex Active 19/06/2020 Moscow
7 Hammer Ken NaN 21/12/2020 Toronto
8 Sander Ezra NaN 19/06/2020 Frankfurt
Instead, with the following code I'm getting duplicate rows, I think because of the NaN values:
data = pd.merge(df1, df2, how='outer', on=['Item ID'])
Item ID Equipment_x Equipment_y Owner_x Owner_y Status Date Location_x Location_y
1 Jackhammer NaN James James Active 08/09/2020 London London
1 Jackhammer NaN James James Active 08/10/2020 London London
2 Cement Mixer NaN Tim NaN Active 29/02/2020 New York New York
3 Drill NaN Sarah Sarah Active 11/02/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 30/11/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 21/12/2020 Paris NaN
4 Ladder NaN Luke NaN Inactive NaN Hong Kong Hong Kong
5 Winch NaN Kojo NaN Inactive NaN Sydney Sydney
6 Circular Saw NaN Alex NaN Active 19/06/2020 Moscow Moscow
7 NaN Hammer NaN Ken NaN 21/12/2020 NaN Toronto
8 NaN Sander NaN Ezra NaN 19/06/2020 NaN Frankfurt
Ideally I could just drop the _y columns however the data in the bottom rows means I would be losing important information. Instead the only thing I can think of merging the columns and force pandas to compare the values in each column and always favour the non-NaN value. I'm not sure if this is possible or not though?
merging the columns and force pandas to compare the values in each column and always favour the non-NaN value.
Is this what you mean?
In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])
In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])
In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])
In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)
In [49]: data
Out[49]:
Item ID Equipment Status Date Location Owner
0 1 Jackhammer Active 08/09/2020 London James
1 1 Jackhammer Active 08/10/2020 London James
2 2 Cement Mixer Active 29/02/2020 New York Tim
3 3 Drill Active 11/02/2020 Paris Sarah
4 3 Drill Active 30/11/2020 Paris Sarah
5 3 Drill Active 21/12/2020 Paris Sarah
6 4 Ladder Inactive NaN Hong Kong Luke
7 5 Winch Inactive NaN Sydney Kojo
8 6 Circular Saw Active 19/06/2020 Moscow Alex
9 7 Hammer NaN 21/12/2020 Toronto Ken
10 8 Sander NaN 19/06/2020 Frankfurt Ezra
(To my knowledge) you cannot really merge on null column. However you can use fillna
to take the value and replace it by something else if it is NaN
. Not a very elegant solution, but it seems to solve your example at least.
Generically you can do that as follows:
# merge the two dataframes using a suffix that ideally does
# not appear in your data
suffix_string='_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# now remove the duplicate columns by mergeing the content
# use the value of column + suffix_string if column is empty
columns_to_remove= list()
for col in df1.columns:
second_col= f'{col}{suffix_string}'
if second_col in data.columns:
data[col]= data[second_col].where(data[col].isna(), data[col])
columns_to_remove.append(second_col)
if columns_to_remove:
data.drop(columns=columns_to_remove, inplace=True)
data
The result is:
Item_ID Equipment Owner Status Location Date
0 1 Jackhammer James Active London 08/09/2020
1 1 Jackhammer James Active London 08/10/2020
2 2 Cement_Mixer Tim Active New_York 29/02/2020
3 3 Drill Sarah Active Paris 11/02/2020
4 3 Drill Sarah Active Paris 30/11/2020
5 3 Drill Sarah Active Paris 21/12/2020
6 4 Ladder Luke Inactive Hong_Kong NaN
7 5 Winch Kojo Inactive Sydney NaN
8 6 Circular_Saw Alex Active Moscow 19/06/2020
9 7 Hammer Ken NaN Toronto 21/12/2020
10 8 Sander Ezra NaN Frankfurt 19/06/2020
On the following test data:
df1= pd.read_csv(io.StringIO("""Item_ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement_Mixer Tim Active New_York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong_Kong
5 Winch Kojo Inactive Sydney
6 Circular_Saw Alex Active Moscow"""), sep='\s+')
df2= pd.read_csv(io.StringIO("""Item_ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement_Mixer NaN 29/02/2020 New_York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular_Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt"""), sep='\s+')
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.