
How do you merge dataframes in pandas with different shapes?

I am trying to merge two dataframes in pandas with large sets of data, however it is causing me some problems. I will try to illustrate with a smaller example.

df1 has a list of equipment and several columns relating to the equipment:

Item ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement Mixer  Tim   Active   New York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong Kong
5       Winch         Kojo  Inactive Sydney
6       Circular Saw  Alex  Active   Moscow

df2 lists instances where equipment has been used. It shares some columns with df1, but some of its fields are NaN, and it also records equipment that is not in df1:

Item ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement Mixer  NaN   29/02/2020 New York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt

The resulting dataframe I was hoping to end up with was this:

Item ID Equipment     Owner Status   Date       Location
1       Jackhammer    James Active   08/09/2020 London
1       Jackhammer    James Active   08/10/2020 London
2       Cement Mixer  Tim   Active   29/02/2020 New York
3       Drill         Sarah Active   11/02/2020 Paris
3       Drill         Sarah Active   30/11/2020 Paris
3       Drill         Sarah Active   21/12/2020 Paris
4       Ladder        Luke  Inactive NaN        Hong Kong
5       Winch         Kojo  Inactive NaN        Sydney
6       Circular Saw  Alex  Active   19/06/2020 Moscow
7       Hammer        Ken   NaN      21/12/2020 Toronto
8       Sander        Ezra  NaN      19/06/2020 Frankfurt

Instead, with the following code I'm getting duplicate columns (suffixed _x and _y), I think because of the NaN values:

data = pd.merge(df1, df2, how='outer', on=['Item ID'])

Item ID Equipment_x  Equipment_y Owner_x Owner_y Status   Date       Location_x  Location_y
1       Jackhammer   NaN         James   James   Active   08/09/2020 London      London
1       Jackhammer   NaN         James   James   Active   08/10/2020 London      London
2       Cement Mixer NaN         Tim     NaN     Active   29/02/2020 New York    New York
3       Drill        NaN         Sarah   Sarah   Active   11/02/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   30/11/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   21/12/2020 Paris       NaN
4       Ladder       NaN         Luke    NaN     Inactive NaN        Hong Kong   Hong Kong
5       Winch        NaN         Kojo    NaN     Inactive NaN        Sydney      Sydney
6       Circular Saw NaN         Alex    NaN     Active   19/06/2020 Moscow      Moscow
7       NaN          Hammer      NaN     Ken     NaN      21/12/2020 NaN         Toronto
8       NaN          Sander      NaN     Ezra    NaN      19/06/2020 NaN         Frankfurt

Ideally I could just drop the _y columns, but the data in the bottom rows means I would be losing important information. The only thing I can think of is merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value. I'm not sure whether this is possible, though.
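For context, here is a minimal sketch of where the _x and _y columns come from. The frames left and right are tiny hypothetical stand-ins, not the real data. As far as I can tell, pandas adds the suffixes to every column that exists in both frames but is not listed in on:

```python
import pandas as pd

# Hypothetical miniature frames; only "Item ID" is used as the merge key.
left = pd.DataFrame({"Item ID": [1, 2], "Owner": ["James", "Tim"]})
right = pd.DataFrame({"Item ID": [1, 2], "Owner": ["James", None]})

# "Owner" exists in both frames but is not a key, so it is duplicated
# with the default suffixes "_x" (left frame) and "_y" (right frame).
merged = pd.merge(left, right, how="outer", on=["Item ID"])
print(list(merged.columns))  # ['Item ID', 'Owner_x', 'Owner_y']
```

So the suffixed columns appear whether or not there are NaN values; the NaN values only decide which of the two copies holds the useful value in a given row.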

"merging the columns and force pandas to compare the values in each column and always favour the non-NaN value."

Is this what you mean?

In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])                         

In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])                            

In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])                                     

In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)                

In [49]: data                                                                                        
Out[49]: 
    Item ID     Equipment    Status        Date   Location  Owner
0         1    Jackhammer    Active  08/09/2020     London  James
1         1    Jackhammer    Active  08/10/2020     London  James
2         2  Cement Mixer    Active  29/02/2020   New York    Tim
3         3         Drill    Active  11/02/2020      Paris  Sarah
4         3         Drill    Active  30/11/2020      Paris  Sarah
5         3         Drill    Active  21/12/2020      Paris  Sarah
6         4        Ladder  Inactive         NaN  Hong Kong   Luke
7         5         Winch  Inactive         NaN     Sydney   Kojo
8         6  Circular Saw    Active  19/06/2020     Moscow   Alex
9         7        Hammer       NaN  21/12/2020    Toronto    Ken
10        8        Sander       NaN  19/06/2020  Frankfurt   Ezra

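As far as I know, you cannot make a merge treat NaN in a key column as a match for a real value: pandas only matches a NaN key against another NaN key. However, after merging you can use fillna to replace a value with the one from the other column wherever it is NaN. Not a very elegant solution, but it seems to solve your example at least.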
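To show the fillna step in isolation, here is a minimal sketch with two hypothetical Series standing in for Owner_y and Owner_x (the names owner_y, owner_x, owner and same are made up for illustration). It assumes both Series share the same index, which is the case for two columns of one merged frame:

```python
import pandas as pd

# Hypothetical stand-ins for the Owner_y (df2) and Owner_x (df1) columns.
owner_y = pd.Series(["James", None, None, "Ken"])
owner_x = pd.Series(["James", "Tim", None, None])

# Take owner_y where it has a value, otherwise fall back to owner_x.
# Rows that are missing in both columns stay missing.
owner = owner_y.fillna(owner_x)

# combine_first expresses the same "first non-null value wins" rule.
same = owner_y.combine_first(owner_x)

print(owner)
print(same)
```

Swapping the two Series changes which frame wins when both have a value, so the order matters if df1 and df2 can disagree.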

Also see the related question "pandas combine two columns with null values".

Generically you can do that as follows:

# Merge the two dataframes using a suffix that ideally does
# not appear in your column names.
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))

# Now remove the duplicate columns by merging their content:
# use the value of column + suffix_string wherever column is empty.
columns_to_remove = []
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        # Keep data[col] unless it is NaN; in that case take data[second_col].
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data = data.drop(columns=columns_to_remove)
data

The result is:

    Item_ID     Equipment  Owner    Status   Location        Date
0         1    Jackhammer  James    Active     London  08/09/2020
1         1    Jackhammer  James    Active     London  08/10/2020
2         2  Cement_Mixer    Tim    Active   New_York  29/02/2020
3         3         Drill  Sarah    Active      Paris  11/02/2020
4         3         Drill  Sarah    Active      Paris  30/11/2020
5         3         Drill  Sarah    Active      Paris  21/12/2020
6         4        Ladder   Luke  Inactive  Hong_Kong         NaN
7         5         Winch   Kojo  Inactive     Sydney         NaN
8         6  Circular_Saw   Alex    Active     Moscow  19/06/2020
9         7        Hammer    Ken       NaN    Toronto  21/12/2020
10        8        Sander   Ezra       NaN  Frankfurt  19/06/2020

On the following test data:

import io
import pandas as pd

# Whitespace-separated test data; multi-word values use underscores
# so that each one parses as a single field.
df1 = pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement_Mixer  Tim   Active   New_York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong_Kong
5       Winch         Kojo  Inactive Sydney
6       Circular_Saw  Alex  Active   Moscow"""), sep=r'\s+')

# The literal string NaN is parsed as a missing value by read_csv.
df2 = pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement_Mixer  NaN   29/02/2020 New_York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular_Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt"""), sep=r'\s+')
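
If you need this in more than one place, the loop can be wrapped in a small helper. The following is only a sketch: merge_coalesce, its parameters, and the frames equipment and usage are hypothetical names chosen for illustration. It uses fillna instead of where, which should give the same values here, and it raises an error if the suffix already occurs in a column name:

```python
import pandas as pd


def merge_coalesce(left, right, on, suffix="_DF2"):
    """Outer-merge two frames and fill gaps in left's columns from right.

    Hypothetical helper; the name and signature are illustrative only.
    """
    # Refuse to run if the suffix could be confused with a real column name.
    clashes = [c for c in list(left.columns) + list(right.columns)
               if str(c).endswith(suffix)]
    if clashes:
        raise ValueError(f"suffix {suffix!r} already used by columns: {clashes}")

    merged = pd.merge(left, right, how="outer", on=on, suffixes=("", suffix))
    for col in left.columns:
        other = f"{col}{suffix}"
        if other in merged.columns:
            # Prefer the left value; fall back to the right one where it is NaN.
            merged[col] = merged[col].fillna(merged[other])
            merged = merged.drop(columns=other)
    return merged


# Tiny made-up frames, not the data from the question.
equipment = pd.DataFrame({"Item_ID": [1, 2],
                          "Owner": ["James", "Tim"],
                          "Status": ["Active", "Active"]})
usage = pd.DataFrame({"Item_ID": [2, 3],
                      "Owner": [None, "Ken"],
                      "Date": ["29/02/2020", "21/12/2020"]})

result = merge_coalesce(equipment, usage, on=["Item_ID"])
print(result)
```

Note that this version favours the left frame when both frames hold a value; swap the arguments of fillna if the right frame should win.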

