
Pandas - How to match up rows if they contain the same value

I have a dataframe with 4 columns: 'age_1', 'name_1', 'age_2' and 'name_2'.

import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

Output

    age_1   name_1  age_2   name_2
0   18.0    Fred    NaN     NaN
4   NaN     NaN     34.0    Jim
6   12.0    Harry   NaN     NaN
9   NaN     NaN     45.0    Fred

All names appear twice (once in name_1 and once in name_2). I want to combine the rows where name_1 and name_2 hold the same value. For example, from the snippet above, I want the first and last rows put together like this:

    age_1   name_1  age_2   name_2
0   18.0    Fred    45.0    Fred

Any help would be great

You can split the dataframe into two parts and join them using merge. Since the join columns name_1 and name_2 contain nulls, you have to drop the nulls first.

l1 = ['age_1', 'name_1']
l2 = ['age_2', 'name_2']

df[l1].dropna().merge(df[l2].dropna(), left_on='name_1', right_on='name_2')

#outputs:
   age_1 name_1  age_2 name_2
0   18.0   Fred   45.0   Fred
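
To illustrate why the dropna step matters: pandas merge matches null keys against each other, so skipping it pairs every NaN in name_1 with every NaN in name_2. A minimal sketch using the sample frame from the question (the names without_dropna and with_dropna are introduced here for illustration and are not part of the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

l1 = ['age_1', 'name_1']
l2 = ['age_2', 'name_2']

# Without dropna: the two NaN keys on each side match each other (2 x 2 rows),
# in addition to the genuine Fred/Fred match.
without_dropna = df[l1].merge(df[l2], left_on='name_1', right_on='name_2')

# With dropna: only the genuine name match survives.
with_dropna = df[l1].dropna().merge(df[l2].dropna(),
                                    left_on='name_1', right_on='name_2')

print(len(without_dropna))  # 5
print(len(with_dropna))     # 1
```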

If df is your dataframe:

left = df[["age_1", "name_1"]].dropna(how="all")
right = df[["name_2", "age_2"]].dropna(how="all").set_index("name_2")[["age_2"]]
left.join(right, on="name_1")

Will give you approximately what you're looking for (the name will not be repeated as in your example: because it's the key being joined on, it appears just once).

Note that this is a left join: any name_2 values that have no corresponding name_1 will be thrown away (however, name_1 values with no corresponding name_2, like Harry, will remain). If you want to keep those name_2 values, just add how="outer" as a keyword argument to the join method. If you're sure that all names will always appear twice, then it won't matter.
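
As a rough sketch of the difference, here are both variants on the sample frame from the question. The variable names left, right, left_joined and outer_joined are my own, chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

left = df[["age_1", "name_1"]].dropna(how="all")
# Lookup table: age_2 indexed by name_2.
right = df[["name_2", "age_2"]].dropna(how="all").set_index("name_2")[["age_2"]]

# Default left join: keeps Fred (matched) and Harry (age_2 is NaN); Jim is dropped.
left_joined = left.join(right, on="name_1")

# Outer join: additionally keeps a row carrying Jim's age_2.
outer_joined = left.join(right, on="name_1", how="outer")

print(left_joined)
print(outer_joined)
```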

If a name_1 has multiple matching name_2 values, the row will be repeated to accommodate as many of them as it has. Again, if each name appears exactly twice (exactly once in the name_1 column and exactly once in the name_2 column), this won't matter. I would add a check for that like this:

# check that there are no repeats
for col in ("name_1", "name_2"):
    assert df[col].dropna().nunique() == df[col].dropna().shape[0]

# check that all `name_1`s have corresponding `name_2`s
assert set(df["name_1"].dropna()) == set(df["name_2"].dropna())
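
As a possible alternative to the hand-written no-repeats check, merge accepts a validate argument that raises pandas.errors.MergeError when the join keys are not unique. A sketch on a hypothetical frame dup, with invented data in which Fred appears twice in name_2:

```python
import numpy as np
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data for illustration: 'Fred' occurs twice in name_2.
dup = pd.DataFrame({'age_1': [18, np.nan, np.nan],
                    'name_1': ['Fred', np.nan, np.nan],
                    'age_2': [np.nan, 45, 50],
                    'name_2': [np.nan, 'Fred', 'Fred']})

left = dup[['age_1', 'name_1']].dropna()
right = dup[['age_2', 'name_2']].dropna()

try:
    left.merge(right, left_on='name_1', right_on='name_2',
               validate='one_to_one')
    raised = False
except MergeError:
    # The duplicate key is rejected instead of silently repeating rows.
    raised = True

print(raised)  # True
```

This only covers the uniqueness condition; checking that every name_1 has a matching name_2 still needs the set comparison.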

Edited: added the dropna calls, as suggested in the comments.

import numpy as np
import pandas as pd

df = pd.DataFrame({'age_1': [18, '', 12, ''],
                   'name_1': ['Fred', '', 'Harry', ''],
                   'age_2': ['', 34, '', 45],
                   'name_2': ['', 'Jim', '', 'Fred']})
df1 = df[['age_1', 'name_1']]
df2 = df[['age_2', 'name_2']]
# left-merge the two halves on the name, then drop every row with a missing value
df_new = df1.merge(df2, how='left', left_on='name_1', right_on='name_2')
df_new = df_new.replace('', np.nan)
df_new.dropna(how='any', inplace=True)
df_new

Output

   age_1    name_1  age_2   name_2
0   18.0    Fred    45.0    Fred


 