Pandas - How to match up rows if they contain the same value

Question

I have a dataframe with 4 columns: 'age_1', 'name_1', 'age_2' and 'name_2'.

import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

Output

    age_1   name_1  age_2   name_2
0   18.0    Fred    NaN     NaN
4   NaN     NaN     34.0    Jim
6   12.0    Harry   NaN     NaN
9   NaN     NaN     45.0    Fred

All names appear twice (once in name_1 and once in name_2). I want to combine the rows where name_1 and name_2 contain the same value. For example, from the snippet above, I want the first and last rows combined like this:

    age_1   name_1  age_2   name_2
0   18.0    Fred    45.0    Fred

Any help would be great.

Answer 1

You can split the DataFrame into two parts and join them using merge. Because the join columns name_1 and name_2 contain nulls, you have to drop the nulls first; otherwise merge matches the null keys against each other and produces spurious rows.

l1 = ['age_1', 'name_1']
l2 = ['age_2', 'name_2']

df[l1].dropna().merge(df[l2].dropna(), left_on='name_1', right_on='name_2')

#outputs:
   age_1 name_1  age_2 name_2
0   18.0   Fred   45.0   Fred
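
To see why the dropna calls matter, the sketch below runs the same merge with and without them on the question's data. The names left, right, no_drop and with_drop are only illustrative; they do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

left = df[['age_1', 'name_1']]
right = df[['age_2', 'name_2']]

# Without dropna: rows whose key is NaN are matched against each other,
# so the result contains extra rows that hold no real name.
no_drop = left.merge(right, left_on='name_1', right_on='name_2')

# With dropna: only rows that carry a real name take part in the join.
with_drop = left.dropna().merge(right.dropna(),
                                left_on='name_1', right_on='name_2')

print(len(no_drop), len(with_drop))
print(with_drop)
```

The second frame should contain only the Fred row, while the first also carries the rows produced by the NaN keys.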

Answer 2

If df is your dataframe:

left = df[["age_1", "name_1"]].dropna(how="all")
right = df[["name_2", "age_2"]].dropna(how="all").set_index("name_2")[["age_2"]]
left.join(right, on="name_1")

This will give you approximately what you're looking for. The name will not be repeated as in your example: since it is the key being joined on, it appears just once.

Note that this is a left join: any name_2 values with no corresponding name_1 are thrown away, whereas name_1 values with no corresponding name_2, like Harry, remain. If you want to keep those name_2 values, add how="outer" as a keyword argument to the join method. If you're sure that every name always appears twice, it won't matter.
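
One way to inspect which names lose their partner is sketched below. It swaps join for merge with how="outer" and indicator=True, which is not what this answer uses but makes the left-only and right-only rows visible in a `_merge` column. The names out and unmatched are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 4, 6, 9],
                  data={'age_1': [18, np.nan, 12, np.nan],
                        'name_1': ['Fred', np.nan, 'Harry', np.nan],
                        'age_2': [np.nan, 34, np.nan, 45],
                        'name_2': [np.nan, 'Jim', np.nan, 'Fred']})

# Outer merge keeps names from both sides; indicator=True adds a
# `_merge` column with the values 'left_only', 'right_only' or 'both'.
out = (df[['age_1', 'name_1']].dropna()
       .merge(df[['age_2', 'name_2']].dropna(),
              how='outer', left_on='name_1', right_on='name_2',
              indicator=True))

# Rows that found no partner on the other side
unmatched = out[out['_merge'] != 'both']
print(unmatched)
```

On the question's data, Harry should show up as left_only and Jim as right_only, while Fred is the only row marked both.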

If a name_1 has multiple matching name_2 values, the row is repeated to accommodate each of them. Again, if each name appears exactly twice (exactly once in the name_1 column and exactly once in the name_2 column), this won't matter. I would add a check for that like this:

# check that there are no repeats
for col in ("name_1", "name_2"):
    assert df[col].dropna().nunique() == df[col].dropna().shape[0]

# check that all `name_1`s have corresponding `name_2`s
assert set(df["name_1"].dropna()) == set(df["name_2"].dropna())

Edited to add the dropna calls, as suggested in the comments.
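
As a possible alternative to the hand-written uniqueness assert, merge accepts a validate argument that raises pandas.errors.MergeError when the keys are not unique. The sketch below uses made-up data in which Fred appears twice in name_1. It only covers the "no repeats" check, not the check that every name has a partner.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: 'Fred' is repeated in name_1 on purpose
left = pd.DataFrame({'age_1': [18, 12, 20],
                     'name_1': ['Fred', 'Harry', 'Fred']})
right = pd.DataFrame({'age_2': [34, 45],
                      'name_2': ['Jim', 'Fred']})

try:
    # validate='one_to_one' requires the keys to be unique on both sides
    left.merge(right, left_on='name_1', right_on='name_2',
               validate='one_to_one')
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(duplicate_detected)
```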

Answer 3

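import numpy as np
import pandas as pd

# Here the missing values are empty strings rather than NaN
df = pd.DataFrame({'age_1': [18, '', 12, ''], 'name_1': ['Fred', '', 'Harry', ''],
                   'age_2': ['', 34, '', 45], 'name_2': ['', 'Jim', '', 'Fred']})
df1 = df[['age_1', 'name_1']]
df2 = df[['age_2', 'name_2']]
# Left join on the names; the empty strings also match each other here
df_new = df1.merge(df2, how='left', left_on='name_1', right_on='name_2')
# Turn the empty strings into NaN, then drop every incomplete row
df_new = df_new.replace('', np.nan)
df_new.dropna(how='any', inplace=True)
df_new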

Output

   age_1    name_1  age_2   name_2
0   18.0    Fred    45.0    Fred
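
In the approach above the empty strings match each other during the merge, so the intermediate frame holds extra all-blank rows that are only removed afterwards. A minimal variation, assuming the blanks are always exactly '', filters them out before merging instead. The names left, right and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age_1': [18, '', 12, ''], 'name_1': ['Fred', '', 'Harry', ''],
                   'age_2': ['', 34, '', 45], 'name_2': ['', 'Jim', '', 'Fred']})

# Keep only the rows that carry a real name on each side
left = df.loc[df['name_1'] != '', ['age_1', 'name_1']]
right = df.loc[df['name_2'] != '', ['age_2', 'name_2']]

# Inner merge: only names present on both sides survive
result = left.merge(right, left_on='name_1', right_on='name_2')
print(result)
```

Because the ages are stored next to empty strings, those columns have object dtype here; converting them with pd.to_numeric is an option if numeric columns are needed.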

3 answers

solution1
4 ACCEPTED 2018-07-24 14:19:05

solution2
0 2018-07-24 14:12:24

solution3
0 2018-07-24 14:34:12
