Adding a new column to a Pandas DataFrame by comparing two columns to two similar columns in a different DataFrame

Question

I want to add a new column to a Pandas DataFrame by taking the values in two of its columns and checking whether that pair of values appears together, in the same row, in a different DataFrame.

Example:

import pandas as pd

first_names = pd.Series(['john', 'jack', 'jean', 'jose'])
last_names = pd.Series(['bob', 'steve', 'carl', 'anthony'])

names1 = pd.DataFrame({'firstname': first_names, 'lastname': last_names})
names2 = pd.DataFrame({'firstname': first_names, 'lastname': ['bob', 'steve', 'carl', 'joshua']})

names1:

  firstname lastname
0      john      bob
1      jack    steve
2      jean     carl
3      jose  anthony

names2:

  firstname lastname
0      john      bob
1      jack    steve
2      jean     carl
3      jose   joshua

I want to add the column 'real' to names2 and fill it with True if the firstname and lastname combination is in names1, and False otherwise.

Here's my attempt:

def verify(first,last):
  if names1.loc[ (names1['firstname'].str.contains(first)) & (names1['lastname'].str.contains(last)) , ['firstname','lastname'] ].empty:
    return False
  else:
    return True

names2['real'] = verify(names2['firstname'], names2['lastname'])

I get the frustrating error "TypeError: 'Series' objects are mutable, thus they cannot be hashed", and it seems to be thrown at the following line inside the function verify:

names1.loc[ (names1['firstname'].str.contains(first)) & (names1['lastname'].str.contains(last)), ['firstname','lastname'] ].empty:

although it works fine when the function is called with direct values:

verify('jose','anthony')

returns True

which makes me think the values are not being passed as strings.

How do I pass the values correctly to the above function? And is there a more straightforward way to accomplish the comparison?

EDIT: I forgot to mention that the sizes of the DataFrames in my case don't match. The DataFrame names2 has more rows than names1, with names1 holding the lookup data and acting as the reference for checking real/fake first and last name combinations.

Answer 1

You can construct the "real" column using a cross product between the two DataFrames and then merging the result back to names1 (how="cross" requires pandas 1.2 or later):

tmp = names1.merge(names2, how="cross")
tmp["real"] = (tmp["firstname_x"] == tmp["firstname_y"]) & (
    tmp["lastname_x"] == tmp["lastname_y"]
)
df_out = names1.merge(
    tmp[tmp["real"] == True],
    left_on=["firstname", "lastname"],
    right_on=["firstname_x", "lastname_x"],
    how="left",
).fillna(False)[["firstname", "lastname", "real"]]
print(df_out)

Prints:

  firstname lastname   real
0      john      bob   True
1      jack    steve   True
2      jean     carl   True
3      jose  anthony  False
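
Note that the code above attaches "real" to names1, while the question asks for the column on names2, where names2 may have more rows than the lookup table. As a possible alternative (a minimal sketch, not part of the original answer), a left merge with indicator=True marks each row of names2 according to whether its (firstname, lastname) pair was found in names1. The extra row 'jane doe' is a hypothetical addition, included only to mimic the mismatched sizes mentioned in the EDIT.

```python
import pandas as pd

# Lookup table (reference data), as in the question
names1 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose"],
    "lastname": ["bob", "steve", "carl", "anthony"],
})

# Table to check; the last row is a hypothetical extra so that
# names2 has more rows than names1
names2 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose", "jane"],
    "lastname": ["bob", "steve", "carl", "joshua", "doe"],
})

keys = ["firstname", "lastname"]

# Drop duplicate pairs from the lookup so the left merge cannot
# multiply rows of names2
lookup = names1[keys].drop_duplicates()

# indicator=True adds a "_merge" column: "both" means the pair exists
# in the lookup, "left_only" means it was found only in names2
merged = names2.merge(lookup, on=keys, how="left", indicator=True)

# The left merge keeps the row order of names2, so the flags can be
# assigned back positionally
names2["real"] = (merged["_merge"] == "both").to_numpy()

print(names2)
```

This also avoids the TypeError from the question: verify was receiving whole Series where str.contains expects a single string pattern. A merge compares the pairs for exact equality instead of substring matches, which is probably closer to what a lookup of real names needs.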

solution1
0 ACCEPTED 2021-06-07 17:55:51
