I want to add a new column to a pandas DataFrame by comparing the values in two of its columns against the corresponding pair of columns in a different DataFrame.
Example:
import pandas as pd

first_names = pd.Series(['john', 'jack', 'jean', 'jose'])
last_names = pd.Series(['bob', 'steve', 'carl', 'anthony'])
names1 = pd.DataFrame({'firstname': first_names, 'lastname': last_names})
names2 = pd.DataFrame({'firstname': first_names, 'lastname': ['bob', 'steve', 'carl', 'joshua']})
names1:
  firstname lastname
0      john      bob
1      jack    steve
2      jean     carl
3      jose  anthony
names2:
  firstname lastname
0      john      bob
1      jack    steve
2      jean     carl
3      jose   joshua
I want to add the column 'real' to names2 and fill it with True if the firstname and lastname combination is in names1, and False otherwise.
Here's my attempt:
def verify(first, last):
    if names1.loc[ (names1['firstname'].str.contains(first)) & (names1['lastname'].str.contains(last)) , ['firstname','lastname'] ].empty:
        return False
    else:
        return True
names2['real'] = verify(names2['firstname'], names2['lastname'])
I get the frustrating error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
and it seems to be thrown at the following line inside the function verify :
names1.loc[ (names1['firstname'].str.contains(first)) & (names1['lastname'].str.contains(last)), ['firstname','lastname'] ].empty:
although it works fine when the function is called with direct values:
verify('jose','anthony')
returns True, which makes me think the values are not being passed as strings.
How do I pass the values correctly to the above function? And is there a more straightforward way to accomplish the comparison?
EDIT: I forgot to mention that the sizes of the dataframes in my case don't match. The dataframe names2 has more rows than names1, with names1 holding the lookup data and acting as the reference for checking real/fake first and last name combinations.
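On the error itself: verify is being handed two whole Series, so str.contains receives a Series where it expects a single string pattern, which is most likely what triggers the hashing error. A minimal sketch of one way to pass scalar values instead is to apply the function row by row. It assumes an exact match is what is wanted, so it swaps str.contains (substring/regex matching) for plain equality; that swap is an assumption, not part of the original attempt.

```python
import pandas as pd

names1 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose"],
    "lastname": ["bob", "steve", "carl", "anthony"],
})
names2 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose"],
    "lastname": ["bob", "steve", "carl", "joshua"],
})

def verify(first, last):
    # Exact equality (an assumption); str.contains would also accept substrings
    match = (names1["firstname"] == first) & (names1["lastname"] == last)
    return bool(match.any())

# axis=1 hands the lambda one row at a time, so first/last are plain strings
names2["real"] = names2.apply(
    lambda row: verify(row["firstname"], row["lastname"]), axis=1
)
print(names2)
```

This scans names1 once per row of names2, so it is fine for small frames but can get slow for large ones.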
You can construct the "real" column using a cross product between the two dataframes and then merge back to names1 (how="cross" requires pandas 1.2 or newer):
# Pair every row of names1 (_x columns) with every row of names2 (_y columns)
tmp = names1.merge(names2, how="cross")
tmp["real"] = (tmp["firstname_x"] == tmp["firstname_y"]) & (
    tmp["lastname_x"] == tmp["lastname_y"]
)
# Keep only the matching pairs and attach them back to names1
df_out = names1.merge(
    tmp[tmp["real"]],
    left_on=["firstname", "lastname"],
    right_on=["firstname_x", "lastname_x"],
    how="left",
).fillna(False)[["firstname", "lastname", "real"]]
print(df_out)
Prints:
  firstname lastname   real
0      john      bob   True
1      jack    steve   True
2      jean     carl   True
3      jose  anthony  False
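Note that the output above lists the rows of names1, while the question asks for the flag on names2, which may also have more rows than names1. A possible sketch that flags names2 directly uses a left merge with indicator=True, avoiding the cross product. The extra row ('jane', 'doe') is hypothetical and only there to make names2 longer than names1; the MultiIndex variant at the end is an alternative shown for comparison.

```python
import pandas as pd

names1 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose"],
    "lastname": ["bob", "steve", "carl", "anthony"],
})
# Hypothetical data: one extra row so that names2 is longer than names1
names2 = pd.DataFrame({
    "firstname": ["john", "jack", "jean", "jose", "jane"],
    "lastname": ["bob", "steve", "carl", "joshua", "doe"],
})

keys = ["firstname", "lastname"]

# De-duplicate the lookup so the left merge cannot multiply rows of names2
lookup = names1[keys].drop_duplicates()

# indicator=True adds a "_merge" column: "both" means the pair exists in names1
merged = names2.merge(lookup, on=keys, how="left", indicator=True)
names2["real"] = (merged["_merge"] == "both").to_numpy()

# Alternative: membership test on (firstname, lastname) tuples
names2["real_alt"] = pd.MultiIndex.from_frame(names2[keys]).isin(
    pd.MultiIndex.from_frame(names1[keys])
)
print(names2)
```

With this data both columns come out True for the first three rows and False for the last two. Both variants compare whole strings exactly, so differences in case or surrounding whitespace count as mismatches.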