I have a DataFrame in Python pandas like the one below:
import pandas as pd
import re
df = pd.DataFrame()
df["ADRESAT"] = ["Kowal Jan", "Nowak Adam PHU"]
df["NADAWCA"] = ["Jan Kowal", "Adam Nowak"]
And I created 2 new columns:
col1 - the value from column "NADAWCA" that also appears in column "ADRESAT"
col2 - the rest of the value (the words from column "ADRESAT" beyond those that are also in column "NADAWCA")
df["col2"] = df.apply(lambda r: re.sub(r["NADAWCA"], '', r["ADRESAT"], flags=re.IGNORECASE).strip(), axis=1)
df["col1"] = df["NADAWCA"].str.title()
Nevertheless, as a result I get a df like the one below, and as you can see, the second row is wrong.
My question: how can I modify my code so that it recognizes "Adam Nowak" and "Nowak Adam" as the same value?
I need the result below:
Since the word order matters, using a set is not possible, so we need to check each word one by one:
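# Columns are read by label (x['ADRESAT'], x['NADAWCA']); positional x[0] / x[1]
# on a label-indexed Series is deprecated in recent pandas versions.
intersection = lambda x: ' '.join([w for w in x['NADAWCA'].split()
                                   if w.lower() in x['ADRESAT'].lower().split()])
difference = lambda x: ' '.join([w for w in x['ADRESAT'].split()
                                 if w.lower() not in x['NADAWCA'].lower().split()])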
df['col1'] = df[['ADRESAT', 'NADAWCA']].apply(intersection, axis='columns')
df['col2'] = df[['ADRESAT', 'NADAWCA']].apply(difference, axis='columns')
>>> df
          ADRESAT     NADAWCA        col1 col2
0       Kowal Jan   Jan Kowal   Jan Kowal
1  Nowak Adam PHU  Adam Nowak  Adam Nowak  PHU
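If the lambdas are hard to read, the same word-by-word check can be sketched as a plain function. This is only an illustration: the name `split_common_words` is made up here and is not a pandas API, and it assumes both values are whitespace-separated strings that should be compared case-insensitively.

```python
from typing import Tuple


def split_common_words(adresat: str, nadawca: str) -> Tuple[str, str]:
    """Hypothetical helper: return (col1, col2) for one row.

    col1 keeps the NADAWCA words that also occur in ADRESAT (in NADAWCA order).
    col2 keeps the ADRESAT words that do not occur in NADAWCA (in ADRESAT order).
    """
    # Lowercased sets are used only for membership tests; the output order
    # always comes from iterating over the original word lists.
    adresat_words = {w.lower() for w in adresat.split()}
    nadawca_words = {w.lower() for w in nadawca.split()}
    col1 = " ".join(w for w in nadawca.split() if w.lower() in adresat_words)
    col2 = " ".join(w for w in adresat.split() if w.lower() not in nadawca_words)
    return col1, col2


# The two rows from the question, as (ADRESAT, NADAWCA) pairs.
rows = [("Kowal Jan", "Jan Kowal"), ("Nowak Adam PHU", "Adam Nowak")]
results = [split_common_words(a, n) for a, n in rows]
print(results)  # [('Jan Kowal', ''), ('Adam Nowak', 'PHU')]
```

On the DataFrame, the function could be called once per row, for example by zipping the "ADRESAT" and "NADAWCA" columns and assigning the two parts of each result to col1 and col2.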