How to map rows of two different dataframes based on a condition in pandas

I have two dataframes,

df1,

 Names
 one two three
 Sri is a good player
 Ravi is a mentor
 Kumar is a cricketer player

df2,

 values
 sri
 NaN
 sri, is
 kumar,cricketer player

For each row in df2, I am trying to get the row in df1 that contains all of the items in that df2 row.

My expected output is,

 values                  Names
 sri                     Sri is a good player
 NaN
 sri, is                 Sri is a good player
 kumar,cricketer player  Kumar is a cricketer player

I tried df1["Names"].str.contains("|".join(df2["values"].values.tolist())), but I cannot get my expected output because the values contain commas (","). Please help.
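To make the problem reproducible, here is a minimal sketch that rebuilds the two frames from the samples above (the construction is an assumption, and the "one two three" line is left out) and shows why the joined pattern falls short. The NaN has to be dropped first, because str.join only accepts strings.

```python
import numpy as np
import pandas as pd

# Sample frames reconstructed from the question (an assumption about the real data)
df1 = pd.DataFrame({"Names": ["Sri is a good player",
                              "Ravi is a mentor",
                              "Kumar is a cricketer player"]})
df2 = pd.DataFrame({"values": ["sri", np.nan, "sri, is", "kumar,cricketer player"]})

# Drop NaN before joining; "|" means "any of these alternatives", not "all words"
pattern = "|".join(df2["values"].dropna())

# Each alternative is matched as a literal substring, commas included
mask = df1["Names"].str.contains(pattern, case=False)
print(mask.tolist())
```

The Kumar row is not matched because "kumar,cricketer player" never appears as one contiguous substring, so the words have to be compared individually.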

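Use set logic with NumPy broadcasting: turn each string into a set of lowercase words, then compare every df2 set against every df1 set with the superset operator (>=).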

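import numpy as np
import pandas as pd

# Turn each string into a set of lowercase words (NaN becomes {''}, which never matches)
d1 = df1['Names'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values
d2 = df2['values'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values

# i indexes the df2 rows, j the df1 rows whose word set contains all of their words
i, j = np.where(d1 >= d2[:, None])

df2.assign(Names=pd.Series(df1['Names'].values[j], df2['values'].index[i]))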

                   values                        Names
0                     sri         Sri is a good player
1                     NaN                          NaN
2                 sri, is         Sri is a good player
3  kumar,cricketer player  Kumar is a cricketer player
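The snippet above relies on each df2 row matching at most one df1 row; with several matches the new Series would have a duplicated index. The following is a sketch of one way to handle that, assuming it is acceptable to keep only the first match. The sample frames and the helper name to_word_sets are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frames reconstructed from the question (an assumption about the real data)
df1 = pd.DataFrame({"Names": ["Sri is a good player",
                              "Ravi is a mentor",
                              "Kumar is a cricketer player"]})
df2 = pd.DataFrame({"values": ["sri", np.nan, "sri, is", "kumar,cricketer player"]})

def to_word_sets(series):
    """Hypothetical helper: object array holding one set of lowercase words per row."""
    return series.fillna("").str.lower().str.split("[^a-z]+").apply(set).to_numpy()

d1 = to_word_sets(df1["Names"])
d2 = to_word_sets(df2["values"])

# Broadcasting (4, 1) against (3,) applies the set operator >= to every pair
contains = d1 >= d2[:, None]          # shape (len(df2), len(df1))
i, j = np.where(contains)             # pairs come back in row-major order

matches = pd.Series(df1["Names"].to_numpy()[j], index=df2.index[i])
# Keep only the first df1 match for each df2 row
matches = matches[~matches.index.duplicated(keep="first")]

result = df2.assign(Names=matches)
print(result)
```

Rows of df2 without any match, including the NaN row, are simply absent from matches, so assign fills them with NaN.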

Try this:

import pandas as pd

df1 = pd.read_csv('sample.csv')
df2 = pd.read_csv('sample_2.csv')

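# Lowercase both columns so matching is case-insensitive
df2['values'] = df2['values'].str.lower()
df1['names'] = df1['names'].str.lower()

# Replace punctuation with spaces, then collapse runs of whitespace
# (regex=True is needed: pandas 2.0+ treats the pattern as a literal string by default)
df2['values'] = df2['values'].str.replace(r'[^\w\s]', ' ', regex=True)
df2['values'] = df2['values'].replace(r'\s+', ' ', regex=True)

df1['names'] = df1['names'].str.replace(r'[^\w\s]', ' ', regex=True)
df1['names'] = df1['names'].replace(r'\s+', ' ', regex=True)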

df2['list_values'] = df2['values'].apply(lambda x: str(x).split())
df1['list_names'] = df1['names'].apply(lambda x: str(x).split())

list_names = df1['list_names'].tolist()

def check_names(x, list_names):
    # Return the first name whose words include every word in x, or '' if none does
    output = ''
    for list_name in list_names:
        if set(list_name) >= set(x):
            output = ' '.join(list_name)
            break
    return output

df2['Names'] = df2['list_values'].apply(lambda x: check_names(x, list_names))
print(df2)

Output

                   values                        Names
0                     sri         sri is a good player
1                     NaN                             
2                  sri is         sri is a good player
3  kumar cricketer player  kumar is a cricketer player

Explanation

This is a loose (fuzzy) matching problem: a df1 row matches when it contains every word of the df2 value. These are the steps I applied:

  1. Lowercase everything so that matching is case-insensitive.
  2. Remove punctuation from both dataframes.
  3. Split each string into a list of words.
  4. Match the word lists with the check_names() function to get the desired output.
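The steps above can also be sketched without CSV files. This is a minimal variant under stated assumptions: the sample frames are rebuilt from the question, tokenize and first_match are hypothetical helper names, and the original (not lowercased) name is returned for a match.

```python
import re

import numpy as np
import pandas as pd

# Sample frames reconstructed from the question (an assumption about the real data)
df1 = pd.DataFrame({"Names": ["Sri is a good player",
                              "Ravi is a mentor",
                              "Kumar is a cricketer player"]})
df2 = pd.DataFrame({"values": ["sri", np.nan, "sri, is", "kumar,cricketer player"]})

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    if pd.isna(text):
        return []
    return re.sub(r"[^\w\s]", " ", str(text).lower()).split()

# Pre-compute the word set of every name once, keeping the original string
name_tokens = [(name, set(tokenize(name))) for name in df1["Names"]]

def first_match(value):
    """Return the first name containing every word of value, or '' if none does."""
    wanted = set(tokenize(value))
    if not wanted:          # NaN or empty values never match
        return ""
    for name, words in name_tokens:
        if words >= wanted:
            return name
    return ""

result = df2.assign(Names=df2["values"].apply(first_match))
print(result)
```

Returning early for an empty word set matters because the empty set is a subset of every set and would otherwise match the first name.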

