
String matching and assignment between data frames

I have two dataframes

(1st Dataframe)
**Sentences**
hello world
live in the world
haystack in the needle

(2nd Dataframe in descending order by Weight)
**Words**    **Weight**
world          80
hello          60
haystack       40
needle         20

For each sentence in the 1st dataframe, I want to check whether any of its words appears in the 2nd dataframe and, among the matches, select the word with the highest weight. I then want to assign that word to the sentence in the 1st dataframe. So the result should be:

**Sentence**                **Assigned Word**
hello world                   world
live in the world             world
haystack in the needle        haystack

I thought of using two for loops, but that could be slow with millions of sentences or words. What is the best way to do this in Python? Thanks!

Cartesian Product --> Filter --> Sort --> groupby.head(1)

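This method takes a few steps, but it's the most pandas-esque approach I could think of. Note that the Cartesian product has len(df1) * len(df2) rows, so memory use grows quickly with large inputs.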

import pandas as pd
import numpy as np

list1 = ['hello world',
'live in the world',
'haystack in the needle']

list2 = [['world',80],
        ['hello',60],
        ['haystack',40],
        ['needle',20]]

df1 = pd.DataFrame(list1,columns=['Sentences'])
df2 = pd.DataFrame(list2,columns=['Words','Weight'])


# Creating a new column `Word_List` 
df1['Word_List'] = df1['Sentences'].apply(lambda x : x.split(' '))

# Need a common key for cartesian product
df1['common_key'] = 1
df2['common_key'] = 1

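# Cartesian Product: every sentence paired with every word
# (pandas 1.2+ can also do this with how='cross', without the helper key)
df3 = pd.merge(df1,df2,on='common_key')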

# Filtering only words that matched
df3['Match'] = df3.apply(lambda x : x['Words'] in x['Word_List'] ,axis=1)
df3 = df3[df3['Match']]

# Sorting by sentence, then weight, both descending, so the heaviest word
# comes first within each sentence (reassign rather than sort a slice in place)
df3 = df3.sort_values(['Sentences','Weight'],ascending=False)

# Keeping only the first (highest-weight) row in each group
final_df = df3.groupby('Sentences').head(1).reset_index(drop=True)[['Sentences','Words']]
final_df

Output:

                Sentences     Words
0       live in the world     world
1             hello world     world
2  haystack in the needle  haystack

Performance: 10 loops, best of 3: 41.5 ms per loop
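
If the Cartesian product gets too large, one possible alternative is to split each sentence into one row per word and join only on the words themselves. This is a different technique from the one above, not a tuned version of it, and I have not benchmarked it. It is a minimal sketch that assumes pandas 0.25 or later (for `explode`) and that words are separated by whitespace; the `Assigned Word` column name is taken from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'Sentences': ['hello world',
                                  'live in the world',
                                  'haystack in the needle']})
df2 = pd.DataFrame({'Words': ['world', 'hello', 'haystack', 'needle'],
                    'Weight': [80, 60, 40, 20]})

# One row per (sentence, word) pair; str.split() with no argument
# splits on any whitespace
tokens = (df1.assign(Words=df1['Sentences'].str.split())
             .explode('Words'))

# Inner join keeps only the words that are listed in df2
matched = tokens.merge(df2, on='Words')

# Sort by weight and keep the first (heaviest) match per sentence
best = (matched.sort_values('Weight', ascending=False)
               .drop_duplicates('Sentences')
               .rename(columns={'Words': 'Assigned Word'})
               [['Sentences', 'Assigned Word']])

# Left join back so sentences with no matching word are kept (as NaN)
result = df1.merge(best, on='Sentences', how='left')
print(result)
```

The intermediate frame here has one row per word occurrence rather than one row per sentence-word combination, so its size depends on the total number of words in the sentences, not on len(df1) * len(df2).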


 