
String matching and assignment between data frames

I have two dataframes:

(1st Dataframe)
**Sentences**
hello world
live in the world
haystack in the needle

(2nd Dataframe in descending order by Weight)
**Words**    **Weight**
world          80
hello          60
haystack       40
needle         20

I want to check each sentence in the 1st dataframe to see whether any word in the sentence matches a word listed in the 2nd dataframe, and then select the matching word with the highest weight. I will then assign that highest-weight word to the 1st dataframe. So the result should be:

**Sentence**                **Assigned Word**
hello world                   world
live in the world             world
haystack in the needle        haystack

I thought of using two for loops, but the performance could be slow with millions of sentences or words. What is the best way to do this in Python? Thanks!
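A minimal sketch of that two-loop baseline (assuming the first dataframe is df1 with a Sentences column and the second is df2 with Words and Weight columns; the function name is illustrative) could look like this:

# Naive baseline: for each sentence, scan the weight table from highest to
# lowest weight and take the first word that appears in the sentence.
def assign_word_naive(df1, df2):
    words_by_weight = df2.sort_values('Weight', ascending=False)['Words']
    assigned = []
    for sentence in df1['Sentences']:
        tokens = set(sentence.split(' '))
        match = None
        for word in words_by_weight:
            if word in tokens:
                match = word        # highest-weight match found, stop scanning
                break
        assigned.append(match)
    df1['Assigned Word'] = assigned
    return df1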

Cartesian Product --> Filter --> Sort --> groupby.head(1)

This method involves a few steps, but it's the best pandas-esque method I could think of.

import pandas as pd
import numpy as np

list1 = ['hello world',
'live in the world',
'haystack in the needle']

list2 = [['world',80],
        ['hello',60],
        ['haystack',40],
        ['needle',20]]

df1 = pd.DataFrame(list1,columns=['Sentences'])
df2 = pd.DataFrame(list2,columns=['Words','Weight'])


# Creating a new column `Word_List` 
df1['Word_List'] = df1['Sentences'].apply(lambda x : x.split(' '))

# Need a common key for cartesian product
df1['common_key'] = 1
df2['common_key'] = 1

# Cartesian Product
df3 = pd.merge(df1,df2,on='common_key',copy=False)

# Filtering only words that matched
df3['Match'] = df3.apply(lambda x : x['Words'] in x['Word_List'] ,axis=1)
df3 = df3[df3['Match']]

# Sorting values by sentences and weight
df3 = df3.sort_values(['Sentences','Weight'], ascending=False)

# Keeping only the first element in each group
final_df = df3.groupby('Sentences').head(1).reset_index()[['Sentences','Words']]
final_df

Output:

                Sentences     Words
0       live in the world     world
1             hello world     world
2  haystack in the needle  haystack

Performance: 10 loops, best of 3: 41.5 ms per loop
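If the cartesian product grows too large for millions of sentences, one alternative sketch (not part of the answer above, and assuming the same df1/df2 column names) is to build a word-to-weight dictionary and pick the best match per sentence:

# Build a word -> weight lookup once, then pick the highest-weight word
# that actually occurs in each sentence (None if nothing matches).
weights = dict(zip(df2['Words'], df2['Weight']))

def best_word(sentence):
    candidates = [w for w in sentence.split(' ') if w in weights]
    return max(candidates, key=weights.get) if candidates else None

df1['Assigned Word'] = df1['Sentences'].apply(best_word)

This avoids materializing the full len(df1) x len(df2) merge, at the cost of doing the inner lookup in plain Python rather than in pandas.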
