String matching and assignment between dataframes

Question

I have two dataframes:

(1st Dataframe)
**Sentences**
hello world
live in the world
haystack in the needle

(2nd Dataframe in descending order by Weight)
**Words**    **Weight**
world          80
hello          60
haystack       40
needle         20

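I want to check each sentence in the first dataframe. If any word in the sentence matches a word listed in the second dataframe, I pick the matching word with the highest weight and assign it to that sentence in the first dataframe. The result should be: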

**Sentence**                **Assigned Word**
hello world                   world
live in the world             world
haystack in the needle        haystack

I considered using two for loops, but with millions of sentences or words that could be very slow. What is the best way to do this in Python? Thanks!

Answer 1

Cartesian product -> filter -> sort -> `groupby.head(1)`

This approach takes several steps, but it is the most pandas-idiomatic way I could come up with.

import pandas as pd
import numpy as np

list1 = ['hello world',
'live in the world',
'haystack in the needle']

list2 = [['world',80],
        ['hello',60],
        ['haystack',40],
        ['needle',20]]

df1 = pd.DataFrame(list1,columns=['Sentences'])
df2 = pd.DataFrame(list2,columns=['Words','Weight'])


# Creating a new column `Word_List` 
df1['Word_List'] = df1['Sentences'].apply(lambda x : x.split(' '))

# Need a common key for cartesian product
df1['common_key'] = 1
df2['common_key'] = 1

# Cartesian Product
df3 = pd.merge(df1,df2,on='common_key')

# Filtering only words that matched
df3['Match'] = df3.apply(lambda x : x['Words'] in x['Word_List'] ,axis=1)
df3 = df3[df3['Match']]

# Sorting values by sentences and weight
# (reassigning instead of inplace=True avoids a SettingWithCopyWarning on the filtered frame)
df3 = df3.sort_values(['Sentences','Weight'],ascending=False)

# Keeping only the first element in each group
final_df = df3.groupby('Sentences').head(1).reset_index()[['Sentences','Words']]
final_df

Output:

                Sentences     Words
0       live in the world     world
1             hello world     world
2  haystack in the needle  haystack

Performance: 10 loops, best of 3: 41.5 ms per loop
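
The Cartesian product above builds one row per (sentence, word) pair, which may get large when there are millions of sentences or words. As a possible alternative, and not part of the original answer, here is a minimal sketch that splits each sentence into tokens and joins only those tokens against the word table. It assumes pandas 0.25 or later (for `Series.explode`), matching on whole whitespace-separated tokens, and a default integer index on `df1`; the column name `Assigned Word` is taken from the question, and the helper names `tokens`, `matched` and `best` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Sentences': ['hello world',
                                  'live in the world',
                                  'haystack in the needle']})
df2 = pd.DataFrame({'Words': ['world', 'hello', 'haystack', 'needle'],
                    'Weight': [80, 60, 40, 20]})

# One row per (sentence, token); the original row label ends up in column 'index'
tokens = (df1['Sentences'].str.split()
          .explode()
          .rename('Words')
          .reset_index())

# Inner join keeps only the tokens that are listed in df2
matched = tokens.merge(df2, on='Words')

# For each original sentence, keep the matched row with the highest weight
best = matched.loc[matched.groupby('index')['Weight'].idxmax()]

# Assignment aligns on the index; sentences without any match get NaN
df1['Assigned Word'] = best.set_index('index')['Words']
print(df1)
```

The intermediate frame here has one row per token rather than one row per sentence-word pair, so its size grows with the total number of tokens instead of with the product of the two table sizes. Whether that is faster in practice depends on the data and has not been benchmarked here.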
