
How to compare two column words values from two dataframes, and create a new column containing matching/contained words?

**** Updated question ****

I have these two dataframes:

DF1

id type 
0  "car"
1  "travel"
2  "sport"
3  "Cleaning-bike"
4  "Build house"
5  "test 32 sport foot"

DF2

sentence_id sentence
0           "I love cars"
1           "I don't like traveling"
2           "I don't do sport and travel"
3           "I am on vacation"
4           "My bik needs more attention"
5           "I want a house"
6           "I know a lot about football"

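I want to look at every row of the type column in DF1 and, if that word appears in any row of DF2, create a new column "category" in DF2 containing the word(s) that appear in that row (a list such as ['sport', 'travel'] if there are multiple matches), and 'no match' if none of the words from DF1 is found.

How can I do this without looping and calling the contains() method for each row? DF1 has thousands of rows and DF2 has millions.

Sometimes a type in DF1 does not match the sentence exactly, but one of the words of the type is contained in the sentence (e.g. "foot" in the sentence with id 6), and sometimes the sentence contains only part of a word of the type (e.g. 'bik').

Expected output: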

sentence_id     sentence                        category
    0           'I love cars'                   'car'
    1           "I don't like traveling"        'travel'
    2           "I don't do sport and travel"   ['sport', 'travel']
    3           'I am on vacation'              'no match'
    4           "My bik needs more attention"  "Cleaning-bike"
    5           "I want a house"                "Build house"
    6           "I know a lot about football"   "test 32 sport foot"

You can use Series.str.findall to get a list of all matches for each sentence:

# Join all types into one alternation pattern and collect every match per sentence
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'")))
print(DF2)
   sentence_id                       sentence         category
0            0                  'I love cars'            [car]
1            1       'I don't like traveling'         [travel]
2            2  'I don't do sport and travel'  [sport, travel]
3            3             'I am on vacation'               []
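
One caveat with joining the types directly into a pattern is that any regex metacharacter in a type (for example `+` or `.`) is interpreted as regex syntax. The following is a minimal sketch, not part of the original answer, that escapes each type with `re.escape` and puts longer alternatives first; the type "c++" and the sentence "I write c++ daily" are made-up values used only to show the escaping.

```python
import re
import pandas as pd

# Small stand-ins for DF1 and DF2; "c++" is a hypothetical type added
# only to demonstrate why escaping matters.
DF1 = pd.DataFrame({"type": ["car", "travel", "sport", "c++"]})
DF2 = pd.DataFrame({"sentence": [
    "I love cars",
    "I don't do sport and travel",
    "I write c++ daily",
    "I am on vacation",
]})

# Escape each type so characters such as + are matched literally, and
# sort longer types first so they are tried before shorter ones.
types = sorted(DF1["type"].drop_duplicates().tolist(), key=len, reverse=True)
pat = "|".join(re.escape(t) for t in types)

DF2["category"] = DF2["sentence"].str.findall(pat)
print(DF2)
```

Without the escaping, the unescaped pattern fragment `c++` would not be a valid regular expression at all.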

If you also need a scalar when the list has length 1, and a custom string when the list is empty, add a custom function:

# One match -> scalar, no match -> 'no match', several matches -> keep the list
f = lambda x: x[0] if len(x) == 1 else 'no match' if len(x) == 0 else x
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'"))).apply(f)

print(DF2)
   sentence_id                       sentence         category
0            0                  'I love cars'              car
1            1       'I don't like traveling'           travel
2            2  'I don't do sport and travel'  [sport, travel]
3            3             'I am on vacation'         no match
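
The chained conditional expression in the lambda can be hard to read. As an illustration only, here is an equivalent named function (the name `collapse` is hypothetical) applied to plain lists, covering the three cases one by one:

```python
def collapse(matches):
    """Return 'no match' for an empty list, the single item for a
    one-element list, and the list itself otherwise."""
    if not matches:
        return "no match"
    if len(matches) == 1:
        return matches[0]
    return matches

single = collapse(["car"])
empty = collapse([])
several = collapse(["sport", "travel"])
print(single, empty, several)
```

Note that the resulting column mixes strings and lists, which matches the expected output in the question but means later code has to handle both types.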

EDIT: Build a dictionary from DF1['type'] by splitting on - or whitespace, and use it in the custom function to map each match back to its type:

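import pandas as pd

s = DF1['type'].str.strip("'")

# Index = full type, values = its lower-cased tokens (split on '-' or whitespace)
s = pd.Series(s.to_numpy(), index=s).str.split(r'-|\s+').explode().str.lower()
# Token -> full type; a token shared by several types keeps the last one
d = {v: k for k, v in s.items()}
print(d)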
{'car': 'car', 
 'travel': 'travel', 
 'sport': 'sport',
 'cleaning': 'Cleaning-bike',
 'bike': 'Cleaning-bike',
 'build': 'Build house', 
 'house': 'Build house'}

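# Alternation of every lower-cased token taken from DF1['type']
pat = '|'.join(s)

def f(x):
    # Map each matched token back to the full type it came from
    out = [d.get(y, y) for y in x]
    if len(out) == 1:
        return out[0]
    elif not x:
        return 'no match'
    else:
        return out

DF2['category'] = DF2['sentence'].str.findall(pat).apply(f)
print(DF2)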
   sentence_id                        sentence         category
0            0                   'I love cars'              car
1            1        'I don't like traveling'           travel
2            2   'I don't do sport and travel'  [sport, travel]
3            3              'I am on vacation'         no match
4            4  'My bike needs more attention'    Cleaning-bike
5            5                'I want a house'      Build house
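
The output above was produced before "test 32 sport foot" was added to DF1. With that row, the token "sport" belongs to two types, and a plain dict comprehension keeps only the last one. The sketch below is one possible variant, not the answerer's code: it keeps the first type seen for each token via `setdefault`, lower-cases the sentence before matching, and compiles the pattern once. The names `token_to_type` and `categorize` are hypothetical.

```python
import re
import pandas as pd

DF1 = pd.DataFrame({"type": ["car", "travel", "sport", "Cleaning-bike",
                             "Build house", "test 32 sport foot"]})
DF2 = pd.DataFrame({"sentence": [
    "I love cars",
    "I don't like traveling",
    "I don't do sport and travel",
    "I am on vacation",
    "My bik needs more attention",
    "I want a house",
    "I know a lot about football",
]})

# token -> type; setdefault keeps the FIRST type seen for a token, so
# "sport" stays mapped to "sport" rather than "test 32 sport foot".
token_to_type = {}
for type_name in DF1["type"]:
    for token in re.split(r"-|\s+", type_name.lower()):
        if token:
            token_to_type.setdefault(token, type_name)

# Longest tokens first, each escaped, compiled once for all sentences.
tokens = sorted(token_to_type, key=len, reverse=True)
pat = re.compile("|".join(re.escape(t) for t in tokens))

def categorize(sentence):
    # Assumption: matching is case-insensitive, done by lower-casing.
    hits = pat.findall(sentence.lower())
    # dict.fromkeys drops duplicate types while keeping first-seen order.
    found = list(dict.fromkeys(token_to_type[h] for h in hits))
    if not found:
        return "no match"
    return found[0] if len(found) == 1 else found

DF2["category"] = DF2["sentence"].map(categorize)
print(DF2)
```

This reproduces the expected output for every sentence except "My bik needs more attention", which still yields 'no match': a truncated word such as "bik" is not a substring match for the token "bike", so handling it would need some form of fuzzy matching that this sketch does not attempt.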
