How to compare the words in two columns from two dataframes, and create a new column containing the matching/contained words?
**** Updated question ****
I have these two dataframes:
DF1
id type
0 "car"
1 "travel"
2 "sport"
3 "Cleaning-bike"
4 "Build house"
5 "test 32 sport foot"
DF2
sentence_id sentence
0 "I love cars"
1 "I don't like traveling"
2 "I don't do sport and travel"
3 "I am on vacation"
4 "My bik needs more attention"
5 "I want a house"
6 "I know a lot about football"
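I want to go through every row of the type column in DF1 and, if that word appears in any row of DF2, create a new column "category" in DF2 containing the word(s) that appear in that row (a list if several match, e.g. ['sport', 'travel']), or 'no match' if nothing from DF1 is found.
How can I do this without looping and calling contains() for each type? DF1 has thousands of rows and DF2 has millions.
Sometimes the DF1 type does not match the sentence exactly, but one of the words of the type is contained in the sentence (e.g. "foot" in the sentence with id 6), and sometimes the sentence does not contain the whole word of a type (e.g. 'bik').
Expected output: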
sentence_id sentence category
0 'I love cars' 'car'
1 "I don't like traveling" 'travel'
2 "I don't do sport and travel" ['sport', 'travel']
3 'I am on vacation' 'no match'
4 "My bik needs more attention" "Cleaning-bike"
5 "I want a house" "Build house"
6 "I know a lot about football" "test 32 sport foot"
You can use Series.str.findall to get all matches as lists:
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'")))
print (DF2)
sentence_id sentence category
0 0 'I love cars' [car]
1 1 'I don't like traveling' [travel]
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' []
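For readers who want to try this step in isolation, here is a minimal, self-contained sketch. The toy frames are copied from the first rows of the question. Wrapping each type in re.escape is an addition not present in the answer; it is assumed to be useful when a type contains regex metacharacters such as "." or "+".

```python
import re
import pandas as pd

# Toy data copied from the question (first rows only).
DF1 = pd.DataFrame({"id": [0, 1, 2], "type": ["car", "travel", "sport"]})
DF2 = pd.DataFrame({
    "sentence_id": [0, 1, 2, 3],
    "sentence": [
        "I love cars",
        "I don't like traveling",
        "I don't do sport and travel",
        "I am on vacation",
    ],
})

# Assumption: escape every type so it is matched literally, not as a regex.
pattern = "|".join(re.escape(t) for t in DF1["type"])

# One vectorised pass over the sentences; each row gets a list of matches.
DF2["category"] = DF2["sentence"].str.findall(pattern)
print(DF2["category"].tolist())
```

Because all types are joined into a single alternation, each sentence is scanned once rather than once per type, which is the point of avoiding a contains() loop.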
If you also need a scalar when the list has length 1, and a custom string when the list is empty, add a custom function:
f = lambda x: x[0] if len(x) == 1 else 'no match' if len(x) == 0 else x
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'"))).apply(f)
print (DF2)
sentence_id sentence category
0 0 'I love cars' car
1 1 'I don't like traveling' travel
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' no match
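The same post-processing can be written as a named function, which may be easier to read and test than the chained conditional lambda. The name collapse is hypothetical and only used for illustration; the input Series stands in for the output of str.findall.

```python
import pandas as pd

def collapse(matches):
    """Hypothetical helper: one match -> scalar, none -> 'no match', several -> list."""
    if not matches:
        return "no match"
    if len(matches) == 1:
        return matches[0]
    return matches

# Stand-in for the result of DF2['sentence'].str.findall(...)
found = pd.Series([["car"], ["travel"], ["sport", "travel"], []])
category = found.apply(collapse)
print(category.tolist())
```

Note that the resulting column mixes strings and lists in one object column. If later steps need a uniform type, keeping the plain lists from findall may be the simpler choice.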
EDIT: Build a dictionary by splitting DF1['type'] on - or whitespace, then look the matches up in it inside the custom function:
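import pandas as pd
s = DF1['type'].str.strip("'")
# index = full type, values = its lower-cased tokens (split on - or whitespace)
s = pd.Series(s.to_numpy(), index=s).str.split(r'-|\s+').explode().str.lower()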
d = {v: k for k, v in s.items()}
print (d)
{'car': 'car',
'travel': 'travel',
'sport': 'sport',
'cleaning': 'Cleaning-bike',
'bike': 'Cleaning-bike',
'build': 'Build house',
'house': 'Build house'}
pat = '|'.join(s)
def f(x):
    # map each matched token back to its full type
    out = [d.get(y, y) for y in x]
    if len(out) == 1:
        return out[0]
    elif not bool(x):
        return 'no match'
    else:
        return out
DF2['category'] = DF2['sentence'].str.findall(pat).apply(f)
print (DF2)
sentence_id sentence category
0 0 'I love cars' car
1 1 'I don't like traveling' travel
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' no match
4 4 'My bike needs more attention' Cleaning-bike
5 5 'I want a house' Build house
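Put together, the approach can be sketched end to end as follows. This is a sketch under stated assumptions, not the answer's exact code: the token_to_type and to_category names are hypothetical, the dictionary is built with a plain loop instead of explode, the first type wins when two types share a token, matching is case-insensitive, and tokens are escaped with re.escape. The data is the sample shown in the output above (with "bike" spelled in full).

```python
import re
import pandas as pd

DF1 = pd.DataFrame({"type": ["car", "travel", "sport", "Cleaning-bike", "Build house"]})
DF2 = pd.DataFrame({"sentence": [
    "I love cars",
    "I don't like traveling",
    "I don't do sport and travel",
    "I am on vacation",
    "My bike needs more attention",
    "I want a house",
]})

# Map every lower-cased token of a type back to the full type string.
token_to_type = {}
for full_type in DF1["type"]:
    for token in re.split(r"[-\s]+", full_type.lower()):
        if token:
            # Assumption: the first type that owns a token keeps it.
            token_to_type.setdefault(token, full_type)

# Longest tokens first, so a longer alternative is tried before its prefix.
tokens = sorted(token_to_type, key=len, reverse=True)
pattern = "|".join(re.escape(t) for t in tokens)

def to_category(matches):
    # Translate tokens to full types, dropping duplicates but keeping order.
    types = list(dict.fromkeys(token_to_type[m.lower()] for m in matches))
    if not types:
        return "no match"
    return types[0] if len(types) == 1 else types

DF2["category"] = (
    DF2["sentence"].str.findall(pattern, flags=re.IGNORECASE).apply(to_category)
)
print(DF2["category"].tolist())
```

This still matches only whole tokens found inside the sentence. It does not cover the case from the question where the sentence holds only part of a token, such as 'bik' for 'bike'.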