How to match a partial string from a text in a pandas dataframe
My dataframe looks like this:
id text
1 good,i am interested..please mail me.
2 call me...good to go with you
3 not interested...bye
4 i am not interested don't call me
5 price is too high so not interested
6 i have some requirement..please mail me
I want the dataframe to look like this:
id text is_relevant
1 good,i am interested..please mail me. yes
2 call me...good to go with you yes
3 not interested...bye no
4 i am not interested don't call me no
5 price is too high so not interested no
6 i have some requirement..please mail me yes
I have written the following code:
d1 = {'no': ['Not interested','nt interested']}
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
df["is_relevant"] = df['new_text'].map(d).fillna('yes')
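A likely reason this attempt labels almost every row 'yes': Series.map with a dict looks up each whole cell value as a key, so a phrase that is only part of a longer text never matches. The sketch below illustrates this on a tiny made-up frame; it assumes the column is named text (the snippet above reads from new_text, which does not appear in the sample data).

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data, not the asker's file)
df = pd.DataFrame({"text": ["not interested...bye", "nt interested", "call me"]})
d = {"not interested": "no", "nt interested": "no"}

# map() performs an exact lookup of the full cell value in the dict keys,
# so "not interested...bye" is not found even though it contains a key
labels = df["text"].map(d).fillna("yes").tolist()
print(labels)
```

Only the second row, whose text equals a key exactly, gets 'no'; partial matching needs a substring test such as str.contains or the in operator.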
In [20]: df = pd.read_csv("a.csv")
In [21]: a
Out[21]: ['not interested', 'nt interested']
In [22]: df
Out[22]:
id text
0 1 good i am interested..please mail me.
1 2 call me...good to go with you
2 3 not interested...bye
3 4 i am not interested don't call me
4 5 price is too high so not interested
5 6 i have some requirement..please mail me
In [23]: df["is_relevant"] = df["text"].apply(lambda x: "no" if (a[0] in x.lower() or a[1] in x.lower()) else "yes")
In [24]: df
Out[24]:
id text is_relevant
0 1 good i am interested..please mail me. yes
1 2 call me...good to go with you yes
2 3 not interested...bye no
3 4 i am not interested don't call me no
4 5 price is too high so not interested no
5 6 i have some requirement..please mail me yes
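The lambda above hard-codes a[0] and a[1], so it only works for exactly two phrases. A minimal sketch of the same idea for a list of any length, using any(); the helper name label_text and the variable phrases are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "text": [
        "good,i am interested..please mail me.",
        "call me...good to go with you",
        "not interested...bye",
        "i am not interested don't call me",
        "price is too high so not interested",
        "i have some requirement..please mail me",
    ],
})
phrases = ["not interested", "nt interested"]

def label_text(text, phrases=phrases):
    """Return 'no' if any phrase occurs in the lower-cased text, else 'yes'."""
    lowered = text.lower()
    return "no" if any(p in lowered for p in phrases) else "yes"

df["is_relevant"] = df["text"].apply(label_text)
print(df)
```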
You can do:
d1 = {'no': ['not interested','nt interested']}
# create regex
reg = '|'.join([f'\\b{x}\\b' for x in list(d1.values())[0]])
# apply function
df['is_relevant'] = df['text'].str.lower().str.contains(reg).map({True: 'no', False: 'yes'})
print(df)
id text is_relevant
0 1 good,i am interested..please mail me. yes
1 2 call me...good to go with you yes
2 3 not interested...bye no
3 4 i am not interested don't call me no
4 5 price is too high so not interested no
5 6 i have some requirement..please mail me yes
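Because the phrases are joined into a regular expression, a phrase containing characters such as "." or "(" would be interpreted as regex syntax. A hedged variant of the approach above escapes each phrase with re.escape and passes case=False instead of lower-casing first; the names phrases, pattern, and hits are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({"text": [
    "good,i am interested..please mail me.",
    "call me...good to go with you",
    "not interested...bye",
    "i am not interested don't call me",
    "price is too high so not interested",
    "i have some requirement..please mail me",
]})
phrases = ["not interested", "nt interested"]

# Escape each phrase so it is matched literally, and wrap it in word boundaries
pattern = "|".join(rf"\b{re.escape(p)}\b" for p in phrases)

# case=False makes the match case-insensitive; na=False treats missing text as no match
hits = df["text"].str.contains(pattern, case=False, regex=True, na=False)
df["is_relevant"] = hits.map({True: "no", False: "yes"})
print(df)
```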
This is similar to YOLO's answer above, but allows multiple text classes.
from functools import reduce

import pandas as pd

df = pd.DataFrame(
    data=["good,i am interested..please mail me.",
          "call me...good to go with you",
          "not interested...bye",
          "i am not interested don't call me",
          "price is too high so not interested",
          "i have some requirement..please mail me"],
    columns=['text'], index=[1, 2, 3, 4, 5, 6])
# Map each label to the substrings that trigger it (matching is case-sensitive)
d1 = {'no': ['Not interested', 'nt interested', 'not interested'],
      'maybe': ['requirement']}
df['is_relevant'] = 'yes'  # default label
for k in d1:
    # OR together the boolean masks for every pattern of this label
    match_inds = reduce(lambda x, y: x | y,
                        [df['text'].str.contains(pat) for pat in d1[k]])
    df.loc[match_inds, 'is_relevant'] = k
print(df)
Output
text is_relevant
1 good,i am interested..please mail me. yes
2 call me...good to go with you yes
3 not interested...bye no
4 i am not interested don't call me no
5 price is too high so not interested no
6 i have some requirement..please mail me maybe
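The same multi-class labelling can also be sketched without an explicit loop by using np.select. This is an alternative, not the method above: with np.select the first matching class in the dict wins, whereas in the loop a later class overwrites an earlier one. The name classes is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": [
    "good,i am interested..please mail me.",
    "call me...good to go with you",
    "not interested...bye",
    "i am not interested don't call me",
    "price is too high so not interested",
    "i have some requirement..please mail me",
]}, index=[1, 2, 3, 4, 5, 6])

classes = {"no": ["not interested", "nt interested"],
           "maybe": ["requirement"]}

# One boolean mask per class: True where any of that class's phrases occurs
conditions = [
    df["text"].str.contains("|".join(pats), case=False).to_numpy()
    for pats in classes.values()
]

# Rows that match no class fall back to the default label 'yes'
df["is_relevant"] = np.select(conditions, list(classes.keys()), default="yes")
print(df)
```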
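If all you need to match is a list of phrases such as ['not interested', 'nt interested'], np.where is enough.
If the phrases are stored in a dict, put them into a flat list of strings first (starting from lst = list(dict.values())), then use np.where in the same way.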
import numpy as np
lst = ['not interested', 'nt interested']
# 'no' where any phrase in lst occurs in the text, otherwise 'yes'
df['is_relevant'] = np.where(df.text.str.contains("|".join(lst)), 'no', 'yes')
text is_relevant
1 good,i am interested..please mail me. yes
2 call me...good to go with you yes
3 not interested...bye no
4 i am not interested don't call me no
5 price is too high so not interested no
6 i have some requirement..please mail me yes
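One detail when the phrases come from a dict like d1 in the question: its values are themselves lists, so list(d1.values()) yields a list of lists, which "|".join cannot handle. A minimal sketch, assuming that dict shape, flattens the values with itertools.chain before calling np.where.

```python
from itertools import chain

import numpy as np
import pandas as pd

d1 = {"no": ["not interested", "nt interested"]}

# Flatten the dict's list values into a single flat list of phrases
lst = list(chain.from_iterable(d1.values()))

df = pd.DataFrame({"text": [
    "good,i am interested..please mail me.",
    "call me...good to go with you",
    "not interested...bye",
    "i am not interested don't call me",
    "price is too high so not interested",
    "i have some requirement..please mail me",
]}, index=[1, 2, 3, 4, 5, 6])

df["is_relevant"] = np.where(df["text"].str.contains("|".join(lst)), "no", "yes")
print(df)
```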