pandas DataFrame multiple substrings match, also put the particular matched substring for a row into a new column
I'm trying to extract some records from a survey response DataFrame. Each of these records must contain at least one of a set of keywords.
For example, I have a DataFrame df:
svy_rspns_txt
I like it
I hate it
It's a scam
It's shaddy
Scam!
Good service
Very disappointed
Now if I run
kw = "hate,scam,shaddy,disappoint"
# Python 3 strings are already Unicode, so no unicode() conversion is needed
sensitive_words = kw.lower().split(",")
df = df[df["svy_rspns_txt"].astype(str).str.contains('|'.join(sensitive_words), case=False, na=False)]
I get a result like this:
svy_rspns_txt
I hate it
It's a scam
It's shaddy
Scam!
Very disappointed
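The filter above works by joining the keywords into a single regular-expression alternation. A minimal sketch of that step follows, assuming Python 3 and a recent pandas. The `re.escape` call is an addition that is not in the original code; it only matters if a keyword contains regex metacharacters such as `.` or `+`.

```python
import re
import pandas as pd

df = pd.DataFrame({"svy_rspns_txt": ["I like it", "I hate it", "It's a scam",
                                     "It's shaddy", "Scam!", "Good service",
                                     "Very disappointed"]})
kw = "hate,scam,shaddy,disappoint"

# Escape each keyword so it is matched literally, then join the keywords
# into one alternation pattern (assumption: keywords are plain substrings).
pattern = "|".join(re.escape(w) for w in kw.lower().split(","))

# case=False makes "Scam!" match "scam"; na=False treats missing text as no match.
mask = df["svy_rspns_txt"].str.contains(pattern, case=False, na=False)
filtered = df[mask]

print(pattern)
print(filtered)
```

Keeping the boolean mask in its own variable makes it easy to inspect which rows matched before the DataFrame is overwritten.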
Now how can I add a column "matched_word" that shows exactly which string was matched, so that I get a result like this:
svy_rspns_txt      matched_word
I hate it          hate
It's a scam        scam
It's shaddy        shaddy
Scam!              scam
Very disappointed  disappoint
Using a generator expression with next:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})
kw = "hate,scam,shaddy,disappoint"
words = set(kw.split(','))
# Take the first keyword found in the lowercased text, or NaN if none matches
df['match'] = df['text'].apply(lambda x: next((i for i in words if i in x.lower()), np.nan))
print(df)
                text       match
0          I like it         NaN
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
5       Good service         NaN
6  Very disappointed  disappoint
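As an alternative to the row-wise apply, the same column can probably be built with pandas' vectorized string methods. The sketch below is not part of the original answer. It assumes the keywords are plain substrings, escapes them with `re.escape`, and uses a list rather than a set so that the keyword order is fixed.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})
words = ["hate", "scam", "shaddy", "disappoint"]  # list keeps a fixed order

# One capture group containing every keyword as an alternative.
pattern = "(" + "|".join(re.escape(w) for w in words) + ")"

# str.extract returns the text captured by the group, or NaN when nothing matches.
# expand=False yields a Series; .str.lower() normalizes "Scam" to "scam".
df["match"] = (
    df["text"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(df)
```

One difference worth noting: when a text contains several keywords, the regex reports the one that occurs earliest in the text, whereas iterating over a set gives no guaranteed order.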
You can filter for valid strings via pd.Series.notnull, or by noting that NaN != NaN:
res = df[df['match'].notnull()]
# or, res = df[df['match'].notna()]
# or, res = df[df['match'] == df['match']]
print(res)
                text       match
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
6  Very disappointed  disappoint
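The NaN != NaN trick relies on the IEEE 754 rule that NaN never compares equal to itself, so comparing the column with itself is False exactly where the value is missing. The short sketch below uses a small made-up frame to check that the filtering styles select the same rows. `dropna(subset=...)` is an extra option not mentioned above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, mirroring the 'match' column above).
df = pd.DataFrame({
    "text": ["I like it", "I hate it", "Scam!"],
    "match": [np.nan, "hate", "scam"],
})

print(np.nan == np.nan)  # False: NaN is never equal to itself

by_notna = df[df["match"].notna()]            # explicit missing-value check
by_self_eq = df[df["match"] == df["match"]]   # relies on NaN != NaN
by_dropna = df.dropna(subset=["match"])       # drops rows where 'match' is missing

print(by_notna.index.tolist(), by_self_eq.index.tolist(), by_dropna.index.tolist())
```

The explicit `notna` or `dropna` forms are usually easier to read than the self-comparison, which can look like a mistake to someone unfamiliar with NaN semantics.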