pandas DataFrame multiple substrings match, also put the particular matched substring for a row into a new column

I'm trying to extract some records from a survey response DataFrame. Each of these records must contain at least one of a set of keywords. For example, I have a DataFrame df:

svy_rspns_txt
I like it
I hate it
It's a scam
It's shaddy
Scam!
Good service
Very disappointed

Now if I run:

kw = "hate,scam,shaddy,disappoint"
sensitive_words = kw.lower().split(",")
# Keep rows whose text contains at least one keyword (case-insensitive)
df = df[df["svy_rspns_txt"].astype(str).str.contains('|'.join(sensitive_words), case=False, na=False)]

I get a result like:

svy_rspns_txt
I hate it
It's a scam
It's shaddy
Scam!
Very disappointed

Now how can I add a column "matched_word" that shows exactly which string was matched, so that I get a result like this:

svy_rspns_txt            matched_word
I hate it                hate
It's a scam              scam
It's shaddy              shaddy
Scam!                    scam
Very disappointed        disappoint

Using a generator expression with next:

import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})

kw = "hate,scam,shaddy,disappoint"

words = set(kw.split(','))

# For each row, take the first keyword found in the lowercased text; NaN if none match
df['match'] = df['text'].apply(lambda x: next((i for i in words if i in x.lower()), np.nan))

print(df)

                text       match
0          I like it         NaN
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
5       Good service         NaN
6  Very disappointed  disappoint
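
As an alternative to the generator expression, the same column can probably be built with pd.Series.str.extract, which stays closer to the str.contains filter from the question. This is only a sketch: it assumes the keywords should be matched literally (hence re.escape) and that the reported match should be lowercased so that "Scam!" yields "scam".

```python
import re

import pandas as pd

df = pd.DataFrame({'text': ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})

kw = "hate,scam,shaddy,disappoint"

# One capture group holding all keywords; escape them so they match literally
pattern = '(' + '|'.join(re.escape(w) for w in kw.split(',')) + ')'

# expand=False returns a Series; rows without a match become NaN
df['match'] = (df['text']
               .str.extract(pattern, flags=re.IGNORECASE, expand=False)
               .str.lower())

print(df)
```

Unlike iterating over a set, the regex reports the match that occurs first in the text, so the result does not depend on set ordering when a row contains several keywords.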

You can filter for valid strings via pd.Series.notnull, or by using the fact that NaN != NaN:

res = df[df['match'].notnull()]
# or, res = df[df['match'].notna()]
# or, res = df[df['match'] == df['match']]

print(res)

                text       match
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
6  Very disappointed  disappoint
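
The NaN != NaN trick works because a NaN value never compares equal to itself, so comparing a column with itself is False exactly where the value is missing. A small check on an illustrative Series (the name s is hypothetical, not part of the answer) might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative object-dtype Series with one missing value
s = pd.Series(['hate', np.nan, 'scam'], dtype=object)

# NaN is never equal to itself
print(np.nan == np.nan)  # False

# Comparing the Series with itself gives False only at the NaN position
mask_self_eq = s == s
mask_notna = s.notna()

print(mask_self_eq.tolist())
print(mask_notna.tolist())
```

Both masks select the same rows; notnull/notna states the intent more directly.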
