Fastest way to find a word in a pandas dataframe column
I have a dataframe like this:
name | sentence
---|---
Tom | The cat is on the table.
Bob | One might say that caterpillars are majestic
As a result, I want to get a dataframe like this:
name | sentence | contains_cat
---|---|---
Tom | The cat is on the table. | True
Bob | One might say that caterpillars are majestic | False
So the column "contains_cat" has to show True only if the corresponding row of the "sentence" column contains exactly the word cat (and not, for example, caterpillar).
I wrote code that does this by searching for strings like " cat " or " cat.". Is it possible to speed this up, considering that I'd like to do this for large dataframes and for many words?
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})
df['contains_cat'] = False
strings_to_find = [' cat ', 'Cat ', ' cat.']
for s in strings_to_find:
    df['contains_cat'] = df['contains_cat'] | \
        [s in sentence for sentence in df['sentence']]
print(df)
Use str.contains:

df["contains_cat"] = df["sentence"].str.contains(r'\bcat\b')

Note that the regex pattern \bcat\b will find exact matches for the word cat (but not cat as part of a larger word such as caterpillar). Regex search is enabled by default with str.contains.
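Since the question mentions searching for many words, a single alternation pattern can cover all of them in one vectorized pass, and case=False handles capitalized variants like "Cat" that the original string list included. A minimal sketch (the word list and the contains_word column name here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})

# Hypothetical word list; join the words into one alternation,
# with \b marking word boundaries on both sides.
words = ['cat', 'dog']
pattern = r'\b(?:' + '|'.join(words) + r')\b'

# case=False makes the match case-insensitive, so "Cat" also counts.
df['contains_word'] = df['sentence'].str.contains(pattern, case=False)
print(df)
```

This scans each sentence once regardless of how many words are in the list, instead of looping over the words in Python.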