Fastest way to find a word in a pandas dataframe column
I have a dataframe like this:
name | sentence
---|---
Tom | The cat is on the table.
Bob | One might say that caterpillars are majestic
As a result, I want to get a dataframe like this:
name | sentence | contains_cat
---|---|---
Tom | The cat is on the table. | True
Bob | One might say that caterpillars are majestic | False
So the column "contains_cat" has to show True only if the corresponding row of the "sentence" column contains exactly the word cat (and not, for example, caterpillar).
I wrote code that does this by searching for strings like " cat " or " cat.". Is it possible to speed this up, considering that I'd like to do this for large dataframes and for many words?
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})
df['contains_cat'] = False
strings_to_find = [' cat ', 'Cat ', ' cat.']
for s in strings_to_find:
    df['contains_cat'] = df['contains_cat'] | \
        [s in sentence for sentence in df['sentence']]
print(df)
Use str.contains:

df["contains_cat"] = df["sentence"].str.contains(r'\bcat\b')

Note that the regex pattern \bcat\b will find exact matches for the word cat (but not cat as part of a larger word such as caterpillar). Regex search is enabled by default with str.contains.
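Since the question mentions searching for many words, a single alternation pattern can cover all of them in one vectorized pass, and case=False handles capitalized variants like "Cat" that the original string list included. A minimal sketch (the word list and the contains_word column name here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})

# Hypothetical word list; join the words into one alternation,
# with \b marking word boundaries on both sides.
words = ['cat', 'dog']
pattern = r'\b(?:' + '|'.join(words) + r')\b'

# case=False makes the match case-insensitive, so "Cat" also counts.
df['contains_word'] = df['sentence'].str.contains(pattern, case=False)
print(df)
```

This scans each sentence once regardless of how many words are in the list, instead of looping over the words in Python.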