
Fastest way to find a word in a pandas dataframe column

I have a dataframe like this:

name  sentence
Tom   The cat is on the table.
Bob   One might say that caterpillars are majestic

As a result, I want to get a dataframe like this:

name  sentence                                      contains_cat
Tom   The cat is on the table.                      True
Bob   One might say that caterpillars are majestic  False

So the column "contains_cat" should be True only if the corresponding row of the "sentence" column contains exactly the word cat (and not, for example, caterpillar).

I wrote code that does this by searching for strings like " cat " or " cat.". Is it possible to speed this up, given that I'd like to do this for large dataframes and for many words?

import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.', 'One might say that caterpillars are majestic']})
df['contains_cat'] = False

# Substrings that are treated as the standalone word "cat"
string_to_find = [' cat ',
                  'Cat ',
                  ' cat.']
for ii in range(len(string_to_find)):
    # Repeat the current search string once per row so it can be zipped with the sentences
    df1 = pd.DataFrame({'dummy': [string_to_find[ii]] * len(df)})
    # OR the running result with a plain substring test on each sentence
    df['contains_cat'] = df['contains_cat'] | \
                         [x[0] in x[1] for x in zip(df1['dummy'], df['sentence'])]

print(df)

Use str.contains:

df["contains_cat"] = df["sentence"].str.contains(r'\bcat\b')

Note that the regex pattern \bcat\b matches cat only as a whole word, not as a substring of a larger word such as caterpillar. Regex search is enabled by default in str.contains. The match is case-sensitive by default; pass case=False to also match Cat.
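Since the question also asks about many words, here is a minimal sketch of how the same idea might be extended. The word list is hypothetical (only "cat" comes from the question), and the column names contains_<word> and contains_any are made up for illustration. It assumes the words may contain regex metacharacters, hence re.escape, and that matching should ignore case, since the question's code also looked for "Cat ". No timing was measured here, so whether this is fast enough for a given dataset is something to benchmark.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom", "Bob"],
    "sentence": ["The cat is on the table.",
                 "One might say that caterpillars are majestic"],
})

# Hypothetical word list; only "cat" appears in the question.
words = ["cat", "table", "dog"]

# One boolean column per word. re.escape protects against regex
# metacharacters in a word; case=False also matches "Cat".
for word in words:
    pattern = rf"\b{re.escape(word)}\b"
    df[f"contains_{word}"] = df["sentence"].str.contains(pattern, case=False, regex=True)

# If only "does any of the words occur?" is needed, a single alternation
# lets each sentence be scanned once. The non-capturing group (?:...)
# avoids the pandas warning about match groups in str.contains.
any_pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
df["contains_any"] = df["sentence"].str.contains(any_pattern, case=False, regex=True)

print(df)
```

If the sentence column can contain missing values, str.contains returns a missing value for those rows by default; passing na=False makes them False instead.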
