
How to make pandas dataframe str.contains search faster

I am searching for a substring, or several substrings, in a dataframe of 4 million rows:

df[df.col.str.contains('Donald',case=True,na=False)]

or

df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)]

The DataFrame (df) looks like the following (with 4 million rows of strings):

import pandas as pd

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})

Are there any tips to make this string search faster? For example: sorting the dataframe first, some kind of indexing, changing the column name to a number, removing `na=False` from the query, etc.? Even a millisecond speed-up would be very helpful!

If the number of substrings is small, it can be faster to search for them one at a time, because then you can pass the `regex=False` argument to `contains`, which speeds it up.

On a sample DataFrame of about 6000 rows, testing with two sample substrings, `blah.contains("foo", regex=False) | blah.contains("bar", regex=False)` was about twice as fast as `blah.contains("foo|bar")`. You will have to test it with your own data to see how it scales.
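A minimal sketch of that comparison, using an invented toy Series (the substrings "foo"/"bar" stand in for the real search terms); both approaches produce the same boolean mask, but the literal version skips regex compilation:

```python
import pandas as pd

# Hypothetical sample data standing in for the real 4-million-row column.
s = pd.Series(["foo and more", "a bar here", "neither one", "foo bar both"])

# Single search with a regex alternation:
regex_mask = s.str.contains("foo|bar", na=False)

# OR of per-substring literal searches; regex=False avoids the regex
# engine entirely, which is often faster for a handful of substrings.
literal_mask = (s.str.contains("foo", regex=False, na=False)
                | s.str.contains("bar", regex=False, na=False))

print(regex_mask.equals(literal_mask))  # True: the two masks agree
```

The trade-off: each extra substring adds another full pass over the column, so past a certain number of terms the single compiled regex can win again.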

You could convert the column to a list. It seems that searching in a list is faster than applying the string method to a Series.

Sample code:

import pandas as pd  # run in IPython/Jupyter; %timeit is an IPython magic

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})


def first_way():
    df["new"] = df["col"].str.contains('Donald', case=True, na=False)
    return None

print("First_way: ")
%timeit for x in range(10): first_way()
print(df)

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})


def second_way():
    listed = df["col"].tolist()
    df["new"] = ["Donald" in n for n in listed]
    return None

print("Second way: ")
%timeit for x in range(10): second_way()
print(df)

Results:

First_way: 
100 loops, best of 3: 2.77 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False
Second way: 
1000 loops, best of 3: 1.79 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False

BrenBarn's answer above helped me solve my problem. Just writing down my problem and how I solved it below; hope it helps someone :)

My data has about 2000 rows, mostly text. Previously I used a case-insensitive regular expression, like this:

import re

# Build a regex of lookaheads, one per term: every term must appear somewhere.
reg_exp = ''.join(['(?=.*%s)' % (i) for i in search_list])
series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
data_new = data_new[series_to_search.str.contains(reg_exp, flags=re.IGNORECASE)]

For a search list containing ['exception', 'VE20'], this code took 58.710898 seconds.

When I replaced that code with a simple for loop, it took only 0.055304 seconds: a 1,061.60x improvement!!!

# Filter progressively: each matched term shrinks the frame before the next search.
for search in search_list:
    series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
    data_new = data_new[series_to_search.str.lower().str.contains(search.lower())]
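A self-contained sketch of the same progressive-filtering idea (the column names and sample rows here are invented for illustration; the real code indexes columns by position):

```python
import pandas as pd

# Toy data standing in for the real title/description columns.
data_new = pd.DataFrame({
    'title': ["VE20 exception raised", "routine update", "VE20 status ok"],
    'description': ["unhandled exception in module", "no issues", "exception cleared"],
})
search_list = ['exception', 'VE20']

# Each pass keeps only matching rows, so later searches scan fewer rows.
# Lowercasing both sides replaces the re.IGNORECASE flag, and regex=False
# treats each term as a literal substring.
for search in search_list:
    series_to_search = data_new['title'] + ' : ' + data_new['description']
    data_new = data_new[series_to_search.str.lower()
                                        .str.contains(search.lower(), regex=False)]

print(data_new['title'].tolist())  # ['VE20 exception raised', 'VE20 status ok']
```

Like the lookahead regex, the loop requires every term to match (AND semantics), but it avoids the pathological backtracking cost of chained `(?=.*…)` lookaheads.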


Disclaimer: Technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source.
