pandas: How to limit the results of str.contains?

Question

I have a DataFrame with more than 1M rows. I'd like to select all the rows where a certain column contains a certain substring:

matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()

But this selection is slow and I'd like to speed it up. Let's say I only need the first n results. Is there a way to stop matching after getting n results? I've tried:

matching = df['col2'].str.contains('substr', case=True, regex=False).head(n)

and:

matching = df['col2'].str.contains('substr', case=True, regex=False).sample(n)

but they aren't any faster. The second statement is just boolean indexing and is very fast. How can I speed up the first statement?

Answer 1

Believe it or not, the .str accessor is slow. A list comprehension can give better performance.

df = pd.DataFrame({'col2':np.random.choice(['substring','midstring','nostring','substrate'],100000)})

Test that both approaches give the same result:

all(df['col2'].str.contains('substr', case=True, regex=False) ==
    pd.Series(['substr' in i for i in df['col2']]))

Output:

True

Timings:

%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop

versus

%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
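The %timeit lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The function names, the seed, and the repeat counts here are illustrative choices, and the measured times will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as above, seeded so runs are repeatable.
rng = np.random.default_rng(0)
words = ["substring", "midstring", "nostring", "substrate"]
df = pd.DataFrame({"col2": rng.choice(words, 100_000)})


def with_str_accessor():
    return df["col2"].str.contains("substr", case=True, regex=False)


def with_list_comprehension():
    return pd.Series(["substr" in s for s in df["col2"]], index=df.index)


# Both approaches must agree before their timings are worth comparing.
assert np.array_equal(
    with_str_accessor().to_numpy(), with_list_comprehension().to_numpy()
)

for func in (with_str_accessor, with_list_comprehension):
    # Best of 3 repeats, 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best * 1e3:.1f} ms per call")
```

Note that the list comprehension assumes every value is a string; a missing value (NaN) in the column would raise a TypeError on the `in` test.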

Answer 2

You can speed it up with:

matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]

However, this solution retrieves the matches found within the first n rows, not the first n matching results.

If you actually want the first n matching results, you should use:

rows =  df['col1'][df['col2'].str.contains("substr")==True].head(n)

But this option is, of course, much slower.
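To make the difference between the two selections concrete, here is a small sketch on a made-up six-row frame (the data and variable names are illustrative only). With n = 2, the first variant only looks at the first two rows, while the second scans the whole column and then keeps two matches.

```python
import pandas as pd

# Toy data: matches sit at positions 1, 3 and 5.
df = pd.DataFrame({
    "col1": list("abcdef"),
    "col2": ["x", "has substr", "y", "substr too", "z", "substr again"],
})
n = 2

# Matches within the first n rows: cheap, but may return fewer than n rows.
head_mask = df["col2"].head(n).str.contains("substr", regex=False)
within_first_n_rows = df["col1"].head(n)[head_mask]

# First n matches overall: scans every row before truncating.
full_mask = df["col2"].str.contains("substr", regex=False)
first_n_hits = df["col1"][full_mask].head(n)

print(within_first_n_rows.tolist())  # ['b']
print(first_n_hits.tolist())         # ['b', 'd']
```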

Inspired by @ScottBoston's answer, you can use the following approach for a faster complete solution:

rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)

This is faster than the previous option, but not much faster than computing the full result, because the whole column is still scanned before head(n) is applied. With this solution you get the first n matching results.
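None of the variants so far actually stops scanning once n matches have been found. One way to do that, sketched here under the assumption that plain Python iteration is acceptable, is to combine a generator with itertools.islice. The helper name first_n_matches and its text_col and value_col parameters are hypothetical and not part of pandas. Whether this beats a full scan depends on how early the matches occur: if there are fewer than n matches, the whole column is still traversed.

```python
from itertools import islice

import pandas as pd


def first_n_matches(df, n, substr, text_col="col2", value_col="col1"):
    """Return the first n values of value_col whose text_col contains substr.

    Hypothetical helper: iteration ends as soon as n matches are found.
    Assumes every value in text_col is a string (no NaN).
    """
    # Lazily yield the positions of matching rows; nothing runs until consumed.
    hits = (pos for pos, text in enumerate(df[text_col]) if substr in text)
    # islice stops pulling from the generator after n positions.
    positions = list(islice(hits, n))
    return df[value_col].iloc[positions]


df = pd.DataFrame({
    "col1": ["Result", "from", "first", "column", "for", "this"],
    "col2": ["This", "is", "a", "test", "has substr", "also has substr"],
})
print(first_n_matches(df, 1, "substr").tolist())  # ['for']
```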

With the test code below we can see how fast each solution is and what it returns:

import pandas as pd
import time

n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]

col1 = a*1000000
col2 = b*1000000

df = pd.DataFrame({"col1":col1,"col2":col2})

# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))

# Faster option
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast]
print("--- %s seconds for fast solution ---" % (time.time() - start_time))


# Other option
start_time = time.time()
rows_other =  df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))

# Complete option
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))

This would output:

>>> 
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---

And the resulting Series would be:

>>> rows
4     for
5    this
Name: col1, dtype: object
>>> rows_fast
4     for
5    this
Name: col1, dtype: object
>>> rows_other
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
>>> rows_complete
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object

Solution 1
2, accepted, 2018-03-15 19:06:04

Solution 2
1, 2018-03-15 18:07:36
