Using pandas, how do I check whether a particular sequence exists in a column?

Question

I have a dataframe:

import pandas as pd
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'], 'Length': [7, 6, 9]})

I want to be able to check whether a particular sequence, say 'WOD', exists in any entry of the 'Sequence' column. It doesn't matter whether it sits in the middle or at the ends of the entry; if that sequence, in that order, exists in any entry of that column, return true.

How would I do this?

I looked into .isin and .contains, but both seem to return true only if the exact, ENTIRE sequence is in the column:

df['Sequence'].isin(['ABCDEFG']).any()  # returns True
df['Sequence'].isin(['ABC']).any()  # returns False

I want a sort of Ctrl+F function that can find any sequence, in that order, regardless of where it is or how long it is.

Answer 1

You can do this simply with str.contains:

In [657]: df['Sequence'].str.contains('WOD')    
Out[657]: 
0    False
1     True
2     True
Name: Sequence, dtype: bool

Or you can use str.find:

In [658]: df['Sequence'].str.find('WOD')
Out[658]: 
0   -1
1    1
2    1
Name: Sequence, dtype: int64

It returns the index of the first match, or -1 when the substring is not found.
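
Since the question asks for a single true/false answer, the boolean Series from str.contains can be reduced with .any(). A minimal sketch, assuming the search term should be treated as a literal string rather than a regular expression (hence regex=False) and that missing values should count as non-matches (na=False); the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# regex=False treats 'WOD' as a literal substring;
# na=False maps missing values to False instead of NaN
mask = df['Sequence'].str.contains('WOD', regex=False, na=False)

# True if at least one entry contains the substring
found = bool(mask.any())

# Rows whose 'Sequence' value contains the substring
matches = df[mask]
```

The same mask serves both purposes: .any() answers "does it exist anywhere in the column", while df[mask] shows which rows matched.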

Answer 2

We need to use str.findall before contains:

df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
0    False
1     True
2     True
Name: Sequence, dtype: bool
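
The findall approach above keeps only the characters W, O, and D before testing, so its result can differ from a plain substring test when other characters sit between those letters. A quick check on made-up sample strings (the Series contents and variable names are hypothetical):

```python
import pandas as pd

s = pd.Series(['AWODIH', 'WxOxD', 'ABCDEFG'])  # hypothetical sample data

# Keep only the characters W, O, D, join them, then look for 'WOD'
filtered = s.str.findall('W|O|D').str.join('')
filtered_match = filtered.str.contains('WOD')

# Plain contiguous substring test for comparison
contiguous_match = s.str.contains('WOD', regex=False)
```

Here 'WxOxD' matches under the findall approach but not under the plain substring test, so the two are only interchangeable when a contiguous match is not required.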

Answer 3

If you want to use Python's in syntax, you can do:

df.Sequence.apply(lambda x: 'WOD' in x)

If performance is a consideration, the following list comprehension is roughly 20 times faster than the other solutions on this three-row dataframe (see the benchmark below):

['WOD' in e for e in df.Sequence]
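
The list comprehension returns a plain Python list rather than a Series. A minimal sketch of turning it back into an index-aligned boolean Series; the isinstance guard is an added assumption about how to treat missing entries, since `'WOD' in e` raises a TypeError when e is None or NaN.

```python
import pandas as pd

# None stands in for a hypothetical missing entry
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', None]})

# Guard against non-string entries, which do not support `'WOD' in e`
flags = [isinstance(e, str) and 'WOD' in e for e in df['Sequence']]

# Re-attach the original index so the result can be used as a boolean mask
mask = pd.Series(flags, index=df.index, name='Sequence')

# Single true/false answer for the whole column
any_match = any(flags)
```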

Benchmark

%%timeit
['WOD' in e for e in df.Sequence]
8.26 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df.Sequence.apply(lambda x: 'WOD' in x)
164 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.contains('WOD')   
153 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.find('WOD')
159 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
585 µs ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
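
These timings come from a three-row dataframe, where fixed per-call overhead is likely to dominate. A sketch for re-running the comparison on a larger column with the standard timeit module; the row count, the repeated sample data, and the function names are arbitrary assumptions, and no particular result is implied.

```python
import timeit
import pandas as pd

# Hypothetical larger column: the three sample strings repeated
seq = pd.Series(['ABCDEFG', 'AWODIH', 'AWODIHAWD'] * 10_000)

def with_comprehension():
    # Plain Python substring test per element
    return ['WOD' in e for e in seq]

def with_str_contains():
    # pandas string method, literal (non-regex) match
    return seq.str.contains('WOD', regex=False)

# Time each approach over the same number of runs
t_comp = timeit.timeit(with_comprehension, number=20)
t_contains = timeit.timeit(with_str_contains, number=20)
print(f"list comprehension: {t_comp:.4f} s, str.contains: {t_contains:.4f} s")
```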
