Using pandas, how do I check if a particular sequence exists in a column?
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'], 'Length': [7, 6, 9]})
I want to be able to check if a particular sequence, say 'WOD', exists in any entry of the 'Sequence' column. It doesn't matter whether it is in the middle or at the ends of the entry; if that sequence, in that order, exists in any entry of that column, return true.
How would I do this?
I looked into .isin and .contains, both of which seem to return true only if the exact, and ENTIRE, sequence is in the column:
df['Sequence'].isin(['ABCDEFG']).any()  # returns True
df['Sequence'].isin(['ABC']).any()  # returns False
I want a sort of Ctrl+F function that could search for any sequence in that order, regardless of where it is or how long it is.
You can simply do this using str.contains:
In [657]: df['Sequence'].str.contains('WOD')
Out[657]:
0    False
1     True
2     True
Name: Sequence, dtype: bool
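Since the question asks for a single true/false answer for the whole column, the boolean Series can be reduced with .any(). A minimal sketch, assuming the search term should be treated as a literal string rather than a regular expression (hence regex=False) and that missing values should count as non-matches (na=False):

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# Row-wise mask: True where 'WOD' occurs anywhere in the entry.
mask = df['Sequence'].str.contains('WOD', regex=False, na=False)

# Collapse the mask to one answer for the whole column.
found = bool(mask.any())

# Select only the rows that contain the sequence.
matches = df[mask]
```

Without regex=False the pattern is interpreted as a regular expression, which matters if the search term contains characters such as '.' or '+'.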
Or, you can use str.find:
In [658]: df['Sequence'].str.find('WOD')
Out[658]:
0   -1
1    1
2    1
Name: Sequence, dtype: int64
This returns -1 on failure.
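To turn the positions from str.find into the true/false answer the question asks for, compare them against zero. A small sketch, using the same example dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# Index of the first occurrence of 'WOD' in each entry, or -1 if absent.
positions = df['Sequence'].str.find('WOD')

# A non-negative position means the sequence was found in that entry.
mask = positions >= 0

# True if at least one entry contains the sequence.
found = bool(mask.any())
```

This is mostly useful when the position itself is also needed; otherwise str.contains expresses the intent more directly.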
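Another option is to use str.findall before contains. Note that this first extracts only the letters W, O and D from each entry and then checks the joined result, so it also matches entries where those letters are separated by other characters: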
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
0    False
1     True
2     True
Name: Sequence, dtype: bool
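The findall approach does not behave the same as a plain substring search. The sketch below compares the two on a few made-up entries (the Series s is hypothetical and not part of the original dataframe):

```python
import pandas as pd

# Hypothetical test entries: 'WAOAD' has W, O, D in order but not adjacent.
s = pd.Series(['AWODIH', 'WAOAD', 'ABCDEFG'])

# Plain substring search: 'WOD' must appear contiguously.
contiguous = s.str.contains('WOD', regex=False)

# findall keeps only the W/O/D characters, joins them, then searches.
filtered = s.str.findall('W|O|D').str.join('').str.contains('WOD')

print(contiguous.tolist())  # [True, False, False]
print(filtered.tolist())    # [True, True, False]
```

If the sequence has to appear as one unbroken run, as in a Ctrl+F search, the plain str.contains version is the one that matches that requirement.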
If you want to use the in operator, you can do:
df.Sequence.apply(lambda x: 'WOD' in x)
If performance is a consideration, a plain list comprehension is many times faster than the other solutions on this small dataframe:
['WOD' in e for e in df.Sequence]
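One caveat with the list comprehension is that 'WOD' in e raises a TypeError when an entry is missing (None or NaN). A hedged sketch of one way to guard against that, assuming missing entries should count as non-matches; the HasWOD column name is made up for illustration:

```python
import pandas as pd

# Same idea as above, but with a missing entry added for illustration.
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', None, 'AWODIHAWD']})

# Only test entries that are actual strings; anything else counts as False.
df['HasWOD'] = [isinstance(e, str) and 'WOD' in e for e in df['Sequence']]

# True if at least one entry contains the sequence.
found = any(df['HasWOD'])
```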
Benchmark
%%timeit
['WOD' in e for e in df.Sequence]
8.26 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.Sequence.apply(lambda x: 'WOD' in x)
164 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['Sequence'].str.contains('WOD')
153 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['Sequence'].str.find('WOD')
159 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
585 µs ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
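The timings above are for a three-row dataframe, where fixed per-call overhead probably accounts for much of the difference, so the ranking may change on larger data. A sketch for repeating the comparison outside IPython with the standard timeit module; the 30,000-row frame and the variable names are assumptions made for illustration, and no results are claimed here:

```python
import timeit
import pandas as pd

# Hypothetical larger frame built by repeating the three example entries.
base = ['ABCDEFG', 'AWODIH', 'AWODIHAWD']
big = pd.DataFrame({'Sequence': base * 10_000})

candidates = {
    'list comprehension': lambda: ['WOD' in e for e in big['Sequence']],
    'str.contains': lambda: big['Sequence'].str.contains('WOD', regex=False),
    'apply': lambda: big['Sequence'].apply(lambda x: 'WOD' in x),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats, one call per repeat, in seconds.
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name}: {timings[name]:.4f} s')
```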