
Using pandas, how do I check if a particular sequence exists in a column?

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'], 'Length': [7, 6, 9]})

I want to be able to check if a particular sequence, say 'WOD', exists in any entry of the 'Sequence' column. It doesn't matter whether it is in the middle or at the ends of the entry; if that sequence, in that order, exists anywhere in any entry of that column, it should return true.

How would I do this?

I looked into .isin and .contains, but as far as I can tell both only return true if the exact, ENTIRE sequence is in the column:

df['Sequence'].isin(['ABCDEFG']).any()  # returns True
df['Sequence'].isin(['ABC']).any()  # returns False

I want a sort of Ctrl+F function that can find any sequence in that order, regardless of where it is or how long it is.

You can do this simply with str.contains:

In [657]: df['Sequence'].str.contains('WOD')    
Out[657]: 
0    False
1     True
2     True
Name: Sequence, dtype: bool
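
The question asks for a single true/false answer, so the row-wise result above can be reduced with .any(). The following is a minimal sketch of that, not the only way to do it. Passing regex=False treats 'WOD' as plain text rather than a regular expression, and na=False is an assumption that missing values should count as "no match".

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# Row-wise mask: True where the entry contains the literal substring 'WOD'.
# regex=False treats the pattern as plain text; na=False counts missing
# values as "no match" (an assumption about how NaN should be handled).
mask = df['Sequence'].str.contains('WOD', regex=False, na=False)

# Collapse to a single answer: does any entry contain the sequence?
found = bool(mask.any())
print(found)

# The same mask can also be used to select the matching rows.
matches = df[mask]
print(matches)
```

Here found is True because rows 1 and 2 contain 'WOD'.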

Or, you can use str.find:

In [658]: df['Sequence'].str.find('WOD')
Out[658]: 
0   -1
1    1
2    1
Name: Sequence, dtype: int64

This returns the index of the first match in each entry, or -1 if the sequence is not found.
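
Since str.find gives positions rather than booleans, a comparison is needed to turn it into a yes/no result. A short sketch, using the same dataframe as in the question:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# Position of the first 'WOD' in each entry, or -1 when it is absent.
positions = df['Sequence'].str.find('WOD')

# Any non-negative position means the sequence was found in that row.
has_wod = positions >= 0
print(has_wod.tolist())

# Single answer for the whole column.
print(bool(has_wod.any()))
```

This is mostly useful when the position itself matters; if only the boolean is needed, str.contains says the same thing more directly.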

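We can use str.findall before contains. Note that this first keeps only the characters W, O and D from each entry and then checks the result for 'WOD', so it also matches entries where other characters sit between those letters: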

df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
0    False
1     True
2     True
Name: Sequence, dtype: bool
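
On the data in the question this gives the same result as str.contains, but the two are not equivalent in general. The sketch below compares them on a small Series; the entry 'WXOYD' is a hypothetical value added only to show the difference.

```python
import pandas as pd

# 'WXOYD' is a hypothetical entry, not part of the original data.
s = pd.Series(['AWODIH', 'WXOYD', 'ABCDEFG'])

# Plain substring test: 'WOD' must appear as one contiguous run.
plain = s.str.contains('WOD', regex=False)

# findall keeps only the W, O and D characters, join glues them back
# together, and contains then looks for 'WOD' in that filtered string.
filtered = s.str.findall('W|O|D').str.join('').str.contains('WOD')

print(plain.tolist())
print(filtered.tolist())
```

The two results differ for 'WXOYD': the plain test rejects it, while the findall version accepts it because the X and Y are discarded before the check.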

If you want to use Python's in syntax, you can do:

df.Sequence.apply(lambda x: 'WOD' in x)

If performance is a consideration, the following list comprehension was roughly 18 to 70 times faster than the other solutions in the benchmark below. Note that it returns a plain Python list rather than a Series:

['WOD' in e for e in df.Sequence]
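
Because the comprehension produces a plain list, it loses the dataframe's index, and the in operator raises a TypeError on non-string entries such as missing values. A minimal sketch of one way to handle both follows; the None entry is a hypothetical addition to the original data.

```python
import pandas as pd

# Original sequences plus one missing value (the None is hypothetical).
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD', None]})

# The isinstance check skips non-string entries, for which `in` would raise.
flags = [isinstance(e, str) and 'WOD' in e for e in df['Sequence']]

# Wrap the list in a Series that shares the dataframe's index,
# so it can be used as a boolean mask.
mask = pd.Series(flags, index=df.index, name='Sequence')
print(df[mask])
```

The extra isinstance check costs a little time per element, so it is only worth adding when the column may contain missing or non-string values.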

Benchmark

%%timeit
['WOD' in e for e in df.Sequence]
8.26 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df.Sequence.apply(lambda x: 'WOD' in x)
164 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.contains('WOD')   
153 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.find('WOD')
159 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
585 µs ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
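
These timings appear to come from the three-row dataframe in the question, where fixed per-call overhead is likely to dominate, so the ranking may look different on larger data. The sketch below is one way to re-measure outside IPython with the standard timeit module. The row count, the random alphabet and the names big and candidates are all assumptions made for illustration.

```python
import random
import timeit

import pandas as pd

random.seed(0)
n_rows = 10_000  # hypothetical size; adjust to match your data

# Random 9-letter strings over a small alphabet so that 'WOD' shows up sometimes.
seqs = [''.join(random.choices('ADIOWH', k=9)) for _ in range(n_rows)]
big = pd.DataFrame({'Sequence': seqs})

# The approaches from the answers above, wrapped as zero-argument callables.
candidates = {
    'list comprehension': lambda: ['WOD' in e for e in big['Sequence']],
    'apply': lambda: big['Sequence'].apply(lambda x: 'WOD' in x),
    'str.contains': lambda: big['Sequence'].str.contains('WOD'),
    'str.find': lambda: big['Sequence'].str.find('WOD'),
}

for name, fn in candidates.items():
    # Average wall-clock time over 10 calls.
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```

No expected timings are given here, since they depend on the machine, the pandas version and the data.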
