
Using pandas, how do I check if a particular sequence exists in a column?

I have a dataframe:

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'], 'Length': [7, 6, 9]})

I want to be able to check if a particular sequence, say 'WOD', exists in any entry of the 'Sequence' column. It can be anywhere in the entry: if that sequence, in that order, appears in any entry of the column, the check should return true.

How would I do this?

I looked into .isin and .contains, both of which seem to return true only if the exact, and ENTIRE, sequence is in the column:

df['Sequence'].isin(['ABCDEFG']).any()  # returns True
df['Sequence'].isin(['ABC']).any()  # returns False

I want a sort of Ctrl+F function that could search for any sequence in that order, regardless of where it is or how long it is.

You can simply do this using str.contains:

In [657]: df['Sequence'].str.contains('WOD')    
Out[657]: 
0    False
1     True
2     True
Name: Sequence, dtype: bool
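
Since the question asks for a single true/false answer for the whole column, the boolean Series above can be reduced with `.any()`. A minimal sketch, assuming the dataframe from the question; the variable names `mask`, `found`, and `matches` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# regex=False treats 'WOD' as a literal substring rather than a regex pattern
mask = df['Sequence'].str.contains('WOD', regex=False)

# True if at least one entry contains the substring
found = bool(mask.any())

# Rows whose 'Sequence' entry contains the substring
matches = df[mask]
```

Passing `regex=False` is optional here, but it avoids surprises if the search string ever contains regex metacharacters such as `.` or `+`.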

Or, you can use str.find:

In [658]: df['Sequence'].str.find('WOD')
Out[658]: 
0   -1
1    1
2    1
Name: Sequence, dtype: int64

This returns -1 on failure, and the index of the first match otherwise.
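
If a boolean result is wanted from `str.find`, the positions can be compared against 0. A short sketch, assuming the same dataframe as in the question; `positions`, `mask`, and `first_hits` are names chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'],
                   'Length': [7, 6, 9]})

# Index of the first occurrence of 'WOD' in each entry, or -1 if absent
positions = df['Sequence'].str.find('WOD')

# Any non-negative position means the substring was found
mask = positions >= 0

# Keep only the positions of entries that actually matched
first_hits = positions[mask]
```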

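Alternatively, use str.findall before contains. Note that this first discards every character other than W, O, and D, so it also matches entries where those letters are not adjacent: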

df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
0    False
1     True
2     True
Name: Sequence, dtype: bool
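
The `findall` approach and a plain substring check can disagree. A small sketch with a made-up entry, 'WxOxD', which is not in the original data and is used here only to show the difference:

```python
import pandas as pd

# 'WxOxD' is a hypothetical entry: it has W, O, D in order but not adjacent
s = pd.Series(['WxOxD', 'AWODIH'])

# findall keeps only W, O, D; joining them yields 'WOD' for both entries
filtered = s.str.findall('W|O|D').str.join('').str.contains('WOD')

# A direct literal substring check requires the letters to be contiguous
direct = s.str.contains('WOD', regex=False)
```

Here `filtered` is true for both entries, while `direct` is true only for 'AWODIH', so the two approaches answer slightly different questions.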

If you want to use Python's in operator, you can do:

df.Sequence.apply(lambda x: 'WOD' in x)

If performance is a consideration, the following list comprehension was roughly 20 to 70 times faster than the other solutions in the benchmark below, measured on this three-row dataframe:

['WOD' in e for e in df.Sequence]
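
The list comprehension returns a plain Python list, and `'WOD' in e` raises a TypeError if an entry is missing (None or NaN). A hedged sketch of one way to handle both points; the missing value in the data below is an assumption added for illustration and is not part of the original dataframe:

```python
import pandas as pd

# Hypothetical data with a missing entry, to show the guard
df = pd.DataFrame({'Sequence': ['ABCDEFG', None, 'AWODIH']})

# isinstance guards against non-string entries such as None or NaN;
# reusing df.index keeps the mask aligned with the dataframe rows
mask = pd.Series(
    [isinstance(e, str) and 'WOD' in e for e in df['Sequence']],
    index=df.index,
)

# The aligned boolean Series can be used directly for row selection
result = df[mask]
```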

Benchmark

%%timeit
['WOD' in e for e in df.Sequence]
8.26 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df.Sequence.apply(lambda x: 'WOD' in x)
164 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.contains('WOD')   
153 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['Sequence'].str.find('WOD')
159 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
585 µs ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
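
The timings above were taken on three rows, where fixed per-call overhead likely accounts for much of the difference. To check how the methods compare on your own data, something like the following sketch can be run outside IPython; the 30,000-row size and the names `big`, `candidates`, and `timings` are arbitrary choices for illustration, and no particular result is implied:

```python
import timeit

import pandas as pd

base = ['ABCDEFG', 'AWODIH', 'AWODIHAWD']
# Hypothetical larger dataframe: the three original entries repeated
big = pd.DataFrame({'Sequence': base * 10_000})

candidates = {
    'list comprehension': lambda: ['WOD' in e for e in big['Sequence']],
    'apply': lambda: big['Sequence'].apply(lambda x: 'WOD' in x),
    'str.contains': lambda: big['Sequence'].str.contains('WOD', regex=False),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported as seconds per call
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{name}: {timings[name] * 1e3:.3f} ms per call')
```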


 