简体   繁体   中英

Python: Find first non-matching character

Under Python, when you want to obtain the index of the first occurrence of a substring or character within a list, you use something like this:

s.find("f")

However, I'd like to find the index of the first character within the string that does not match. Currently, I'm using the following:

iNum = 0
for i, c in enumerate(line):
  if(c != mark):
    iNum = i
    break

Is there a more efficient way to do this, such as a built-in function I don't know about?

You can use regular expressions, for example:

>>> import re
>>> re.search(r'[^f]', 'ffffooooooooo').start()
4

[^f] will match any character except for f , and the start() method of a Match object (returned by re.search() ) will give the index that the match occurred.

To make sure you can also handle empty strings or strings that only contain f you would want to check to make sure the result of re.search() is not None , which will happen if the regex cannot be matched. For example:

first_index = -1
match = re.search(r'[^f]', line)
if match:
    first_index = match.start()

If you prefer not to use regex, you won't do any better than your current method. You could use something like next(i for i, c in enumerate(line) if c != mark) , but you would need to wrap this with a try and except StopIteration block to handle empty lines or lines that consist of only mark characters.

As python and as simple as possible. replace print(counter) with print counter for python 2.x

s = "ffffff5tgbh44frff"
counter = 0
for c in s:
    counter = counter + 1
    if c != "f":
        break

print (counter)

Now i am curious how these two fare.

>>> # map with a partial function
>>> import functools
>>> import operator
>>> f = functools.partial(operator.eq, 'f')
>>> map(f, 'fffffooooo').index(False)
5
>>> # list comprehension
>>> [c == 'f' for c in 'ffffoooo'].index(False)
4
>>>

I had this same problem and looked into timing the solutions here (except the map/list-comp ones from @wwii which are significantly slower than any other options). I also added in a Cython version of the original version.

I made and tested these all in Python v2.7. I was using byte-strings (instead of Unicode strings). I am unsure if the regular-expression methods need something different to work with byte-strings in Python v3. The 'mark' is hard-coded to being the null byte. This could be easily changed.

All methods return -1 if the entire byte-string is the null-byte. All of these were tested in IPython (lines starting with % are special).

import re

def f1(s): # original version
    for i, c in enumerate(s):
        if c != b'\0': return i
    return -1

def f2(s): # @ChristopherMahan's version
    i = 0
    for c in s:
        if c != b'\0': return i
        i += 1
    return -1

def f3(s): # @AndrewClark's alternate version
    # modified to use optional default argument instead of catching StopIteration
    return next((i for i, c in enumerate(s) if c != b'\0'), -1)

def f4(s): # @AndrewClark's version
    match = re.search(br'[^\0]', s)
    return match.start() if match else -1

_re = re.compile(br'[^\0]')
def f5(s): # @AndrewClark's version w/ precompiled regular expression
    match = _re.search(s)
    return match.start() if match else -1

%load_ext cythonmagic
%%cython
# original version optimized in Cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def f6(bytes s):
    cdef Py_ssize_t i
    for i in xrange(len(s)):
        if s[i] != b'\0': return i
    return -1

The timing results:

s = (b'\x00' * 32) + (b'\x01' * 32) # test string

In [11]: %timeit f1(s) # original version
100000 loops, best of 3: 2.48 µs per loop

In [12]: %timeit f2(s) # @ChristopherMahan's version
100000 loops, best of 3: 2.35 µs per loop

In [13]: %timeit f3(s) # @AndrewClark's alternate version
100000 loops, best of 3: 3.07 µs per loop

In [14]: %timeit f4(s) # @AndrewClark's version
1000000 loops, best of 3: 1.91 µs per loop

In [15]: %timeit f5(s) # @AndrewClark's version w/ precompiled regular expression
1000000 loops, best of 3: 845 ns per loop

In [16]: %timeit f6(s) # original version optimized in Cython
1000000 loops, best of 3: 305 ns per loop

Overall, @ChristopherMahan's version is slightly faster than the original (apparently enumerate is slower than using your own counter). Using the next (@AndrewClark's alternate version) method is slower than the original even though it is essentially the same thing in a one-line form.

Using regular-expresions (@AndrewClark's version) are significantly faster than a loop, especially if you pre-compile the regex!

Then, if you can use Cython, it is by far the fastest. The OP's concern that using a regex is slow is validated, but a loop in Python is even slower. The loop in Cython is quite fast.

Here is a oneliner:

> print([a == b for (a_i, a) in enumerate("compare_me") for
(b_i, b) in enumerate("compar me") if a_i == b_i].index(False))
> 6
> "compare_me"[6]
> 'e'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM