
Pythonic way to find the last position in a string matching a negative regex

In Python, I am trying to find the last position in an arbitrary string that matches a given pattern, where the pattern is specified as a negated character set. For example, with the string uiae1iuae200 and the pattern "not a digit" (the Python regex for this would be [^0-9] ), I would need 8 (the index of the last 'e', just before the '200') as the result.

What is the most pythonic way to achieve this?

Because it is a little tricky to quickly find the best-suited method in the Python docs (method docs sit somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I found quickly is re.search(), but the current form must be a suboptimal way of doing it:

import re
string = 'uiae1iuae200' # the string to investigate
# search the reversed string, then map the index back to the original string
len(string) - 1 - re.search(r'[^0-9]', string[::-1]).start()  # 8

I am not satisfied with this for two reasons: a) I need to reverse the string with [::-1] before searching it, and b) I then need to map the resulting position back (subtracting it from len(string) - 1) because the string was reversed.

There must be better ways to do this, likely even with the result of re.search() .

I am aware of .end() as an alternative to .start() , but re.search() seems to split its result into groups, and I did not quickly find a non-cumbersome way to get at the last matched group. Without specifying a group, .start() , .end() , etc. seem to always refer to the first group, which carries no position information about the last match. Selecting a group, however, seems to require saving the return value in a variable first (which prevents neat one-liners), since I would need to both select the last group and then call .end() on it.

What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.

Update

The solution should also work in corner cases, like 123 (no position that matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). For this reason I'll accept pythonic multi-line solutions to this question.

You can use re.finditer to extract the start positions of all matches and take the last one from the list. Try this Python code:

import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])

Prints:

8

Edit: To make the solution behave properly for all kinds of input, here is the updated code. It now takes two lines, because it has to check whether the list is empty: if so it prints None, otherwise the index value:

import re

arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']

for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)

This prints the following, with None in place of the index when no match is found:

 --> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
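
If building the full list of positions feels wasteful, the same finditer idea can be sketched by walking the iterator and keeping only the final match. The helper name last_match_start is hypothetical and only used for illustration here:

```python
import re


def last_match_start(pattern, string):
    """Return the start index of the last match of pattern, or None."""
    last = None
    # re.finditer yields matches from left to right; keep only the final one.
    for last in re.finditer(pattern, string):
        pass
    return last.start() if last is not None else None


for s in ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']:
    print(repr(s), '-->', last_match_start(r'\D', s))
```

This should give the same results as the list-based version above while holding on to a single match object at a time.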

Edit 2: As the OP stated in the post, \d was only an example to start with, which is why I came up with a solution that works with any general regex. But if the problem really only concerns \d, there is a better solution that needs no list comprehension at all: use a regex that finds the last occurrence of a non-digit character directly and print its position. The regex .*(\D) finds the last non-digit, and its index can be printed with the following Python code:

import re

arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']

for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)

This prints each string with the index of its last non-digit character, or None if there is none:

 --> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19

As you can see, this code doesn't need a list comprehension and is better in that it finds the index with a single regex call to match .

But if the OP did mean for this to work with any general regex pattern, then my code above using the comprehension is needed. I can even write it as a function that takes the regex (like \d, or even a complex one) as an argument, dynamically generates the negation of the passed regex, and uses that in the code. Let me know if this is indeed needed.
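
One caveat with the .*(\D) approach: by default . does not match a newline, so on multi-line input the greedy .* stops at the first line break and the reported index may not be the last one. Passing re.DOTALL is one way around this. A small sketch, where the function name last_non_digit and its dotall parameter are assumptions made for illustration:

```python
import re


def last_non_digit(string, dotall=True):
    """Index of the last non-digit character, or None if there is none."""
    # With re.DOTALL, '.' also matches '\n', so '.*' can span several lines.
    flags = re.DOTALL if dotall else 0
    m = re.match(r'.*(\D)', string, flags)
    return m.start(1) if m else None


s = 'a\nb'
print(last_non_digit(s, dotall=False))  # 1 (the newline itself)
print(last_non_digit(s, dotall=True))   # 2 (the 'b' on the second line)
```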

To me it seems that you just want the last position that matches a given pattern (in this case the "not a number" pattern).
This is as pythonic as it gets:

import re

string = 'uiae1iuae200'
pattern = r'[^0-9]'

match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)

Output:

 8 

Or the exact same as a function and with more test cases:

import re


def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None


cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]

for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')

Output:

[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
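
Note that match.end(1) - 1 is the index of the last character of the final match. That equals the position where the match starts only when the pattern consumes exactly one character, as a character class like [^0-9] does. For longer patterns the two values differ, which a quick check shows (cd is just an assumed example pattern):

```python
import re

# The greedy '.*' backtracks until the group can match as far right as possible.
m = re.match(r'.*(cd)', 'abcdcd')
print(m.start(1))    # 4, where the last 'cd' begins
print(m.end(1) - 1)  # 5, the last character of that match
```

If the function is meant to accept multi-character patterns, returning match.start(1) may be the less surprising choice.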

This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)) , but it's pretty straightforward and probably not too inefficient.

import re


def last_match(pattern, string):
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
    return None  # no suffix matched the pattern

The idea is to iterate over the suffixes of string from the shortest to the longest, and to check whether each one matches pattern at its start (re.match only matches at the beginning of the string it is given).

Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.
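
A variation on the same idea, sketched under the assumption that the pattern does not rely on anchors such as ^ : the match() method of a compiled pattern accepts a pos argument, so the scan can try each start index from the end without slicing out a new substring every time. The name last_match_pos is hypothetical:

```python
import re


def last_match_pos(pattern, string):
    """Return the last index at which pattern matches, or None."""
    compiled = re.compile(pattern)
    # Try each start position, from the end of the string backwards.
    for pos in range(len(string) - 1, -1, -1):
        if compiled.match(string, pos):
            return pos
    return None


print(last_match_pos(r'[^0-9]', 'uiae1iuae200'))  # 8
print(last_match_pos(r'[^0-9]', '123'))           # None
```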

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this content, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 