
Can you search backwards from an offset using a Python regular expression?

Given a string, and a character offset within that string, can I search backwards using a Python regular expression?

The actual problem I'm trying to solve is to get a matching phrase at a particular offset within a string, but I have to match the first instance before that offset.

When the regex is only one symbol long (e.g. a word boundary), I get by with a solution where I reverse the string:

import re
my_string = "Thanks for looking at my question, StackOverflow."
offset = 30
boundary = re.compile(r'\b')
end = boundary.search(my_string, offset)
end_boundary = end.start()
end_boundary

Output: 33

end = boundary.search(my_string[::-1], len(my_string) - offset - 1)
start_boundary = len(my_string) - end.start()
start_boundary

Output: 25

my_string[start_boundary:end_boundary]

Output: 'question'

However, this "reverse" technique won't work if I have a more complicated regular expression that may involve multiple characters. For example, if I wanted to match the first instance of "ing" that appears before a specified offset:

my_new_string = "Looking feeding dancing prancing"
offset = 16 # on the word dancing
m = re.match(r'(.*?ing)', my_new_string) # Except looking backwards

Ideal output: feeding

I can likely use other approaches (split the file up into lines and iterate through the lines backwards), but running a regular expression backwards seems like a conceptually simpler solution.

You can use a positive lookbehind to require that the matched word ends at least 30 characters into the string:

# the regex looks like: r'.*?(\w+)(?<=.{30})'
m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, my_string)
if m:
    print(m.group(1))
else:
    print("no match")
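The same idea can be wrapped in a small function. This is only a sketch: the name word_at_or_after is made up for illustration, and passing re.DOTALL (so that . also matches newlines) is an assumption that goes beyond the snippet above.

```python
import re

def word_at_or_after(text, offset):
    """Return the first word whose end lies at or past `offset`, or None.

    Hypothetical helper: the lazy `.*?` advances one character at a time,
    and the lookbehind only accepts a word once at least `offset`
    characters precede the end of the match.
    """
    m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, text, re.DOTALL)
    return m.group(1) if m else None

my_string = "Thanks for looking at my question, StackOverflow."
print(word_at_or_after(my_string, 30))  # question
```

If the offset is larger than the string, the lookbehind can never succeed and the function returns None.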

For the other example, a negative lookbehind may help:

my_new_string = "Looking feeding dancing prancing"
offset = 16
m = re.match(r'.*(\b\w+ing)(?<!.{%d})' % offset, my_new_string)
if m:
    print(m.group(1))

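The leading .* first matches greedily and then backtracks until the captured word ends at a position where the negative lookbehind (?<!.{16}) succeeds, that is, where fewer than 16 characters precede the end of the match.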
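To reuse this with other patterns and offsets, something like the following might work. It is a sketch rather than a tested library routine: last_match_before is a hypothetical name, the pattern is spliced into the regex as-is, and re.DOTALL is an added assumption for multi-line input.

```python
import re

def last_match_before(pattern, text, offset):
    """Return the last match of `pattern` that ends before `offset`, or None.

    Hypothetical helper: the greedy `.*` pushes the capture as far right
    as possible, and the negative lookbehind rejects any match whose end
    position is `offset` or later.
    """
    guarded = r'.*(%s)(?<!.{%d})' % (pattern, offset)
    m = re.match(guarded, text, re.DOTALL)  # DOTALL: let `.` cross newlines
    return m.group(1) if m else None

my_new_string = "Looking feeding dancing prancing"
print(last_match_before(r'\b\w+ing', my_new_string, 16))  # feeding
```

Because the engine backtracks through .* one character at a time, this can get slow on long strings; for large inputs it may be worth limiting the text to a window around the offset first.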

We can make use of the Python regex engine's preference for greediness (sort of, not really) and tell it that what we want is as many characters as possible, but no more than 30, then ... .

An appropriate regex, then, is r'^.{0,30}(\b)'. We want the start of the first capture group.

>>> import re
>>> boundary = re.compile(r'^.{0,30}(\b)')
>>> boundary.search("hello, world; goodbye, world; I am not a pie").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am not").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am").start(1)
30
>>> boundary.search("hello, world; goodbye, pie").start(1)
26
>>> boundary.search("hello, world; pie").start(1)
17
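The hard-coded 30 can be turned into a parameter. The function below is a minimal sketch of that: boundary_at_or_before is an illustrative name, and re.DOTALL is an assumption added so that newlines count as characters too.

```python
import re

def boundary_at_or_before(text, offset):
    """Return the position of the last word boundary at or before `offset`.

    Hypothetical helper: `.{0,offset}` greedily consumes up to `offset`
    characters and then backs off until `\b` matches.
    """
    pattern = re.compile(r'^.{0,%d}(\b)' % offset, re.DOTALL)
    m = pattern.search(text)
    return m.start(1) if m else None

text = "hello, world; goodbye, world; I am not a pie"
print(boundary_at_or_before(text, 30))  # 30
print(boundary_at_or_before(text, 25))  # 23
```

Note that this constrains where the captured group starts, not where it ends, so replacing (\b) with a longer pattern would find the last match that begins at or before the offset.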
