
Can you search backwards from an offset using a Python regular expression?

Given a string, and a character offset within that string, can I search backwards using a Python regular expression?

The actual problem I'm trying to solve is to get a matching phrase at a particular offset within a string, but I have to match the first instance before that offset.

In a situation where I have a regex that's one symbol long (e.g. a word boundary), I'm using a workaround where I reverse the string.

import re
my_string = "Thanks for looking at my question, StackOverflow."
offset = 30
boundary = re.compile(r'\b')
end = boundary.search(my_string, offset)
end_boundary = end.start()
end_boundary

Output: 33

end = boundary.search(my_string[::-1], len(my_string) - offset - 1)
start_boundary = len(my_string) - end.start()
start_boundary

Output: 25

my_string[start_boundary:end_boundary]

Output: 'question'

However, this "reverse" technique won't work if I have a more complicated regular expression that may involve multiple characters. For example, if I wanted to match the first instance of "ing" that appears before a specified offset:

my_new_string = "Looking feeding dancing prancing"
offset = 16 # on the word dancing
m = re.match(r'(.*?ing)', my_new_string) # Except looking backwards

Ideal output: feeding

I can likely use other approaches (split the file up into lines, and iterate through the lines backwards), but using a regular expression backwards seems like a conceptually simpler solution.

Use a positive lookbehind to make sure at least 30 characters precede the end of the matched word:

import re
# for offset = 30 the pattern expands to r'.*?(\w+)(?<=.{30})'
m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, my_string)
if m:
    print(m.group(1))
else:
    print("no match")
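The same idea can be wrapped in a small helper. This is only a sketch: the name first_word_reaching is made up for illustration, and passing re.DOTALL is an added assumption so that . also crosses newlines (the snippet above does not use it).

```python
import re

def first_word_reaching(text, offset):
    """Return the first word whose end lies at or after `offset`, else None.

    Hypothetical helper, not part of the original answer.
    """
    # (?<=.{N}) only succeeds once at least N characters precede the
    # current position, i.e. once the matched word ends at index >= N.
    pattern = r'.*?(\w+)(?<=.{%d})' % offset
    # re.DOTALL is an assumption added here so '.' also matches newlines.
    m = re.match(pattern, text, re.DOTALL)
    return m.group(1) if m else None

my_string = "Thanks for looking at my question, StackOverflow."
print(first_word_reaching(my_string, 30))  # question
```

Note that this returns the word that reaches the offset, which is not necessarily a word that lies entirely before it.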

For the other example, a negative lookbehind may help:

import re
my_new_string = "Looking feeding dancing prancing"
offset = 16
m = re.match(r'.*(\b\w+ing)(?<!.{%d})' % offset, my_new_string)
if m:
    print(m.group(1))

This first greedily matches as many characters as it can, then backtracks until fewer than 16 characters precede the end of the match, which is what the negative lookbehind (?<!.{16}) checks.
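A different technique, not the lookbehind used above, is to let a compiled pattern stop at the offset via the endpos argument and keep the last match it finds. The sketch below assumes that "before the offset" means the match ends at or before it; last_match_before is a hypothetical name.

```python
import re

def last_match_before(pattern, text, offset):
    """Return the last non-overlapping match ending at or before `offset`.

    Hypothetical helper; returns None when nothing matches.
    """
    last = None
    # endpos=offset makes the engine treat the text as if it ended there.
    for m in re.compile(pattern).finditer(text, 0, offset):
        last = m
    return last

my_new_string = "Looking feeding dancing prancing"
m = last_match_before(r'\b\w+ing', my_new_string, 16)
print(m.group(0) if m else "no match")  # feeding
```

Because finditer scans left to right and yields non-overlapping matches, the last result can differ from what a true right-to-left search would pick when the pattern is able to overlap itself.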

We can make use of the Python regex engine's preference for greediness (sort of, not really), and tell it that what we want is as many characters as possible, but no more than 30, and then ...

An appropriate regex, then, can be r'^.{0,30}(\b)'. We want the start of the first capture.

>>> import re
>>> boundary = re.compile(r'^.{0,30}(\b)')
>>> boundary.search("hello, world; goodbye, world; I am not a pie").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am not").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am").start(1)
30
>>> boundary.search("hello, world; goodbye, pie").start(1)
26
>>> boundary.search("hello, world; pie").start(1)
17
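Combined with a forward search from the offset, this recovers the word from the question without reversing the string. A rough sketch, assuming the offset falls inside a word; word_at is a hypothetical name and re.DOTALL is an added assumption so that . can cross newlines:

```python
import re

def word_at(text, offset):
    """Return the text between the word boundaries surrounding `offset`.

    Hypothetical helper. Returns None if no boundary is found on either
    side, and an empty string if `offset` itself sits on a boundary.
    """
    # Greedy .{0,N} backs off from N until a boundary is found,
    # giving the last boundary at or before the offset.
    before = re.compile(r'^.{0,%d}(\b)' % offset, re.DOTALL).search(text)
    # The first boundary at or after the offset.
    after = re.compile(r'\b').search(text, offset)
    if before is None or after is None:
        return None
    return text[before.start(1):after.start()]

my_string = "Thanks for looking at my question, StackOverflow."
print(word_at(my_string, 30))  # question
```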
