
Python re find start and end index of group match

Python's re match objects have .start() and .end() methods. I want to find the start and end index of a group match. How can I do this? Example:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> match.group('num')
'889'
>>> match.start()
6
>>> match.end()
11
>>> match.group('num').start()                  # just trying this. Didn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'start'
>>> REGEX.groupindex
mappingproxy({'num': 1})                        # this is the index of the group in the regex, not the index of the group match, so not what I'm looking for.

The expected output above is (7, 10)

A workaround for the given example could be using lookarounds:

import re
REGEX = re.compile(r'(?<=h)[0-9]{3}(?=p)')
test = "hello h889p something"
match = REGEX.search(test)
print(match)

Output

<re.Match object; span=(7, 10), match='889'>
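If the indices are needed as a tuple rather than read off the printed match, the match object's span() method returns them directly. A minimal sketch reusing the pattern above; note that Python's re module only accepts fixed-width lookbehind patterns, which holds here because (?<=h) is a single character:

```python
import re

REGEX = re.compile(r'(?<=h)[0-9]{3}(?=p)')
test = "hello h889p something"
match = REGEX.search(test)

# The lookarounds are zero-width, so the whole match is just the digits.
print(match.span())   # (7, 10)
print(match.group())  # 889
```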

You could just use string indexing and the index() method:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> test.index(match.group('num')[0])
7
>>> test.index(match.group('num')[-1])
9

If you want the results as a tuple:

>>> str_match = match.group("num")
>>> results = (test.index(str_match[0]), test.index(str_match[-1]))
>>> results
(7, 9)

Note: As Tom pointed out, you may want to consider using results = (test.index(str_match), test.index(str_match) + len(str_match)) to avoid bugs that arise when the group contains repeated characters. For example, if the number were 899, then results would be (7, 8), since the first instance of 9 is at index 8.
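To make the repeated-character problem concrete, here is a small sketch using the 899 case from the note. The variable names by_chars and by_substring are only illustrative:

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "hello h899p something"
str_match = REGEX.search(test).group('num')

# Character-based lookup: index('9') finds the first 9, not the last one.
by_chars = (test.index(str_match[0]), test.index(str_match[-1]))

# Whole-substring lookup: start of "899" plus its length (end is exclusive).
start = test.index(str_match)
by_substring = (start, start + len(str_match))

print(by_chars)      # (7, 8)
print(by_substring)  # (7, 10)
```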

A slight modification on the existing answer is to use index to find the whole group, rather than the starting and ending characters of the group:

import re
REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "hello h889p something"
match = REGEX.search(test)
group = match.group('num')

# modification here to find the start point
idx = test.index(group)

# find the end point using len of group
output = (idx, idx + len(group)) #(7, 10)

This checks for the whole string "889" when determining the index, so there is a little less potential for error than checking for the first 8 and the first 9. It is still not perfect, though: if "889" appears earlier in the string, not surrounded by "h" and "p", index() will return the position of that earlier occurrence.
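The failure case just described can be reproduced by prefixing the test string with another "889". The sketch below also shows an alternative that does not search the string again: in the standard re module, the match object's start(), end(), and span() methods accept an optional group name or number, so the group's position can be read from the match itself. The modified test string is made up for illustration:

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "889 hello h889p something"
match = REGEX.search(test)
group = match.group('num')

# index() finds the first "889", which is not the one the regex matched.
idx = test.index(group)
print((idx, idx + len(group)))               # (0, 3)

# Asking the match object for the group's own position avoids the search.
print(match.span('num'))                     # (11, 14)
print(match.start('num'), match.end('num'))  # 11 14
```

On the original test string "hello h889p something", match.span('num') gives (7, 10), which is the output the question asks for.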
