Find length of string matched by regex

Question

I am trying to write a script to parse a map file generated by a compiler/linker, that looks like this:

%SEGMENT_SECTION
                                                      Start Address  End Address
--------------------------------------------------------------------------------
Segment Name: S1_1, Segment Type: .bss                0A000000       0A050F23
--------------------------------------------------------------------------------
area1_start.o (.bss)                                  0A000000       0A000003
...

                                                      Start Address  End Address
--------------------------------------------------------------------------------
Segment Name: S2_1, Segment Type: .bss                0A050F24       0A060000
--------------------------------------------------------------------------------
area2_start.o (.bss)                                  0A000000       0A000003

...

%NEXT_SECTION

I am currently writing several regular expressions (Python's re module) to parse this, but I want to write them in a very easy-to-read way, so that the parsing code stays simple. Essentially:

with open('blah.map') as f:
    text = f.read()

# ... Parse the file to update text to be after the %SEGMENT_SECTION

match = segment_header_re.match(text)
seg_name, seg_type, start_addr, end_addr = match.groups()
# ... (Do more with matched values)

text = text[len(match.matched_str):]

# Parse the remainder of text

However, I don't know how to get the length of the matched string, as in the match.matched_str pseudocode above. I don't see anything for this in Python's documentation of re. Is there a better way to do this type of parsing?

Answer 1

For what you are trying to achieve, use the match.span method.

>>> import re
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> m = re.search('brown', s)
>>> m.span()
(10, 15)
>>> start, end = m.span()
>>> s[end:]
' fox jumps over the lazy dog'
>>>

Or just the match.end method.

>>> s[m.end():]
' fox jumps over the lazy dog'
>>>
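
The length asked about in the title follows from the same two positions. A small sketch, reusing the sample string from above (the variable names matched_len and remainder are only illustrative):

```python
import re

s = 'The quick brown fox jumps over the lazy dog'
m = re.search('brown', s)

# Length of the matched text, computed from the match positions.
matched_len = m.end() - m.start()

# The same number, taken from the matched string itself.
assert matched_len == len(m.group(0))

# Everything after the match, without needing the length at all.
remainder = s[m.end():]
```

With re.match and no start position, the match begins at index 0, so m.end() is already the length of the matched string.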

Another option is to use compiled regular expression objects, whose search and match methods take optional pos and endpos arguments that limit the search to a portion of the string.

>>> s = 'The quick brown fox jumps over the lazy dog'
>>> over = re.compile('over')
>>> brown = re.compile('brown')
>>> m_brown = brown.search(s)
>>> m_brown.span()
(10, 15)
>>> m_over = over.search(s)
>>> m_over.span()
(26, 30)

Begin the search for over at the end of the match for brown.

>>> match = over.search(s, pos = m_brown.end())
>>> match.group()
'over'
>>> match.span()
(26, 30)

Searching for brown starting at the end of the match for over will not produce a match.

>>> match = brown.search(s, m_over.end())
>>> match.group()

Traceback (most recent call last):
  File "<pyshell#71>", line 1, in <module>
    match.group()
AttributeError: 'NoneType' object has no attribute 'group'
>>> print(match)
None
>>>

For long strings and multiple searches, passing a start position to a compiled regular expression object avoids copying the rest of the string on every slice, which should speed things up.
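
Applied to the map file in the question, the idea might look like the sketch below. The pattern segment_header_re is a guess based only on the sample lines shown in the question, and the sample text is shortened; a real map file may need a different expression.

```python
import re

# Hypothetical pattern for the "Segment Name: ..." header lines in the question;
# the real map file format may need a different expression.
segment_header_re = re.compile(
    r'Segment Name: (\w+), Segment Type: (\.\w+)\s+([0-9A-F]{8})\s+([0-9A-F]{8})'
)

text = (
    'Segment Name: S1_1, Segment Type: .bss                0A000000       0A050F23\n'
    'area1_start.o (.bss)                                  0A000000       0A000003\n'
    'Segment Name: S2_1, Segment Type: .bss                0A050F24       0A060000\n'
)

segments = []
pos = 0
while True:
    # search() scans forward from pos; text itself is never sliced or copied.
    match = segment_header_re.search(text, pos)
    if match is None:
        break
    segments.append(match.groups())
    pos = match.end()  # continue right after this header
```

Each entry in segments is a (seg_name, seg_type, start_addr, end_addr) tuple, and pos plays the role that the repeatedly sliced text played in the question.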

Answer 2

You can use the .group() method. The entire matched string can be retrieved with match.group(0):

text = text[len(match.group(0)):]

Demo:

>>> import re
>>> re.match('(a)bc(d)', 'abcde').group(0)  # 'e' is excluded since it wasn't matched
'abcd'
>>>
>>> # You can also get individual capture groups by number (starting at 1)
>>> re.match('(a)bc(d)', 'abcde').group(1)
'a'
>>> re.match('(a)bc(d)', 'abcde').group(2)
'd'
>>>

Note, however, that this will raise an AttributeError if there was no match, because re.match returns None in that case:

>>> re.match('xyz', 'abcde').group(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>

You may wish to implement a check that makes sure the match was successful before you go calling methods on the match object.
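
One way such a check could look is sketched below. The helper name strip_match is made up for this example and is not part of the re module.

```python
import re

def strip_match(pattern, text):
    """Return (matched_string, remainder), or (None, text) if there is no match.

    Hypothetical helper for illustration only.
    """
    match = re.match(pattern, text)
    if match is None:
        # No match: leave the text untouched instead of raising AttributeError.
        return None, text
    return match.group(0), text[match.end():]

matched, rest = strip_match('(a)bc(d)', 'abcde')
missing, unchanged = strip_match('xyz', 'abcde')
```

Testing against None explicitly keeps the failure case visible at the call site, which is useful when a parser has to try several patterns in turn.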

solution1
3 ACCEPTED 2015-02-03 17:11:45

solution2
1 2015-02-03 16:57:18
