regex: match exactly three lines

I want to match the following input. How would I match a group a certain number of times without using a multiline string? Something like (^(\d+) (.+)$){3} (but that doesn't work).

sample_string = """Breakpoint 12 reached 
         90  good morning
     91  this is cool
     92  this is bananas
     """
pattern_for_continue = re.compile("""Breakpoint \s (\d+) \s reached \s (.+)$
                                 ^(\d+)\s+  (.+)\n
                                 ^(\d+)\s+  (.+)\n
                                 ^(\d+)\s+  (.+)\n
                                  """, re.M|re.VERBOSE)
matchobj = pattern_for_continue.match(sample_string)
print(matchobj.group(0))

There are several problems with your expression and sample:

  • With VERBOSE, unescaped whitespace in the pattern is ignored, so the spaces around the digits on the first line match nothing. Replace those spaces with \s or [ ] (the latter matches only a literal space; the former also matches newlines and tabs).

  • Your input sample has whitespace before the digit on each line but your example pattern requires that the digits are at the start of the line. Either allow for that whitespace or fix your sample input.

  • The biggest problem is that a capturing group inside a repeated group (so (\d+) inside a larger group with {3} at the end) only keeps its last match. You'll get 92 and this is bananas, not the two lines matched before them.
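The last point is easy to check in isolation. Here is a minimal sketch, using a simplified three-line input without indentation (the variable names are only for illustration), that compares a repeated group with the same line pattern written out three times:

```python
import re

# Simplified input: three numbered lines, no leading whitespace.
text = "90 good morning\n91 this is cool\n92 this is bananas\n"

# A capturing group inside a repeated group is overwritten on each pass,
# so only the values from the final repetition survive.
repeated = re.compile(r"(?:(\d+) (.+)\n){3}")
repeated_groups = repeated.match(text).groups()
print(repeated_groups)

# Repeating the pattern text itself gives every line its own pair of groups.
explicit = re.compile(r"(\d+) (.+)\n" * 3)
explicit_groups = explicit.match(text).groups()
print(explicit_groups)
```

The first call reports only the groups from the third line, while the second reports all six values.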

To overcome all that, you have to repeat that pattern for the three lines explicitly. You could use Python to implement that repetition:

import re

linepattern = r'[ ]* (\d+) [ ]+ ([^\n]+)\n'

pattern_for_continue = re.compile(r"""
    Breakpoint [ ]+ (\d+) [ ]+ reached [ ]+ ([^\n]*?)\n
    {}
""".format(linepattern * 3), re.MULTILINE|re.VERBOSE)

Which, for your sample input, returns:

>>> pattern_for_continue.match(sample_string).groups()
('12', '', '90', 'good morning', '91', 'this is cool', '92', 'this is bananas')

If you really do not want to match spaces before the digits on the 3 extra lines, you can remove the leading [ ]* from linepattern.
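If the number of lines varies, the same idea can be wrapped in a small helper. This is only a sketch: build_breakpoint_pattern and its allow_indent flag are hypothetical names, not part of the re module, and the header mirrors the pattern used above.

```python
import re

def build_breakpoint_pattern(n_lines, allow_indent=True):
    """Compile a pattern for the header plus n_lines numbered lines.

    Hypothetical helper for illustration only.
    """
    # Under re.VERBOSE, [ ] is the explicit way to match a literal space.
    indent = r"[ ]*" if allow_indent else ""
    line = indent + r"(\d+) [ ]+ ([^\n]+)\n"
    header = r"Breakpoint [ ]+ (\d+) [ ]+ reached [ ]+ ([^\n]*?)\n"
    # Repeating the pattern text gives each line its own capturing groups.
    return re.compile(header + line * n_lines, re.VERBOSE)

# The indented sample from the question, spelled out to make whitespace visible.
sample = (
    "Breakpoint 12 reached \n"
    "         90  good morning\n"
    "     91  this is cool\n"
    "     92  this is bananas\n"
    "     "
)

lenient = build_breakpoint_pattern(3)
print(lenient.match(sample).groups())

# Without the leading [ ]*, the indented lines no longer match.
strict = build_breakpoint_pattern(3, allow_indent=False)
print(strict.match(sample))
```

With allow_indent=False the match fails on this input because each numbered line starts with spaces rather than a digit.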

Code

You need something more like this:

import re

sample_string = """Breakpoint 12 reached 
90  hey this is a great line
91  this is cool too
92  this is bananas
"""
pattern_for_continue = re.compile(r"""
    Breakpoint\s+(\d+)\s+reached\s+\n
    (\d+)  ([^\n]+?)\n
    (\d+)  ([^\n]+?)\n
    (\d+)  ([^\n]+?)\n
""", re.MULTILINE|re.VERBOSE)
matchobj = pattern_for_continue.match(sample_string)

for i in range(1, 8):
    print(i, matchobj.group(i))
print("Entire match:")
print(matchobj.group(0))

Result

1 12
2 90
3   hey this is a great line
4 91
5   this is cool too
6 92
7   this is bananas
Entire match:
Breakpoint 12 reached 
90  hey this is a great line
91  this is cool too
92  this is bananas

Reasons

  • re.VERBOSE ignores unescaped whitespace in the pattern, so any whitespace you want to match has to be written explicitly (for example as \s). I sidestepped part of this by left-justifying your data in the multiline string. I think this is justified because you probably don't have that indentation in real input; it's likely an artifact of editing in a multiline string.

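  • you need to replace $ with \n. $ is zero-width and does not consume the newline, so the pattern for the next line can never start matching.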

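  • I used non-greedy matches ([^\n]+?) for the rest of each line. Because [^\n] cannot cross a newline, a greedy [^\n]+ captures the same text here.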
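Each of these points can be checked on its own. The following is a small sketch with made-up inputs (none of the variable names come from the code above):

```python
import re

# 1. Under re.VERBOSE an unescaped space in the pattern is ignored,
#    so this pattern is effectively "reached12".
verbose_plain = re.compile(r"reached 12", re.VERBOSE)
plain_result = verbose_plain.match("reached 12")      # no match
# Writing the space as [ ] (or \s) makes it significant again.
verbose_class = re.compile(r"reached [ ] 12", re.VERBOSE)
class_result = verbose_class.match("reached 12")      # match

# 2. $ is zero-width: it stops before the newline without consuming it,
#    so the digits on the next line are never reached.
dollar_result = re.match(r"(\d+)$(\d+)", "90\n91", re.MULTILINE)
newline_result = re.match(r"(\d+)\n(\d+)", "90\n91")

# 3. [^\n] cannot cross a line break, so greedy and non-greedy
#    versions capture the same text when followed by \n.
greedy = re.match(r"([^\n]+)\n", "abc\ndef\n").group(1)
lazy = re.match(r"([^\n]+?)\n", "abc\ndef\n").group(1)

print(plain_result, class_result.group(0))
print(dollar_result, newline_result.groups())
print(greedy, lazy)
```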
