Python MULTILINE re.findall splitting on one blank line

I have a regular expression that succeeds if the input has 2 blank lines between the string sections that I want to separate. But it fails if there is only one blank line. Can you advise?

I'm scanning a big folder of files for class definitions. Here's an example that works as intended: it finds the 2 classes and separates them.

import re

classstr = """class friendly():
    def __init__(x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def printme(): 
        print(self.x)


class unfriendly():
    def friendme():
        assert self.x == self.y
"""
y = re.findall(r'^(class\s.*)((?:\n^\s.+)+)', classstr, re.MULTILINE)

The output is a list with two tuples, one for each class.

In [15]: y
Out[15]: 
[('class friendly():',
  '\n    def __init__(x, y, z):\n        self.x = x\n        self.y = y\n        self.z = z'),
 ('class unfriendly():',
  '\n    def friendme():\n        assert self.x == self.y')]

If the input classstr is changed to have only one blank row between the classes (as often is the case in example code), then the whole code block comes back in one blob tuple:

[('class friendly():',
  '\n    def __init__(x, y, z):\n        self.x = x\n        self.y = y\n        self.z = z\n\n    def printme():\n        print(self.x)\n\nclass unfriendly():\n    def friendme():\n        assert self.x == self.y')]

I cannot understand how to terminate the re on a single blank line, apparently.

Suggestions?

There are a few things to note:

  • \s can also match a newline.
  • Your pattern uses \n^\s.+, which matches a newline, then one whitespace character, then at least one more character. Because \s can itself be a newline, this also matches a newline, an empty line, and a following line with at least one character.

If there is 1 empty line in between, \n matches the newline that ends the previous line, \s matches the newline of the empty line, and .+ matches the whole next line, so the match runs on into the next class.

The match does not continue across 2 empty lines: \n matches the end of the previous line and \s matches the newline of the first empty line, but .+ cannot match because the second empty line has no characters. That is why the input with 2 blank lines is split the way you expect.
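A small check of this behaviour, as a sketch: the two input strings below are made up for illustration and are not the ones from the question; only the pattern is taken from it.

```python
import re

# Pattern from the question.
pattern = re.compile(r'^(class\s.*)((?:\n^\s.+)+)', re.MULTILINE)

# Illustrative inputs (assumed, not from the question):
# one blank line between the classes ...
one_blank = "class a():\n    x = 1\n\nclass b():\n    y = 2\n"
# ... and two blank lines between the classes.
two_blank = "class a():\n    x = 1\n\n\nclass b():\n    y = 2\n"

found_one = pattern.findall(one_blank)
found_two = pattern.findall(two_blank)

# With one blank line, \s consumes the blank line's newline and the
# second class is swallowed into the first match.
print(len(found_one), found_one)
# With two blank lines, .+ fails on the second empty line, so the
# first match stops and the second class is found separately.
print(len(found_two), found_two)
```

The counts printed should be 1 and 2 respectively, matching the symptom described in the question.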

What you could do is match every following line that does not start with class, and use .* instead of .+ for the rest of the line so that empty lines in between are matched as well.

^(class[^\S\n].*)((?:\n(?!class ).*)+)
  • ^ Start of a line (because re.MULTILINE is set)
  • ( Capture group 1
    • class[^\S\n].* Match class followed by a whitespace character other than a newline, then the rest of the line
  • ) Close group 1
  • ( Capture group 2
    • (?: Non-capturing group (to repeat as a whole)
      • \n(?!class ).* Match a newline, assert that the next line does not start with class followed by a space, then match the rest of that line (which may be empty)
    • )+ Close the non-capturing group and repeat it 1 or more times
  • ) Close group 2
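As a rough illustration, the pattern above can be tried on an input with only one blank line between the classes. The source string here is a shortened, assumed example rather than the exact text from the question.

```python
import re

# Shortened example input (an assumption for illustration):
# exactly one blank line separates the two classes.
source = (
    "class friendly():\n"
    "    def printme(self):\n"
    "        print(self.x)\n"
    "\n"
    "class unfriendly():\n"
    "    def friendme(self):\n"
    "        assert self.x == self.y\n"
)

# Pattern suggested in this answer.
pattern = re.compile(r'^(class[^\S\n].*)((?:\n(?!class ).*)+)', re.MULTILINE)

matches = pattern.findall(source)
for header, body in matches:
    # header is the class line, body is every following line up to
    # (but not including) the next line that starts with "class ".
    print(header)
    print(repr(body))
```

Each class comes back as its own tuple, and the blank line ends up at the tail of the first body instead of joining the two classes together.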

I ran into a little trouble applying the accepted answer. The solution did solve the problem as I presented it, but in practice the files I'm sifting through include other code besides classes. If a def followed a class, the given pattern included that function definition in the output for the preceding class, because the re decides what to include by looking literally for lines that start with anything except "class".

For example, the re gobbled in the def amorous here:

classstr = """class friendly():
    def __init__(x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def printme():
        print(self.x)


class unfriendly():
    def friendme():
        assert self.x == self.y

def amorous():
    x = 3
"""

To avoid that, I adjusted the re in the solution to include every following line that does not begin with a non-whitespace character (that is, indented or empty lines), rather than every line that does not begin literally with "class".

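# (?!\S+) rejects any line whose first character is not whitespace,
# so the match stops at the next top-level statement.
y = re.findall(r'^(class[^\S\n].*)((?:\n(?!\S+).*)+)', classstr,
               re.MULTILINE)
# Join the class line and its body back into one string per class.
y2 = ["".join(i) for i in y]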

And the output is good, at least for this example.

>>> pprint(y2, compact=True)
['class friendly():\n'
 '    def __init__(x, y, '
 'z):\n'
 '        self.x = x\n'
 '        self.y = y\n'
 '        self.z = z\n'
 '\n'
 '    def printme():\n'
 '        print(self.x)\n'
 '\n',
 'class unfriendly():\n'
 '    def friendme():\n'
 '        assert self.x == self.y\n']

I'm waiting to see what other side effects this revision causes :)

Meanwhile, I have found some other benefits from this change. Using a similar re to separate def function declarations works fine, even for nested functions.
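The exact re used for functions is not shown above, so the following is only a guess at what a "similar re" might look like: the class pattern with class swapped for def. The input string and the names in it are hypothetical.

```python
import re

# Hypothetical input: a top-level function containing a nested one,
# followed by a second top-level function.
source = (
    "def outer():\n"
    "    def inner():\n"
    "        return 1\n"
    "\n"
    "    return inner()\n"
    "\n"
    "def other():\n"
    "    return 2\n"
)

# Assumed variant of the class pattern: ^def only matches at column 0,
# and (?!\S) keeps indented or empty lines with the current function.
pattern = re.compile(r'^(def[^\S\n].*)((?:\n(?!\S).*)+)', re.MULTILINE)

functions = ["".join(parts) for parts in pattern.findall(source)]
for text in functions:
    print(repr(text))
```

In this sketch the nested inner function stays inside the block for outer because its def line is indented, so only the two top-level functions are split apart.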
