Python regex capturing extra newline

So, I've been cooking up some regex, and it seems the regex library is capturing an extra newline when I use ((.|\s)*) to capture multi-line text. [\S\s]* works for some reason:

As you can see below, the first regex produces an additional \n group. Why?

>>> import re
>>> s = """
... #pragma whatever
... #pr
... asdfsadf
... #pragma START-SomeThing-USERCODE
... this is the code
... this is more
... #pragma END-SomeThing-USERCODE
... asd
... asdf
... sadf
... sdaf
... """
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)((.|\s)*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s)
[('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)([\S\s]*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s)
[('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]

The subregex

((.|\s)*)

matches "this is the code\nthis is more\n". The outer parentheses capture this entire string.

The inner parentheses capture one character at a time (either any character besides a newline, or a whitespace character, which includes newlines). Since that group is repeated, the contents of the group are overwritten with each repetition. At the end of the match, only the last character that was matched (\n) is kept in that group.
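To see the overwriting in isolation, here is a minimal sketch on a short made-up string (the sample text is an assumption for illustration, not the input from the question):

```python
import re

# A capturing group inside a repetition keeps only its final iteration.
m = re.match(r"((.|\s)*)", "ab\ncd\n")

whole = m.group(1)  # everything the outer group matched
last = m.group(2)   # only the last single character the inner group matched

print(repr(whole))  # 'ab\ncd\n'
print(repr(last))   # '\n'
```

re.findall returns one tuple entry per capturing group, which is why that leftover '\n' shows up as an extra item in the result.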

So, if you want to avoid that, either make the inner group non-capturing:

((?:.|\s)*)

or use the ([\s\S]*) idiom for matching truly any character. It might be a good idea to use the lazy form ([\s\S]*?), though, to make sure that as few characters as possible are matched, so the match stops at the first END marker rather than the last one.
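A rough sketch of both fixes, using shortened patterns and made-up sample strings (the BEGIN/END markers in the second part are hypothetical, chosen only to keep the example small):

```python
import re

text = "#pragma START-A-USERCODE\nline 1\nline 2\n#pragma END-A-USERCODE\n"

# Non-capturing inner group: the stray '\n' entry is gone.
non_capturing = re.findall(
    r"#pragma START-(.*)-USERCODE\s*\n((?:.|\s)*)#pragma END", text
)
print(non_capturing)  # [('A', 'line 1\nline 2\n')]

# Greedy vs. lazy when the input contains two blocks.
two_blocks = "BEGIN\na\nEND\nBEGIN\nb\nEND\n"
greedy = re.findall(r"BEGIN\n([\s\S]*)END", two_blocks)
lazy = re.findall(r"BEGIN\n([\s\S]*?)END", two_blocks)
print(greedy)  # ['a\nEND\nBEGIN\nb\n']
print(lazy)    # ['a\n', 'b\n']
```

With a single block in the input both quantifiers give the same result; the lazy form only matters once a second END marker can appear later in the text.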

This expression produces a nested group

((.|\s)*)

because you use nested parentheses. For a single-character OR, square brackets (a character class) are the proper choice; the alternation syntax is suitable when you want to choose between two words:

(treat|trick)
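As a small illustration of the difference (the sample strings are made up):

```python
import re

# Single characters: a character class does the job without any group.
singles = re.findall(r"[ab]", "cabbage")
print(singles)  # ['a', 'b', 'b', 'a']

# Whole words: alternation inside a group.
words = re.findall(r"(treat|trick)", "trick or treat")
print(words)  # ['trick', 'treat']
```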
