
Python re.findall weird behaviour

>>> import re
>>> text =\
... """xyxyxy testmatch0
... xyxyxy testmatch1
... xyxyxy
... whyisthismatched1
... xyxyxy testmatch2
...  xyxyxy testmatch3
... xyxyxy
... whyisthismatched2
... """
>>> re.findall(r"^\s*xyxyxy\s+([a-z0-9]+).*$", text, re.MULTILINE)
[u'testmatch0', u'testmatch1', u'whyisthismatched1', u'testmatch2', u'testmatch3', u'whyisthismatched2']

I expected the lines containing "whyisthismatched" not to match.

The Python re documentation states the following:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

Is this really the expected behaviour, or is it a bug? If it is expected, could someone please explain why those lines match and how I should modify my pattern to get the result I expect:

[u'testmatch0', u'testmatch1', u'testmatch2', u'testmatch3']

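Newlines are whitespace too as far as the \s character class is concerned, so the \s+ after xyxyxy can consume the line break and let the group capture the start of the following line. If you want to match spaces only, you need to match [ ] instead: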

>>> re.findall(r"^\s*xyxyxy[ ]+([a-z0-9]+).*$", text, re.MULTILINE)
[u'testmatch0', u'testmatch1', u'testmatch2', u'testmatch3']
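If tabs or other horizontal whitespace should also be accepted, one possible variation (not part of the original answer) is the class [^\S\n], which reads as "any whitespace except a newline". The sketch below assumes Python 3, so the results print without the u prefix, and the variable names are only illustrative. It runs the original pattern and the variation side by side on the same text:

```python
import re

# Same sample text as in the question, written out with explicit newlines.
text = (
    "xyxyxy testmatch0\n"
    "xyxyxy testmatch1\n"
    "xyxyxy\n"
    "whyisthismatched1\n"
    "xyxyxy testmatch2\n"
    " xyxyxy testmatch3\n"
    "xyxyxy\n"
    "whyisthismatched2\n"
)

# Original pattern: \s+ can consume the newline after a bare "xyxyxy",
# so the group ends up capturing the word on the next line.
crosses_lines = re.findall(r"^\s*xyxyxy\s+([a-z0-9]+).*$", text, re.MULTILINE)

# Variation: [^\S\n] matches whitespace but never a newline,
# so the separator and the captured word must sit on the same line.
same_line = re.findall(
    r"^[^\S\n]*xyxyxy[^\S\n]+([a-z0-9]+).*$", text, re.MULTILINE
)

print(crosses_lines)
print(same_line)
```

Using [ \t] in place of [^\S\n] would be a narrower alternative if only spaces and tabs are expected in the input.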
