
Python - odd regex matching with + / * on group

>>> import re
>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s*(\w+\.)+', src).groups()
('submod.',)

This regex seems to put everything that is not whitespace into the group, so nothing should be lost before the match stops.

Why does the group contain only the last "+" repetition here, and not ('pkg.subpkg.submod.',)?

Or ('pkg.',), an early stop after the first repetition, with no "loss of information" in another sense?

(I needed to add an inner non-capturing group (?:...), as in r'\s((?:\w+\.)+)'.)

Even more strange:

>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s(\w+\.)*', src).groups()
(None,)

Edit: the "more strange" case is actually "less strange", as @Avinash Raj pointed out: contrary to what I intended, the match simply ends before the group. So

>>> re.search(r'\s+(\w+\.)*', '  pkg.subpkg.submod.thing').groups()
('submod.',)

... then shows the same behavior in question as "+": only the last repetition is captured, and everything before it seems lost.

I'll explain the "even more strange" part.

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

re.search stops as soon as it finds the first match. So,

r'\s(\w+\.)*' matches the first space character (* repeats the preceding pattern zero or more times). The character after that first space is another space, so (\w+\.)* matches zero times there. As a result, groups() on the match object returns (None,), and group() returns the matched text, which is just that first space.

I do not know why this seems strange to you. What did you expect?
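A quick way to check this is to look at the whole match and its span next to groups(). This is only a sketch using the standard re module and the src string from the question; the variable names are mine.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m = re.search(r'\s(\w+\.)*', src)

whole_match = m.group()   # text matched by the entire pattern
span = m.span()           # (start, end) of the match inside src
groups = m.groups()       # the group repeated zero times, so it is None

print(repr(whole_match), span, groups)
```

The match covers only the first character of src, which is why the group never gets a chance to take part.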

In the documentation you find the following:

re.search(pattern, string, flags=0) Scan through string looking for the first location where the regular expression pattern ...

re.search(r'\s*(\w+\.)+', src).groups()

Your pattern has only one capturing group: (\w+\.). Because + is greedy, the group is repeated over pkg., subpkg. and submod. in turn. Each repetition overwrites what the group captured before, so only the last one, submod., is still in the group when the match ends.

Your second attempt does match, but * allows zero repetitions, so the group is not needed to satisfy the pattern: the match ends right after the first space, the group never takes part, and you find None in it.

Is this what you are looking for?
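The following sketch illustrates that a repeated capturing group keeps only its last repetition, while the overall match still covers all of them. It assumes the src string from the question; running re.findall over the matched text is just one possible way to list the individual repetitions, not something the pattern itself provides.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m = re.search(r'\s*(\w+\.)+', src)

whole_match = m.group(0)      # every repetition is part of the full match
last_iteration = m.group(1)   # the group itself holds only the final repetition

# List each repetition by applying the inner pattern on its own
# to the text that the whole pattern matched.
iterations = re.findall(r'\w+\.', whole_match)

print(repr(whole_match))
print(repr(last_iteration))
print(iterations)
```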

re.search(r'\s*((\w+\.)+)', src).groups()[0]

Try out the following to understand it better:

re.search(r'\s*((\w+\.)*)(\w+\.)*', 'a.b.c.d.e.f.g.h.i').groups()
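For reference, here is that experiment written out as a small script, with the three groups unpacked into names of my own choosing. The outer group collects the whole dotted prefix, the inner repeated group keeps only its last repetition, and the trailing group gets nothing because the greedy first part has already consumed every "letter plus dot" pair.

```python
import re

text = 'a.b.c.d.e.f.g.h.i'

m = re.search(r'\s*((\w+\.)*)(\w+\.)*', text)

# Group 1 wraps the repetition, group 2 is the repeated group itself,
# group 3 is the second repeated group, which matches zero times here.
outer, inner_last, trailing = m.groups()

print(repr(outer), repr(inner_last), trailing)
```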

This should work to match the string '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  ' almost completely (everything up to, but not including, the last trailing space):

(\s*(\w+[.\s])+)+

If you only want the first part, '  pkg.subpkg.submod.thing ', use this:

\s*(\w+[.\s])+
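As a sketch, the two patterns above can be tried against the src string from the question like this (the variable names are mine). Note that what is compared here is group(), the whole match; the capturing groups inside these patterns still hold only their last repetition.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

# Outer repetition: keeps going across both dotted names.
both_names = re.search(r'(\s*(\w+[.\s])+)+', src).group()

# Single run: stops after the whitespace that follows the first name.
first_name = re.search(r'\s*(\w+[.\s])+', src).group()

print(repr(both_names))
print(repr(first_name))
```

The first result stops one character short of the end of src, because the final space has no word in front of it to complete another \w+[.\s] repetition.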
