How to capture multiple repeating patterns with regular expression?

I have a string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}} and I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in Python. However, a repeated capturing group keeps only its last match. For example, r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" returns only the last matched group: search(r"\mypath{{path1}{path2}}") gives groups() as ("{path2}",).

Then I found an alternative way to do this:

    >>> import re
    >>> gpathRegexPat = r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
    >>> gpathRegexCp = re.compile(gpathRegexPat)
    >>> strpath = gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
    >>> strpath
    '{sadf}{ad}'
    >>> p = re.compile(r'\{([^\{\}\[\]]*)\}')
    >>> p.findall(strpath)
    ['sadf', 'ad']

or:

    >>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
    >>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
    >>> strpath=gpathRegexCp.search(r'\input{{whatever}{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
    >>> strpath
    '\\mypath{{sadf}{ad}}'
    >>> p.findall(strpath)
    ['sadf', 'ad']

At this point I thought: why not just use findall on the original string? I could use gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})" . My reasoning: if the first (?:\{[^\{\}\[\]]*\})*? matches 0 times and the second (?:\{[^\{\}\[\]]*\})*? matches 1 time, it captures sadf; if the first matches 1 time and the second matches 0 times, it captures ad. However, findall returns only ['sadf'] with this regex.

Without the extra patterns ( (?:\\mypath\{) and (?:\}) ), it actually works:

    >>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
    >>> p2.findall(strpath)
    ['sadf', 'ad']
    >>> p2.findall('{adadd}{dfada}{adafadf}')
    ['adadd', 'dfada', 'adafadf']

Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?

You are right. Python's re module cannot return every repetition of a subgroup inside a group; only the last repetition is kept. To do what you want, use one regular expression to capture the whole group and then a second regular expression to extract the repeated subgroups.

In this case that would be something like: \\mypath\{((?:\{.*?\})+)\} . Its first group will return {path1}{path2}{path3}

Then, to find the repeating {pathn} patterns inside that string, you can simply use \{(.*?)\} . This will match anything within the braces. The .*? is a non-greedy version of .* , meaning it returns the shortest possible match instead of the longest possible match.
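The two-step approach can be sketched roughly as follows. This is only an illustration, not the answerer's exact code: the sample string drops the "..." placeholder from the question, the outer pattern is a variant with a capturing group around the repeated items, and extract_paths is a hypothetical helper name.

```python
import re

# Sample input modelled on the question (the "..." placeholder is left out).
text = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}}\shape{{0.2}{0.3}}"

# A repeated capturing group keeps only its final repetition.
last_only = re.search(r"\\mypath\{(\{[^{}]*\})*\}", text).groups()

def extract_paths(s):
    # Hypothetical helper: return the items inside the mypath block of s.
    # Step 1: capture everything between the outer braces of the block.
    outer = re.search(r"\\mypath\{((?:\{.*?\})+)\}", s)
    if outer is None:
        return []
    # Step 2: pull each {...} item out of the captured block.
    return re.findall(r"\{(.*?)\}", outer.group(1))

print(last_only)            # ('{path3}',)
print(extract_paths(text))  # ['path1', 'path2', 'path3']
```

The first print shows the limitation from the question; the second shows the result of running the two patterns one after the other.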

    re.findall(r"{([^{}]+)}", text)

should work. Applied to just the \mypath{...} part of the string, it returns

    ['path1', 'path2', 'path3', 'pathn']

Finally, putting it together:

    import re
    my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
    # get the \mypath part
    my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
    print(re.findall(r"{([^{}]+)}", my_path2))

Or, if every path has the form path<num>:

    re.findall(r"{(path\d+)}", text)  # returns only items like path<num> inside {}
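As a small comparison of the two findall patterns, here is a sketch run on just the mypath segment of the question's string. The variable names are made up for illustration. Note that the stricter pattern skips the literal {pathn} item, because n is not a digit.

```python
import re

# Only the mypath segment of the question's string, as produced by the split above.
segment = "mypath{{path1}{path2}{path3}...{pathn}}"

# Any non-empty, brace-free content between a pair of braces.
any_item = re.findall(r"\{([^{}]+)\}", segment)

# Only items of the form path<digits>.
numbered = re.findall(r"\{(path\d+)\}", segment)

print(any_item)  # ['path1', 'path2', 'path3', 'pathn']
print(numbered)  # ['path1', 'path2', 'path3']
```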
