
List all matches to a "one or more" regex

I am trying to tokenize organic chemical names, i.e. split "hexane" into its constituent parts, ['hex', 'an', 'e'].

At the core of this issue is: how do I list all matches to a "one or more" regex, rather than just the last match to that regex?

I am testing by using the following code:

    print("Regex:", reg)
    print("Findall:", re.findall(reg, name))
    print("Finditer", [item.groups() for item in list(re.finditer(reg, name))])
    print("Search:", re.search(reg, name).groups())
    print("Split", re.split(reg, name))
    print("Match", re.match(reg, name).groups())

In all of my tests, name = "hexane". This should parse out to ['hex', 'an', 'e']. My attempted regexes follow the pattern of "\\A({many groups added here, separated by bars})\\Z", where the many groups are a subset of the prefixes and suffixes available to organic chemicals.

When using a regex without parentheses on each section of my regex, I get this output:

Regex: \A((-|nonadeca|heptadeca|tetradec|imine|hept|heptadec|benzene|cyclo|oate|tetradeca|hex|yn|octa|phenyl|arsine|yl|dodec|e|eth|meth|pentadec|nona|phosphino|octadec|di|formyl|arsino|oct|oxo|tridec|penta|pent|dodeca|hydroxy|hexadec|hexa|ol|an|oyl|ether|non|trideca|prop|undec|hepta|pentadeca|nonadec|amine|tri|but|carbonyl|deca|en|amino|undeca|hexadeca|thiol|oxy|tetra|dec|carboxy|chloro|mercapto|iodo|fluoro|octadeca|imino|bromo|al|phosphine|carboxylicacid|amide|one|amido|oicacid)+)\Z
Findall: [('hexane', 'e')]
Finditer [('hexane', 'e')]
Search: ('hexane', 'e')
Split ['', 'hexane', 'e', '']
Match ('hexane', 'e')

This shows that the regex must have correctly found the ['hex', 'an', 'e'] split, as no other combination of parts will provide a comprehensive \\A-STUFF_IN_HERE-\\Z match. However, none of the results provide the molecule split into its component parts for my use.
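The behaviour above comes from how Python's re module treats a capture group inside a repetition: the group is overwritten on every pass, so only the final pass is reported. A minimal sketch, using a hypothetical three-fragment pattern (`small_reg`) rather than the full list:

```python
import re

# Hypothetical reduced pattern: just enough fragments to parse "hexane".
small_reg = r"\A((hex|an|e)+)\Z"
m = re.match(small_reg, "hexane")

# Group 1 wraps the whole repetition, so it spans the full name.
# Group 2 sits inside the repetition and is overwritten on each pass,
# so only the last fragment survives.
whole, last_fragment = m.groups()
print(whole, last_fragment)  # hexane e
```

The earlier fragments ('hex' and 'an') were matched along the way, but the match object does not keep them.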

Putting parentheses around each part gives the following result:

Regex: \A(-|(tetradec)|(thiol)|(phenyl)|(arsino)|(carbonyl)|(one)|(e)|(fluoro)|(ol)|(ether)|(eth)|(trideca)|(hex)|(iodo)|(nonadeca)|(non)|(pent)|(al)|(octa)|(octadec)|(di)|(undeca)|(arsine)|(tri)|(cyclo)|(prop)|(nona)|(dodec)|(phosphine)|(yn)|(but)|(an)|(heptadeca)|(carboxy)|(imine)|(hept)|(octadeca)|(amide)|(imino)|(deca)|(dodeca)|(oct)|(hydroxy)|(bromo)|(undec)|(pentadeca)|(tetra)|(hexadec)|(benzene)|(phosphino)|(hexa)|(tridec)|(mercapto)|(dec)|(oyl)|(oxy)|(meth)|(penta)|(amido)|(oicacid)|(amine)|(yl)|(nonadec)|(tetradeca)|(hexadeca)|(carboxylicacid)|(amino)|(chloro)|(pentadec)|(en)|(hepta)|(heptadec)|(formyl)|(oate)|(oxo))+\Z
Findall: [('e', '', '', '', '', '', '', 'e', '', '', '', '', '', 'hex', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'an', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
Finditer [('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)]
Search: ('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
Split ['', 'e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, '']
Match ('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)

This again shows that the ['hex', 'an', 'e'] parts are being successfully parsed, but it does not provide those parts for me in an easy list.

Note: There are ambiguities, such as the one between the "hex" and "hexa" prefixes, that make a simple left-to-right re.split or re.findall without the \\A\\Z anchors infeasible. Either precedence goes to "hex" in all cases, so that "hexapentyldecane" parses as ["hex", ?????] and breaks on the trailing "a", or precedence goes to "hexa", so that "hexane" parses as ["hexa", ???] and breaks on the trailing "n".
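The ambiguity can be reproduced with a small experiment. This is only a sketch: the fragment lists and the `scan` helper are hypothetical and far shorter than the real list, and the two patterns differ only in whether "hex" or "hexa" comes first.

```python
import re

# Hypothetical reduced fragment lists; only the order of "hex"/"hexa" differs.
hex_first = r"hex|hexa|an|e|pent|yl|dec"
hexa_first = r"hexa|hex|an|e|pent|yl|dec"

def scan(pattern, name):
    """Unanchored left-to-right scan.

    Returns the tokens found and whether they cover the whole name.
    """
    tokens = re.findall(pattern, name)
    return tokens, "".join(tokens) == name

# "hex" first: fine for "hexane", but the "a" of "hexa" is silently skipped.
print(scan(hex_first, "hexane"))
print(scan(hex_first, "hexapentyldecane"))

# "hexa" first: now "hexane" loses its "n".
print(scan(hexa_first, "hexane"))
```

Comparing the joined tokens with the original name is one cheap way to detect that characters were skipped, although it does not say how to repair the parse.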

When you don't know ahead of time how many match groups there will be, a single regex can't capture them all in a convenient structure. You can, however, loop over the string and strip one fragment at a time:

    import re

    string = 'hexane'
    while True:
        oldstring = string
        # Strip one known fragment from the front of the string.
        string = re.sub(r'\A(-|nonadeca|heptadeca|tetradec|imine|hept|heptadec|benzene|cyclo|oate|tetradeca|hex|yn|octa|phenyl|arsine|yl|dodec|e|eth|meth|pentadec|nona|phosphino|octadec|di|formyl|arsino|oct|oxo|tridec|penta|pent|dodeca|hydroxy|hexadec|hexa|ol|an|oyl|ether|non|trideca|prop|undec|hepta|pentadeca|nonadec|amine|tri|but|carbonyl|deca|en|amino|undeca|hexadeca|thiol|oxy|tetra|dec|carboxy|chloro|mercapto|iodo|fluoro|octadeca|imino|bromo|al|phosphine|carboxylicacid|amide|one|amido|oicacid)', '', string)
        if string == oldstring:
            # Nothing was stripped: no fragment matches here, so stop
            # instead of looping forever.
            print('Unparsed remainder:', string)
            break
        if not string:
            # The whole remainder was the final fragment.
            print(oldstring)
            break
        # Print the fragment that was just removed from the front.
        print(oldstring[0:-len(string)])
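The above is not particularly elegant, but should at least get you started. Note that at each step it takes the first alternative that matches and never reconsiders an earlier choice, so it does not by itself resolve ambiguities such as "hex" versus "hexa".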
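To cope with ambiguities such as "hex" versus "hexa" from the question, one option is to drop the regex for the splitting step and backtrack explicitly over the fragment list. The following is a rough sketch, not a tested chemistry parser: the `tokenize` function and the reduced `FRAGMENTS` list are hypothetical names introduced here for illustration.

```python
from functools import lru_cache

# Hypothetical reduced fragment list; substitute the full list in practice.
FRAGMENTS = ("hex", "hexa", "an", "e", "pent", "yl", "dec")

def tokenize(name, fragments=FRAGMENTS):
    """Split name into known fragments, or return None if no full split exists."""

    @lru_cache(maxsize=None)
    def solve(start):
        # Reached the end of the name: nothing left to split.
        if start == len(name):
            return ()
        # Try every fragment that fits here; fall through to the next
        # candidate when the remainder cannot be split.
        for frag in fragments:
            if name.startswith(frag, start):
                rest = solve(start + len(frag))
                if rest is not None:
                    return (frag,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)

print(tokenize("hexane"))            # ['hex', 'an', 'e']
print(tokenize("hexapentyldecane"))  # ['hexa', 'pent', 'yl', 'dec', 'an', 'e']
print(tokenize("xyz"))               # None
```

The cache keeps each start position from being solved more than once. If a name has several valid splits, this sketch returns the first one found in fragment order, which may or may not be the chemically meaningful one.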
