
Regex named groups and conditional logic

Consider the following string (edit: this is not a question about parsing HTML with regexes, just an exercise with named groups):

s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""

Is it possible to use re.sub and named groups to transform the string into this result?

<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

Right now I have the following code:

import re
regex = re.compile(r"(<(?P<end>\/*)A(?P<number>\d+)>)")
print(regex.sub(r'<\g<end>test number="\g<number>">', s))

which gives the following result:

<T1>
  <test number="1">
  lorem ipsum
  </test number="1">
</T1>

Can an | operator be used, like in this question?

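import re

x = """<T1>
    <A1>
    lorem ipsum
    </A1>
  </T1>"""

def repl(obj):
    # group(1) is the optional slash: non-empty only for a closing tag
    if obj.group(1):
        return '/test'
    else:
        # group(2) is the number captured from the opening tag
        return 'test number="' + obj.group(2) + '"'

print(re.sub(r"(\/*)A(\d+)", repl, x))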

You can try the replacement-function form of re.sub: the function receives each match object and returns the replacement text.

Try to match the whole element: not only the opening and closing tags, but also its contents.

Regex:

(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>

Replacement string:

<test number="\g<number>">\5</test>

DEMO

>>> s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""
>>> import re
>>> print(re.sub(r'(?s)(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>', r'<test number="\g<number>">\5</test>', s))
<T1>
        <test number="1">
        lorem ipsum
        </test>
      </T1>

(?s) is the DOTALL modifier, which makes the dot in your regex match newline characters as well.
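To see the effect of the modifier in isolation, here is a small sketch on a shortened, made-up input (`text` is not from the original post):

```python
import re

text = "<A1>\nlorem ipsum\n</A1>"

# Without DOTALL, "." stops at a newline, so the tag body cannot be spanned.
no_flag = re.search(r"<A1>(.*?)</A1>", text)

# With the inline (?s) flag (equivalent to passing re.DOTALL),
# "." also matches "\n" and the lazy group reaches the closing tag.
with_flag = re.search(r"(?s)<A1>(.*?)</A1>", text)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # '\nlorem ipsum\n'
```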

You can use look-arounds to match the text between <T1> and </T1>:

>>> p = re.compile(r'(?<=<T1>)[^<]+?(.+)(?=</T1>)', re.MULTILINE | re.IGNORECASE | re.DOTALL)
>>> s2='\n  <test number="1">\n  lorem ipsum\n  </test>\n'
>>> print(p.sub(s2, s))
<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

This uses the following flags:

re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
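As a quick illustration of the first two flags, here is a sketch on a made-up two-line string (`text` is an assumption, not part of the answer):

```python
import re

text = "first line\nSecond line"

# MULTILINE lets "^" anchor at the start of every line;
# IGNORECASE lets [a-z] match the uppercase "S" as well.
both_flags = re.findall(r"^[a-z]+", text, re.IGNORECASE | re.MULTILINE)

# Without MULTILINE, "^" only anchors at the start of the whole string.
ignorecase_only = re.findall(r"^[a-z]+", text, re.IGNORECASE)

print(both_flags)       # ['first', 'Second']
print(ignorecase_only)  # ['first']
```

Flags belong in the flags argument of re.compile or re.findall. The third positional argument of a compiled pattern's sub method is count, so a flag constant passed there is read as a number rather than as a flag.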
