简体   繁体   中英

Unicode re.sub() doesn't work with \g<0> (group 0)

Why doesn't the \\g<0> work with unicode regex?

When I tried to use \\g<0> to insert a space before and after the group with normal string regex, it works:

>>> punct = """,.:;!@#$%^&*(){}{}|\/?><"'"""
>>> rx = re.compile('[%s]' % re.escape(punct))
>>> text = '''"anständig"'''
>>> rx.sub(r" \g<0> ",text)
' " anst\xc3\xa4ndig " '
>>> print rx.sub(r" \g<0> ",text)
 " anständig " 

but with unicode regex, the space isn't added:

>>> punct = u""",–−—’‘‚”“‟„!£"%$'&)(+*-€/.±°´·¸;:=<?>@§#¡•[˚]»_^`≤…\«¿¨{}|"""
>>> rx = re.compile("["+"".join(punct)+"]", re.UNICODE)
>>> text = """„anständig“"""
>>> rx.sub(ur" \g<0> ", text)
'\xe2\x80\x9eanst\xc3\xa4ndig\xe2\x80\x9c'
>>> print rx.sub(ur" \g<0> ", text)
„anständig“
  1. How do I get \\g to work in unicode regex?
  2. If (1) is not possible, how do I get the unicode regex input the space before and after a character in punct ?

I think you have two errors. First, you are not escaping punct like in the first example with re.escape and you have characters like [] that need to be escaped. And second, text variable is not unicode. Example that works:

>>> punct = re.escape(u""",–−—’‘‚”“‟„!£"%$'&)(+*-€/.±°´·¸;:=<?>@§#¡•[˚]»_^`≤…\«¿¨{}|""")
>>> rx = re.compile("["+"".join(punct)+"]", re.UNICODE)
>>> text = u"""„anständig“"""
>>> print rx.sub(ur" \g<0> ", text)
 „ anständig “

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM