简体   繁体   中英

Can the [a-zA-Z] Python regex pattern be made to match and replace non-ASCII Unicode characters?

In the following regular expression, I would like each character in the string replaced with an 'X', but it isn't working.

In Python 2.7:

>>> import re
>>> re.sub(u"[a-zA-Z]","X","dfäg")
'XX\xc3\xa4X'

or

>>> re.sub("[a-zA-Z]","X","dfäg",re.UNICODE)
u'XX\xe4X'

In Python 3.4:

>>> re.sub("[a-zA-Z]","X","dfäg")
'XXäX'

Is it possible to somehow 'configure' the [a-zA-Z] pattern to match 'ä', 'ü', etc.? If this can't be done, how can I create a similar character range pattern between square brackets that would include Unicode characters in the usual 'full alphabet' range? I mean, in a language like German, for instance, 'ä' would be placed somewhere close to 'a' in the alphabet, so one would expect it to be included in the 'az' range.

You may use

(?![\d_])\w

With the Unicode modifier. The (?![\\d_]) look-ahead is restricting the \\w shorthand class so as it could not match any digits ( \\d ) or underscores.

See regex demo

A Python 3 demo :

import re
print (re.sub(r"(?![\d_])\w","X","dfäg"))
# => XXXX

As for Python 2 :

# -*- coding: utf-8 -*-
import re
s = "dfäg"
w = re.sub(ur'(?![\d_])\w', u'X', s.decode('utf8'), 0, re.UNICODE).encode("utf8")
print(w)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM