How do I specify a range of Unicode characters from ' ' (space) to \u00D7FF?
I have a regular expression like r'[\u0020-\u00d7ff]' and it won't compile, saying that it's a bad range. I am new to Unicode regular expressions, so I haven't had this problem before.
Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?
The syntax of your Unicode range will not do what you expect.
The raw r'' string prevents \u escapes from being parsed by Python, and the regex engine (in Python 2) does not parse them either.
The only range in this set is [0-\u]:
>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
  literal 117
  literal 48
  literal 48
  literal 50
  range (48, 117)
  literal 48
  literal 48
  literal 100
  literal 55
  literal 102
  literal 102
Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up.
The syntax is \uxxxx or \Uxxxxxxxx, so it's parsed as "\u00d7, f, f".
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
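The four-digit rule can be checked without the regex engine at all. The following is only a minimal sketch, and it assumes Python 3, where ordinary string literals are Unicode and process \u escapes (unlike the Python 2 ur'' literals shown here). The variable name s is illustrative.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2 syntax).
# A \u escape consumes exactly four hex digits; anything after that is
# ordinary text.
s = '\u00d7ff'

print(len(s))                     # 3
print([hex(ord(c)) for c in s])   # ['0xd7', '0x66', '0x66']
```

So the literal holds U+00D7 followed by two separate "f" characters, which is why the character class ends up as a range to 215 plus two literals.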
Removing the leading zeroes (writing \ud7ff) or switching to \U0000d7ff will fix it:
>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
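For readers on Python 3, where the ur'' prefix is not valid syntax, a rough equivalent might look like the sketch below. It assumes Python 3.4 or later (for re.fullmatch on compiled patterns); the helper name in_range is hypothetical and not part of the original answer. Since Python 3.3 the re module also understands \u escapes itself, so the raw-string form compiles to the same range.

```python
import re

# Python 3 sketch: str literals are already Unicode, so no u'' prefix.
pattern = re.compile('[\u0020-\ud7ff]')

# Since Python 3.3 the regex engine parses \uXXXX in str patterns,
# so a raw string gives the same character class.
raw_pattern = re.compile(r'[\u0020-\ud7ff]')


def in_range(ch):
    """Return True if the single character ch lies in U+0020..U+D7FF."""
    return pattern.fullmatch(ch) is not None


print(in_range(' '))        # True  (lower bound, U+0020)
print(in_range('\ud7ff'))   # True  (upper bound, U+D7FF)
print(in_range('\x1f'))     # False (just below the range)
print(in_range('\ue000'))   # False (above the range)
```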
If you're using Python 2.x, you should make sure you're specifying a unicode string (with u'', or the "unicode" built-in):
>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>
Using raw strings (as you are, with r'') gives you the (ASCII) string composed of a backslash, the letter "u", the digit 0, and so on, rather than the single character the escape stands for.
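The difference is easy to see by comparing lengths. This is a small sketch assuming Python 3 syntax (in Python 2 the second literal would need a u'' prefix); the names raw and cooked are illustrative.

```python
# Raw literal: the backslash and the following characters are kept as-is.
raw = r'\u0020'
# Normal (Unicode) literal: the escape is replaced by one character.
cooked = '\u0020'

print(len(raw), list(raw))         # 6 ['\\', 'u', '0', '0', '2', '0']
print(len(cooked), cooked == ' ')  # 1 True
```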