
How do I specify a range of unicode characters

How do I specify a range of unicode characters from ' ' (space) to \u00D7FF ?

I have a regular expression like r'[\u0020-\u00D7FF]' and it won't compile, saying that it's a bad range. I am new to Unicode regular expressions so I haven't had this problem before.

Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?

The syntax of your unicode range will not do what you expect.

  1. The raw r'' string prevents \u escapes from being parsed by Python, and the Python 2 regex engine will not parse them either. The only range in this set is [0-\u]:

     >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
     in
       literal 117
       literal 48
       literal 48
       literal 50
       range (48, 117)
       literal 48
       literal 48
       literal 100
       literal 55
       literal 102
       literal 102
  2. Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so \u00d7ff is parsed as "\u00d7, f, f".

     >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
     in
       range (32, 215)
       literal 102
       literal 102
  3. Removing the leading zeroes (\ud7ff) or switching to \U0000d7ff will fix it:

     >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
     in
       range (32, 55295)
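For comparison, here is a minimal sketch of the same check under Python 3.3 or later. That version is an assumption; the sessions above are Python 2, where the ur'' prefix exists. In Python 3.3+ the re module interprets \uXXXX itself, so a raw string pattern is enough. The names full_range and short_range are only illustrative.

```python
import re

# Assumption: Python 3.3+, where re itself understands \uXXXX escapes,
# so the escapes survive inside a raw string and are decoded by the regex parser.
full_range = re.compile(r'[\u0020-\ud7ff]')

# \u takes exactly four hex digits, so this is the range U+0020..U+00D7
# followed by two literal "f" characters, not a range up to U+D7FF.
short_range = re.compile(r'[\u0020-\u00d7ff]')

sample = '\ud7f0'  # a code point near the top of the intended range

print(full_range.search(sample) is not None)   # True: U+D7F0 is inside the range
print(short_range.search(sample) is not None)  # False: the range stops at U+00D7
print(full_range.search('\x1f') is not None)   # False: U+001F is below the space
```

If the pattern has to run on both Python 2 and older Python 3 releases, building the character class from a non-raw unicode literal (as in the next answer) avoids relying on the regex engine to decode the escapes.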

If you're using Python 2.x, you should make sure you're specifying a unicode string (with u'', or the "unicode" built-in):

>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>

Using raw strings (as you are, with r'') gives you the (ASCII) string composed of a backslash plus the letter "u" plus the digit 0 plus...
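The difference is easy to see by comparing lengths. This is a small sketch assuming Python 3, where every str literal is already a unicode string; in Python 2 the second literal would need the u'' prefix. The variable names are made up for the illustration.

```python
raw = r'\u0020'    # raw literal: backslash, "u", "0", "0", "2", "0"
cooked = '\u0020'  # normal literal: the escape becomes one character, U+0020

print(len(raw), len(cooked))  # 6 1
print(cooked == ' ')          # True
```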
