How do I specify a range of unicode characters

Question

How do I specify a range of unicode characters from \u0020 (space) to \u00D7FF?

I have a regular expression like r'[\u0020-\u00D7FF]' and it won't compile, saying that it's a bad range. I am new to Unicode regular expressions, so I haven't had this problem before.

Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?

Answer 1

The syntax of your unicode range will not do what you expect.

The raw r'' string prevents \u escapes from being parsed, and the regex engine will not parse them either. The only range in this set is [0-\u]:

    >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102

Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so it's parsed as "\u00d7, f, f".
```
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
```
Removing the leading zeroes (writing \ud7ff) or switching to \U0000d7ff will fix it:
```
>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
```
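For readers on Python 3, the same fix can be sketched as follows. This is an assumption beyond the answer above, which targets Python 2: in Python 3 the ur'' prefix is not valid, every str literal is already Unicode, and (from Python 3.3 on) the re module interprets \uXXXX escapes itself, so a raw pattern works. The names three_chars, space_to_d7ff, and in_range are made up for illustration.

```python
import re

# \u takes exactly four hex digits, so the trailing "ff" are separate characters.
three_chars = '\u00d7ff'
assert len(three_chars) == 3 and three_chars[0] == '\xd7'

# Both escape forms name the single code point U+D7FF.
assert '\ud7ff' == '\U0000d7ff' == chr(0xD7FF)

# Python 3.3+ lets the re module parse \uXXXX, so the raw string is fine here.
space_to_d7ff = re.compile(r'[\u0020-\ud7ff]')


def in_range(ch):
    """Return True if the single character ch lies in U+0020..U+D7FF."""
    return space_to_d7ff.fullmatch(ch) is not None


# A space and a CJK character are inside the range; U+E000 and a tab are not.
print(in_range(' '), in_range('\u4e2d'), in_range('\ue000'), in_range('\t'))
```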

Answer 2

If you're using Python 2.x, you should make sure you're specifying a unicode string (with u'', or the "unicode" built-in):

>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>

Using raw strings (as you are, with r'') gives you the (ASCII) string composed of a backslash, the letter "u", the digit 0, and so on.
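A quick way to see the difference is to compare the two literals character by character. This is a minimal sketch assuming Python 3, where a plain '' literal behaves like the u'' literal above; the variable names raw and cooked are only illustrative.

```python
# A raw literal keeps the backslash and every following character as-is.
raw = r'\u0020'
# A normal literal turns the escape into the single character U+0020 (space).
cooked = '\u0020'

assert list(raw) == ['\\', 'u', '0', '0', '2', '0']
assert len(raw) == 6
assert cooked == ' ' and len(cooked) == 1

print(list(raw), repr(cooked))
```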

solution1
27 ACCEPTED 2010-10-01 01:59:37

solution2
5 2010-10-01 01:33:28