Python 2 vs Python 3 Regex matching behavior

Question

Python 3

import re

P = re.compile(r'[\s\t]+') 
re.sub(P, ' ', '\xa0 haha')
' haha'

Python 2

import re

P = re.compile(r'[\s\t]+')
re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'

I want the Python 3 behavior, but in Python 2 code. Why does the regex pattern fail to match space-like codepoints such as \xa0 in Python 2, while it matches them in Python 3?

Answer 1

Use the re.UNICODE flag:

>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'

Without the flag, \s matches only ASCII whitespace; \xa0 (NO-BREAK SPACE) is not part of the ASCII standard (it is a Latin-1 codepoint).

In Python 3, Unicode matching is the default for str patterns, so re.UNICODE is redundant there; use re.ASCII if you want the Python 2 (bytestring) behaviour.
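To illustrate the difference between the two modes, here is a minimal Python 3 sketch. The variable names are only illustrative; it assumes a Python 3 interpreter, where re.ASCII is available:

```python
import re

text = '\xa0 haha'  # NO-BREAK SPACE, ordinary space, then the word

# Default for str patterns in Python 3: \s is Unicode-aware and matches \xa0.
unicode_result = re.sub(r'\s+', ' ', text)

# re.ASCII restricts \s to ASCII whitespace, like Python 2 without re.UNICODE.
ascii_result = re.sub(r'\s+', ' ', text, flags=re.ASCII)

print(repr(unicode_result))  # ' haha'
print(repr(ascii_result))    # '\xa0 haha'
```

In the second call only the ordinary space is matched, so the \xa0 survives, which is the same result the question shows for Python 2.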

Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following matches exactly the same input:

P = re.compile(r'\s+', flags=re.UNICODE)
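If the same code has to run under both interpreters, one possible approach is to pass re.UNICODE unconditionally, since Python 3 still accepts the flag for str patterns. The sketch below assumes that; the helper name collapse_whitespace is hypothetical, and the pattern is written as u'\\s+' because the ur'' prefix is not valid in Python 3:

```python
import re

# re.UNICODE is needed on Python 2 and accepted (but redundant) on Python 3.
WHITESPACE = re.compile(u'\\s+', flags=re.UNICODE)


def collapse_whitespace(text):
    """Replace each run of Unicode whitespace with a single space."""
    return WHITESPACE.sub(u' ', text)


print(repr(collapse_whitespace(u'\xa0 haha')))
```

Under Python 3 this prints ' haha'; under Python 2 the repr carries the u prefix.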

1 answer

solution1
5 ACCEPTED 2015-01-22 12:08:06
