
Python 2 vs Python 3 Regex matching behavior

Python 3

import re

P = re.compile(r'[\s\t]+') 
re.sub(P, ' ', '\xa0 haha')
' haha' 

Python 2

import re

P = re.compile(r'[\s\t]+')
re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'

I want the Python 3 behavior, but in Python 2 code. Why does the regex pattern fail to match whitespace-like code points such as \xa0 in Python 2, yet match them correctly in Python 3?

Use the re.UNICODE flag:

>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'

Without the flag, Python 2 matches only ASCII whitespace; \xa0 (the no-break space) is not part of the ASCII standard (it is a Latin-1 codepoint).

The re.UNICODE flag is the default in Python 3 for str patterns; use re.ASCII if you want the Python 2 (bytestring) behaviour.
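
To see the two modes side by side in Python 3, here is a small sketch. The variable names are only illustrative; re.sub and re.ASCII are the standard library names.

```python
import re

text = '\xa0 haha'  # no-break space, ordinary space, then the word

# Default for str patterns in Python 3: \s is Unicode-aware,
# so the no-break space and the space are matched as one run.
unicode_result = re.sub(r'\s+', ' ', text)

# re.ASCII restricts \s to [ \t\n\r\f\v], like Python 2 without re.UNICODE,
# so only the ordinary space is matched and \xa0 is left in place.
ascii_result = re.sub(r'\s+', ' ', text, flags=re.ASCII)

print(repr(unicode_result))  # ' haha'
print(repr(ascii_result))    # '\xa0 haha'
```

Note that re.ASCII only exists in Python 3, so this snippet is not meant to run under Python 2.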

Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following will match the exact same input:

P = re.compile(r'\s+', flags=re.UNICODE)
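
If the same code has to run on both Python 2 and Python 3, one possible approach is to always pass re.UNICODE, which is required on Python 2 and redundant but harmless for str patterns on Python 3. The helper name collapse_whitespace and the sample strings below are made up for illustration; the loop just checks that the shorter pattern gives the same result as the original one on those samples.

```python
import re

# re.UNICODE is needed on Python 2 and is already the default for
# str patterns on Python 3, so passing it keeps both versions aligned.
WHITESPACE = re.compile(r'\s+', flags=re.UNICODE)
ORIGINAL_PATTERN = re.compile(r'[\s\t]+', flags=re.UNICODE)


def collapse_whitespace(text):
    """Replace each run of whitespace in a unicode string with one space."""
    return WHITESPACE.sub(u' ', text)


# Hypothetical samples: no-break space, tabs, and an em space (U+2003).
samples = [u'\xa0 haha', u'a\t\tb', u'x\u2003\n y']
for sample in samples:
    # The explicit \t in the character class changes nothing.
    assert collapse_whitespace(sample) == ORIGINAL_PATTERN.sub(u' ', sample)

print(repr(collapse_whitespace(u'\xa0 haha')))
```

The u'' prefixes are accepted by Python 2 and by Python 3.3 and later, so the literals mean the same thing on both.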
