Python 3
import re
P = re.compile(r'[\s\t]+')
re.sub(P, ' ', '\xa0 haha')
' haha'
Python 2
import re
P = re.compile(r'[\s\t]+')
re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'
I want the Python 3 behavior, but in Python 2 code. Why does the regex pattern fail to match whitespace-like codepoints such as \xa0 in Python 2 when it matches them in Python 3?
Use the re.UNICODE flag:
>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'
Without the flag, only ASCII whitespace is matched; \xa0 (the no-break space) is not part of the ASCII standard (it is a Latin-1 codepoint).
Unicode matching is the default for str patterns in Python 3, so re.UNICODE is redundant there; use re.ASCII if you want the Python 2 (bytestring) behaviour.
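To see the difference the other way round, here is a minimal Python 3 sketch that runs the same substitution with and without re.ASCII. The variable names are only for illustration.

```python
import re

text = '\xa0 haha'

# Python 3 str patterns are Unicode-aware by default, so \s matches U+00A0
# and the no-break space plus the following space collapse into one space.
unicode_result = re.sub(r'\s+', ' ', text)

# re.ASCII restricts \s to ASCII whitespace, so U+00A0 is left in place,
# which mirrors the Python 2 result from the question.
ascii_result = re.sub(r'\s+', ' ', text, flags=re.ASCII)

print(repr(unicode_result))  # ' haha'
print(repr(ascii_result))    # '\xa0 haha'
```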
Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following will match the exact same input:
P = re.compile(r'\s+', flags=re.UNICODE)
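If the same code has to run on both Python 2 and Python 3, one possible approach is to wrap the pattern in a small helper. The sketch below assumes a hypothetical function name, collapse_whitespace, and assumes the input is always a unicode string (u'' literals are accepted by Python 2 and by Python 3.3+).

```python
import re

# re.UNICODE is required on Python 2 and harmless (redundant) on Python 3.
_WHITESPACE = re.compile(r'\s+', flags=re.UNICODE)


def collapse_whitespace(text):
    """Replace each run of Unicode whitespace in text with one ASCII space.

    Hypothetical helper for illustration; expects a unicode string.
    """
    return _WHITESPACE.sub(u' ', text)


print(repr(collapse_whitespace(u'\xa0 haha')))
```

On either version the no-break space and the regular space that follows it are treated as a single run of whitespace and replaced together.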