Environment: Python 3.6. I have some UTF-8 encoded text like this:
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
I want to find only the elements in which three digits or letters follow b'\xf0\x9f\x98\' (this prefix identifies the facial-expression emojis).
I tried this:
if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)
but it doesn't work, and when I print the pattern it comes out as b'\xf0\x9f\x98\\[a-zA-Z0-9]{3}$', with an extra \ inserted automatically. Is there any way around this? Thanks.
I can see two problems with your search:
First, you're trying to match against the textual representation of the bytes (\xXX represents a single byte in hexadecimal). What you actually should be doing is matching against its content (the actual bytes).
Second, you're including the end-of-string anchor ($) in your search, where you're probably interested in its occurrence anywhere in the string. Something like the following should work, though it is brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.
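Assuming you're dealing only with valid UTF-8, this should, to the best of my knowledge, give unambiguous matches (i.e. you don't have to worry about this prefix appearing in the middle of a longer sequence), because the byte \xf0 only ever starts a 4-byte sequence and never appears as a continuation byte.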
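As a minimal sketch of the byte-level approach described above, the snippet below tightens the wildcard to the UTF-8 continuation-byte range \x80-\xbf, so the match cannot swallow an unrelated byte. The names FACE_BYTES and found are illustrative assumptions, not part of the original question.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Three fixed prefix bytes, then exactly one UTF-8 continuation byte.
# FACE_BYTES is a hypothetical name chosen for this example.
FACE_BYTES = re.compile(b"\xf0\x9f\x98[\x80-\xbf]")

match = FACE_BYTES.search(text_utf8)
found = match.group(0) if match else None

if found is not None:
    # Decode only the matched bytes to get the emoji as a str.
    print(found, found.decode("utf-8"))
```

No anchor is used, so the sequence is found wherever it occurs in the bytes object.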
A more robust solution, if you have the option of third-party modules, would be to install the regex module (pip install regex) and use the following:
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
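This has the advantage of being more readable and explicit, while probably also being more future-proof. Be aware that the Emoji property is broader than facial expressions: it also covers characters such as the digits 0-9, # and *, so another Unicode property or an explicit range might fit your use case better.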
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.
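If a third-party module is not an option, the same idea can be sketched with the standard re module on a decoded str. The byte prefix f0 9f 98 corresponds to the code points U+1F600 through U+1F63F, so an explicit character range covers exactly those characters. The names FACE_CHARS and faces are assumptions made for this example.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
text = text_utf8.decode("utf-8")

# U+1F600..U+1F63F are the code points whose UTF-8 encoding
# starts with the bytes f0 9f 98. FACE_CHARS is a hypothetical name.
FACE_CHARS = re.compile("[\U0001F600-\U0001F63F]")

faces = FACE_CHARS.findall(text)
print(faces)
```

Working on str keeps the pattern independent of the encoding, at the cost of hard-coding a code point range rather than a Unicode property.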