How can I extract only emoji from UTF-8 with regex in Python?

Environment: Python 3.6. I have some UTF-8 encoded text like this:

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

I want to find only the elements where three digits or letters follow b'\xf0\x9f\x98\'; this prefix indicates the facial-expression emojis.

I tried this

if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)

but it doesn't work, and when I print the pattern it comes out as b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}', with an extra \ inserted automatically. Is there any way around this? Thanks.

I can see two problems with your search:

  1. You are trying to match the textual representation of the UTF-8 string (each \xXX represents a single byte in hexadecimal). What you should actually be doing is matching against its content (the actual bytes).
  2. You are including the end-of-string marker ( $ ) in your pattern, whereas you are probably interested in occurrences anywhere in the string.
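
To make the first point concrete, here is a small sketch that inspects the bytes of the question's text_utf8. The variable name fourth_byte is only illustrative.

```python
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Each \xXX escape in a bytes literal is ONE byte, not four characters.
print(len(text_utf8))        # total number of bytes
print(list(text_utf8[:4]))   # the four bytes that encode the emoji

# The byte after the \xf0\x9f\x98 prefix is 0x80. It is not an ASCII
# letter or digit, so a class like [a-zA-Z0-9] cannot match it.
fourth_byte = text_utf8[3]
print(hex(fourth_byte))
```

The emoji therefore occupies exactly four bytes, and the part after the prefix is a single byte rather than three alphanumeric characters.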

Something like the following should work, though it is brittle (see below for a more robust solution):

re.search(b'\xf0\x9f\x98.', text_utf8)

This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98 .

Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, produce unambiguous matches (i.e. you don't have to worry about this prefix appearing in the middle of a longer sequence), because in valid UTF-8 the byte \xf0 only ever appears as the first byte of a 4-byte sequence.
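
If you want every occurrence rather than just the first, a slightly stricter variant of the byte-level approach might look like the sketch below. It assumes valid UTF-8 input and swaps the `.` for an explicit continuation-byte range; FACE_PATTERN and faces are names made up for this example.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# The prefix \xf0\x9f\x98 followed by exactly one UTF-8 continuation
# byte (0x80-0xBF). Unlike '.', this class also rejects stray bytes
# that could follow the prefix in malformed input.
FACE_PATTERN = re.compile(rb"\xf0\x9f\x98[\x80-\xbf]")

matches = FACE_PATTERN.findall(text_utf8)

# Each match is a complete 4-byte sequence, so it decodes cleanly.
faces = [m.decode("utf-8") for m in matches]
print(faces)
```

Note that this prefix only covers code points U+1F600 to U+1F63F; emoji outside that range are encoded with different bytes and will not match.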


A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:

import regex
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))

This has the advantage of being more readable and explicit, while probably also being more future-proof. (The regex module supports more Unicode properties that might help in your use case.)

Note that in this case you are also working with the text as an actual Unicode string ( str in Python 3) rather than as a byte string, which might have other advantages, depending on the rest of your code.
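
As an illustration of working on the decoded string, here is a minimal sketch using only the standard-library re module. It is not equivalent to the Emoji property above: it matches only the code points U+1F600 to U+1F63F, which is the range that the \xf0\x9f\x98 byte prefix corresponds to. FACE_RANGE is a name made up for this example.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Decode once, then work with code points instead of raw bytes.
text = text_utf8.decode("utf-8")

# U+1F600..U+1F63F are the code points whose UTF-8 encoding starts
# with the bytes f0 9f 98.
FACE_RANGE = re.compile("[\U0001F600-\U0001F63F]")

faces = FACE_RANGE.findall(text)
print(len(text), faces)
```

Working on str means one emoji is one element of the string, so slicing and counting behave as you would expect.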


 