How can I extract only emoji from UTF-8 with regex in Python?

Question

Environment: Python 3.6. I have UTF-8 encoded text like this:

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

I want to match only the elements in which three letters or digits follow b'\\xf0\\x9f\\x98\\'; this prefix indicates the facial-expression emojis.

I tried this

if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)

It doesn't work, and when I print the pattern it comes out as b'\\xf0\\x9f\\x98\\\\[a-zA-Z1-9]{3}', with an extra \\ inserted automatically. Is there any way around this? Thanks.

Answer 1

I can see two problems with your search:

You are trying to match the textual representation of the UTF-8 string (each \\xXX represents one byte in hexadecimal). What you should actually be matching is its content (the actual bytes).
You are including the end-of-string anchor ( $ ) in your pattern, whereas you are probably interested in occurrences anywhere in the string.
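To make the first point concrete, here is a small sketch using only the standard library. The variable names are illustrative and not from the question. It shows that the emoji occupies four raw bytes, that the last of them is not a letter or digit, and that the attempted pattern (written here with the backslash doubled so the bytes literal is valid) finds nothing.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Each \xNN escape in a bytes literal stands for ONE byte, so the emoji
# occupies four bytes rather than sixteen characters of text.
emoji_bytes = text_utf8[:4]
emoji_char = emoji_bytes.decode("utf-8")  # U+1F600

# The fourth byte is a raw value (0x80 here), not a letter or digit,
# so a class such as [a-zA-Z0-9] can never match it.
fourth_byte = emoji_bytes[3]

# The question's pattern, with the backslash doubled. The "\[" makes the
# regex look for a literal "[" byte instead of opening a character class.
attempt = re.search(b"\xf0\x9f\x98\\[a-zA-Z0-9]{3}$", text_utf8)

print(len(emoji_bytes), hex(fourth_byte), attempt)
```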

Something like the following should work, though it is brittle (see below for a more robust solution):

re.search(b'\xf0\x9f\x98.', text_utf8)

This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \\xf0\\x9f\\x98 .

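Assuming you're dealing only with valid UTF-8, this should, to the best of my knowledge, give unambiguous matches (i.e. you don't have to worry about this prefix appearing in the middle of a longer sequence). In UTF-8, the byte \\xf0 only ever appears as the first byte of a 4-byte sequence, and continuation bytes always fall in the range \\x80 to \\xbf.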
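If you want every occurrence rather than just the first, a variation on the byte-level pattern might look like the sketch below. The names FACE_PATTERN and find_face_emoji are made up for illustration. It restricts the final byte to the continuation range instead of using a dot. Note that this prefix only covers U+1F600 through U+1F63F; the remainder of the Emoticons block (U+1F640 to U+1F64F) starts with \xf0\x9f\x99 and is not matched.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Lead bytes F0 9F 98 followed by exactly one continuation byte (0x80-0xBF).
# This corresponds to the code points U+1F600 through U+1F63F.
FACE_PATTERN = re.compile(b"\xf0\x9f\x98[\x80-\xbf]")


def find_face_emoji(data):
    """Return every matching 4-byte sequence in data, decoded to str."""
    return [match.decode("utf-8") for match in FACE_PATTERN.findall(data)]


faces = find_face_emoji(text_utf8)
print(faces)
```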

A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:

import regex  # third-party module: pip install regex
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))

This has the advantage of being more readable and explicit, and is probably also more future-proof. (The regex module supports other Unicode properties that might help in your use case.)

Note that in this case you can also treat the text as an actual Unicode string ( str in Python 3) rather than as a byte string, which might have other advantages, depending on the rest of your code.
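If installing a third-party module is not an option, working on the decoded str also makes a standard-library version possible. The following is a minimal sketch, assuming you only care about the Unicode Emoticons block (U+1F600 to U+1F64F); the name EMOTICONS is illustrative. Unlike \p{Emoji}, it will not pick up emoji from other blocks.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Decode once, then match on code points instead of raw bytes.
text = text_utf8.decode("utf-8")

# Assumption: "facial expression emojis" means the Emoticons block only.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")

found = EMOTICONS.findall(text)
print(found)
```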
