Python - Reading Emoji Unicode Characters

Question

I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:

u'that\u2019s \U0001f63b'

The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.

When I print the text, character by character, using:

s = u'that\u2019s \U0001f63b'

for c in s:
    print c.encode('unicode_escape')

The program produces the following output:

t
h
a
t
\u2019
s

\ud83d
\ude3b

How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?

Answer 1

I don't think you're using encode correctly, nor do you need to. What you have is a valid Unicode string with one 4-digit (\u) and one 8-digit (\U) escape sequence. Try this in the REPL on, say, OS X:

>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻

In Python 3, though:

Python 3.4.3 (default, Jul  7 2015, 15:40:07) 
>>> s  = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
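
As a rough illustration of where the 0001 comes from (a sketch that assumes Python 3.3 or later, where each element of a str is a full code point): the emoji's code point is U+1F63B, and the \U escape always takes exactly eight hex digits, so it is written zero-padded as \U0001f63b.

```python
# Assumes Python 3.3+, where indexing a str yields whole code points.
s = u'that\u2019s \U0001f63b'

emoji = s[-1]
code_point = ord(emoji)

# \uXXXX takes exactly 4 hex digits and \UXXXXXXXX exactly 8,
# so U+1F63B has to be zero-padded to 0001f63b in the escape.
print(hex(code_point))                  # 0x1f63b
print('\\U{:08x}'.format(code_point))   # \U0001f63b
print(len(s))                           # 8
```

So there is nothing to strip: the 0001 is simply the upper digits of the code point.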

Answer 2

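Your remaining confusion is likely due to the fact that you are running what is called a "narrow Python build". In a narrow build, each element of a unicode string is a 16-bit code unit, which cannot hold a code point above U+FFFF such as this emoji. The emoji is therefore stored as two elements, a UTF-16 surrogate pair (the \ud83d and \ude3b in your output). The best solution would be to move to Python 3 (3.3 or later, where strings are always indexed by code point). Otherwise, try to process the UTF-16 surrogate pair yourself.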
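
If moving to Python 3 is not an option, one way to process the surrogate pair might look like the sketch below. The helper names (is_narrow_build, combine_surrogates, code_points) are made up for illustration and are not part of any library; the arithmetic is the standard UTF-16 decoding rule, and sys.maxunicode is the usual way to tell a narrow build from a wide one.

```python
import sys

def is_narrow_build():
    # Narrow Python 2 builds report 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair (as ints) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def code_points(text):
    """Yield integer code points, joining surrogate pairs where present."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield combine_surrogates(unit, nxt)
                i += 2
                continue
        # Ordinary character (or an unpaired surrogate): pass it through.
        yield unit
        i += 1

s = u'that\u2019s \U0001f63b'
print([hex(cp) for cp in code_points(s)][-1])   # 0x1f63b
```

On a narrow build the last two elements of s are \ud83d and \ude3b, and code_points joins them back into 0x1f63b; on a wide build or Python 3.3+ the emoji is already a single element and passes through unchanged.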
