How to display/convert a string of utf-8 to the proper symbol

Question

I have a list that has WhatsApp emoticons encoded as utf-8 characters. The table I am using to decode the emoticons is at http://apps.timwhitlock.info/emoji/tables/unicode

With this table I am trying to count the number of emoticons used, which I have successfully done using regex techniques. The problem is that I have built a dictionary whose keys are the UTF-8 byte sequences written out as escaped strings (literal backslashes, not the bytes themselves) and whose values are integer counts. The following code:

print d_emo
for k, v in d_emo.items():
    print k.encode('utf8'), v

produces this output:

{'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
\xF0\x9F\x98\xA2 2
\xF0\x9F\x98\x82 1
\xF0\x9F\x98\x86 2
\xF0\x9F\x98\x89 1
\xF0\x9F\x8D\xB5 2
\xF0\x9F\x8D\xB0 4
\xF0\x9F\x8D\xAB 2
\xF0\x9F\x8D\xA9 2
\xF0\x9F\x98\x98 1
\xE2\x98\xBA 33
\xE2\x98\x95 1

If I use this code:

for k, v in d_emo.items():
    print k.encode('utf-8').decode('unicode_escape'), v

I get

ð¢ 2
ð 1
ð 2
ð 1
ðµ 2
ð° 4
ð« 2
ð© 2
ð 1
âº 33
â 1

I should be getting smiley faces and the like. Any suggestions? This is in Python 2.7.

Answer 1

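This will decode the Unicode characters correctly, but in Python 2.X you are somewhat limited when using characters outside the BMP (Basic Multilingual Plane, characters U+0000 to U+FFFF). For example, the unicodedata module in Python 2.7 is based on Unicode 5.2, which predates these emoji, so it has no names for them: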

import unicodedata as ud
D = {'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
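for k,v in D.iteritems():
    # unicode-escape turns each literal \xNN into the code point U+00NN,
    # latin1 maps those code points back to raw bytes,
    # and utf8 decodes the bytes into the actual character.
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except ValueError:
        n = 'no such name'
    print k,repr(k),n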

Output:

☺ u'\u263a' WHITE SMILING FACE
🍩 u'\U0001f369' no such name
☕ u'\u2615' HOT BEVERAGE
😂 u'\U0001f602' no such name
🍫 u'\U0001f36b' no such name
😢 u'\U0001f622' no such name
😉 u'\U0001f609' no such name
😘 u'\U0001f618' no such name
😆 u'\U0001f606' no such name
🍵 u'\U0001f375' no such name
🍰 u'\U0001f370' no such name
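
To see why all three steps are needed, and why the attempt in the question stops at characters like ð, it can help to look at the intermediate values. This is only a sketch in Python 3 syntax; the variable names are made up for illustration:

```python
# One key from the dictionary: 16 ASCII bytes (backslash, x, F, 0, ...),
# not the 4 UTF-8 bytes they describe.
escaped = b'\\xF0\\x9F\\x98\\xA2'

# Step 1: interpret the escapes. Each \xNN becomes the code point U+00NN,
# so this yields four Latin-1 characters. The question's code stops here,
# which is why it prints mojibake.
step1 = escaped.decode('unicode-escape')

# Step 2: latin1 maps code points U+0000..U+00FF one-to-one onto bytes,
# recovering the raw UTF-8 byte sequence.
raw = step1.encode('latin1')

# Step 3: decode the raw bytes as UTF-8 to get the single real character.
char = raw.decode('utf8')

print(len(escaped), len(step1), len(raw), len(char))  # 16 4 4 1
print(ascii(step1), raw, ascii(char))
```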

It comes out better in Python 3.X:

import unicodedata as ud
D = {b'\\xF0\\x9F\\x98\\xA2': 2, b'\\xF0\\x9F\\x98\\x82': 1, b'\\xF0\\x9F\\x98\\x86': 2, b'\\xF0\\x9F\\x98\\x89': 1, b'\\xF0\\x9F\\x8D\\xB5': 2, b'\\xF0\\x9F\\x8D\\xB0': 4, b'\\xF0\\x9F\\x8D\\xAB': 2, b'\\xF0\\x9F\\x8D\\xA9': 2, b'\\xF0\\x9F\\x98\\x98': 1, b'\\xE2\\x98\\xBA': 33, b'\\xE2\\x98\\x95': 1}
for k,v in D.items():
    # Same chain as above: escapes -> Latin-1 code points -> raw bytes -> UTF-8 text.
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except ValueError:
        n = 'no such name'
    print(k,ascii(k),n)

Output (note your font has to support the characters):

😘 '\U0001f618' FACE THROWING A KISS
🍰 '\U0001f370' SHORTCAKE
😢 '\U0001f622' CRYING FACE
🍫 '\U0001f36b' CHOCOLATE BAR
🍵 '\U0001f375' TEACUP WITHOUT HANDLE
🍩 '\U0001f369' DOUGHNUT
😂 '\U0001f602' FACE WITH TEARS OF JOY
😉 '\U0001f609' WINKING FACE
☕ '\u2615' HOT BEVERAGE
😆 '\U0001f606' SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES
☺ '\u263a' WHITE SMILING FACE
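
If the counts are needed again after decoding, one option is to rebuild the dictionary with the real characters as keys. The following is a minimal Python 3 sketch, not part of the original answer; unescape_utf8 and decode_keys are hypothetical helper names, and it assumes every key is a well-formed escaped UTF-8 sequence like the ones in the question:

```python
import unicodedata
from collections import Counter

def unescape_utf8(escaped):
    # Hypothetical helper: accepts the escaped form as str or bytes
    # and returns the character it describes.
    if isinstance(escaped, str):
        escaped = escaped.encode('ascii')  # the escaped form is plain ASCII
    return escaped.decode('unicode-escape').encode('latin1').decode('utf8')

def decode_keys(counts):
    # Hypothetical helper: re-key the counts by the decoded character,
    # summing counts if two escaped keys decode to the same character.
    decoded = Counter()
    for key, n in counts.items():
        decoded[unescape_utf8(key)] += n
    return decoded

# A few entries from the question's dictionary.
d_emo = {'\\xF0\\x9F\\x98\\xA2': 2, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
decoded = decode_keys(d_emo)

for ch, n in decoded.most_common():
    # ascii() keeps the output printable on any terminal.
    print(ascii(ch), unicodedata.name(ch, 'no such name'), n)
```

Printing ch itself instead of ascii(ch) shows the symbol, provided the terminal encoding and font support it.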
