How to display/convert a string of utf-8 to the proper symbol

I have a list that has WhatsApp emoticons encoded as utf-8 characters. The table I am using to decode the emoticons is at http://apps.timwhitlock.info/emoji/tables/unicode

With this table I am trying to count the number of emoticons used, which I have successfully done with regular expressions. The problem is that I have built a dictionary whose keys are the UTF-8 byte sequences written out as escaped strings and whose values are integer counts. The following code:

print d_emo
for k, v in d_emo.items():
    print k.encode('utf8'), v

produces this output:

{'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
\xF0\x9F\x98\xA2 2
\xF0\x9F\x98\x82 1
\xF0\x9F\x98\x86 2
\xF0\x9F\x98\x89 1
\xF0\x9F\x8D\xB5 2
\xF0\x9F\x8D\xB0 4
\xF0\x9F\x8D\xAB 2
\xF0\x9F\x8D\xA9 2
\xF0\x9F\x98\x98 1
\xE2\x98\xBA 33
\xE2\x98\x95 1

If I use this code:

for k, v in d_emo.items():
    print k.encode('utf-8').decode('unicode_escape'), v

I get

ð¢ 2
ð 1
ð 2
ð 1
ðµ 2
ð° 4
ð« 2
ð© 2
ð 1
⺠33
â 1

I should be getting smiley faces and the like. Any suggestions? This is in Python 2.7.

This will decode the Unicode characters correctly, but Python 2.X is somewhat limited when using characters outside the BMP (Basic Multilingual Plane, characters U+0000 to U+FFFF); in the output below, for example, unicodedata cannot name them:

import unicodedata as ud
D = {'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
for k,v in D.iteritems():
    # \xHH text -> code points 0-255 -> the same values as raw bytes -> UTF-8 decoded character
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except ValueError:
        n = 'no such name'
    print k,repr(k),n

Output:

☺ u'\u263a' WHITE SMILING FACE
🍩 u'\U0001f369' no such name
☕ u'\u2615' HOT BEVERAGE
😂 u'\U0001f602' no such name
🍫 u'\U0001f36b' no such name
😢 u'\U0001f622' no such name
😉 u'\U0001f609' no such name
😘 u'\U0001f618' no such name
😆 u'\U0001f606' no such name
🍵 u'\U0001f375' no such name
🍰 u'\U0001f370' no such name
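
The one-line conversion in the loop above packs three steps together. As a rough illustration, here they are one at a time for a single key. This is a sketch written for Python 3, and the names `key`, `stage1`, `stage2` and `emoji` are made up for the example:

```python
# One escaped key from the question's dictionary: 16 characters of plain text
# (backslash, x, F, 0, ...), not 4 bytes.
key = '\\xF0\\x9F\\x98\\xA2'

# Stage 1: interpret the \xHH escapes. Each one becomes a single code point
# in the range U+0000..U+00FF, so the result is still not the emoji.
stage1 = key.encode('ascii').decode('unicode_escape')

# Stage 2: latin-1 maps code points 0..255 to identical byte values,
# which recovers the raw UTF-8 bytes.
stage2 = stage1.encode('latin1')

# Stage 3: decode those bytes as UTF-8 to get the real character.
emoji = stage2.decode('utf8')

print(len(key), ascii(stage1), stage2, ascii(emoji))
```

The value of `stage1` is the same kind of mojibake that the question's second loop printed, which suggests that attempt stopped after the first step.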

It comes out better in Python 3.X, where unicodedata also knows the names of these characters:

import unicodedata as ud
D = {b'\\xF0\\x9F\\x98\\xA2': 2, b'\\xF0\\x9F\\x98\\x82': 1, b'\\xF0\\x9F\\x98\\x86': 2, b'\\xF0\\x9F\\x98\\x89': 1, b'\\xF0\\x9F\\x8D\\xB5': 2, b'\\xF0\\x9F\\x8D\\xB0': 4, b'\\xF0\\x9F\\x8D\\xAB': 2, b'\\xF0\\x9F\\x8D\\xA9': 2, b'\\xF0\\x9F\\x98\\x98': 1, b'\\xE2\\x98\\xBA': 33, b'\\xE2\\x98\\x95': 1}
for k,v in D.items():
    # \xHH text -> code points 0-255 -> the same values as raw bytes -> UTF-8 decoded character
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except ValueError:
        n = 'no such name'
    print(k,ascii(k),n)

Output (note your font has to support the characters):

😘 '\U0001f618' FACE THROWING A KISS
🍰 '\U0001f370' SHORTCAKE
😢 '\U0001f622' CRYING FACE
🍫 '\U0001f36b' CHOCOLATE BAR
🍵 '\U0001f375' TEACUP WITHOUT HANDLE
🍩 '\U0001f369' DOUGHNUT
😂 '\U0001f602' FACE WITH TEARS OF JOY
😉 '\U0001f609' WINKING FACE
☕ '\u2615' HOT BEVERAGE
😆 '\U0001f606' SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES
☺ '\u263a' WHITE SMILING FACE
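
If the keys never contain anything other than \xHH escapes, as in the question's dictionary, another option is to turn the hex digits into bytes directly and rebuild the dictionary with decoded keys. This is a minimal sketch for Python 3, not part of the answer above. `decode_key` is a hypothetical helper name, and the sample dictionary is a subset of the question's data:

```python
def decode_key(escaped):
    """Convert text such as '\\xE2\\x98\\xBA' into the character it encodes.

    Assumes the input consists solely of \\xHH escapes.
    """
    hex_digits = escaped.replace('\\x', '')   # e.g. 'E298BA'
    return bytes.fromhex(hex_digits).decode('utf8')

# Subset of the question's dictionary, keys still in escaped form.
d_emo = {'\\xF0\\x9F\\x98\\xA2': 2, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}

# New dictionary keyed by the actual characters, counts unchanged.
decoded = {decode_key(k): v for k, v in d_emo.items()}

# Print the most frequently used emoticons first.
for char, count in sorted(decoded.items(), key=lambda kv: -kv[1]):
    print(char, ascii(char), count)
```

Decoding the keys once up front keeps the later counting and printing code free of any encoding juggling.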
