
Why are some emojis not converted back into their original representation?

I am working on an emoji detection module. For some emojis I am observing weird behavior: after encoding them to UTF-8 and decoding them again, they are not converted back to their original representation. I need the actual colored emoji to be sent in the API response instead of a Unicode-escaped string. Any leads?

In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺" 

In [2]: x.encode('utf8')                                                                                                                                                                                                          
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'

In [3]: x.encode('utf8').decode('utf8')                                                                                                                                                                                           
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'

In [4]: print( x.encode('utf8').decode('utf8')  )                                                                                                                                                                                 
example1: 🤭 and example2: 😁 and example3: 🥺


Update 1: This example should make the problem clearer. Here, two of the emojis are rendered when I send the Unicode-escaped string, but the third example fails to convert to the exact emoji. What should I do in such a case?

[Images: API view code; API response using Postman]

'\U0001f92d' == '🤭' is True. It is an escape code, but it is still the same character: these are two ways of displaying or entering it. The former is the repr() of the string, while printing calls str(). Example:

>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭

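When Python generates the repr(), it uses an escape code for any character it considers non-printable (the same rule that str.isprintable() applies), which includes code points that its Unicode database does not define. The content of the string is still the same: the same Unicode code point.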

It's a debugging feature. For example, is the whitespace made of spaces or tabs? The repr() of the string makes it clear by using \t as an escape code.

>>> s = 'a\tb'
>>> print(s)
a       b
>>> s
'a\tb'
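To check that the escaped form and the emoji really are the same string, a small sketch like the following can be run. The variable names are only illustrative; the point is that the comparison, the code point, and the UTF-8 round trip do not depend on how the interpreter chooses to display the string.

```python
# Compare the escape-code spelling with the literal emoji.
escaped = "\U0001f92d"   # escape-code spelling of U+1F92D
literal = "🤭"           # the same code point typed directly

same_string = escaped == literal      # both are the one-character string U+1F92D
code_point = hex(ord(literal))        # numeric value of that code point
utf8_round_trip = literal.encode("utf8").decode("utf8") == literal

print(same_string, code_point, utf8_round_trip)   # True 0x1f92d True
```

Only the repr() shown at the interactive prompt differs between Python versions; the data handed to an encoder or written to a response is identical.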

As to why an escape code is used for one emoji and not another, it depends on the version of Unicode supported by the version of Python used.

Python 3.6 uses Unicode 9.0, and the emoji that come back as escape codes (🤭 and 🥺) aren't defined at that version level:

>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
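As a rough way to see which characters a given interpreter will escape, the sketch below reports the Unicode database version together with each emoji's name, general category, and isprintable() result. The helper name describe is hypothetical, and the output depends on the Python version: on an older database an undefined code point should show no name, category Cn (unassigned), and isprintable() returning False.

```python
import unicodedata as ud


def describe(ch):
    """Return (name or None, general category, isprintable) for one character."""
    # ud.name() raises ValueError for unnamed characters unless a default is given.
    return ud.name(ch, None), ud.category(ch), ch.isprintable()


print("Unicode database version:", ud.unidata_version)
for ch in "\U0001f601\U0001f92d\U0001f97a":
    name, category, printable = describe(ch)
    # repr() keeps the character as-is only when isprintable() is True.
    print(hex(ord(ch)), name, category, printable)
```

If all three lines report True for isprintable(), the interpreter's repr() will show the emoji themselves rather than \U escapes.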
