Converting emojis to Unicode and vice versa in Python 3

I am trying to convert an emoji into its Unicode code point in Python 3. For example, given the emoji 😀, I would like to get the corresponding code point 'U+1F600'. Similarly, I would like to convert 'U+1F600' back to 😀. I have read the documentation and tried several options, but Python's behaviour confuses me here.

>>> x = '😀'
>>> y = x.encode('utf-8')
>>> y
b'\xf0\x9f\x98\x80'

The emoji is converted to a bytes object.

>>> z = y.decode('utf-8')
>>> z
'😀'

This converts the bytes object back to the emoji. So far so good.

Now, taking the Unicode escape for the emoji:

>>> c = '\U0001F600'
>>> d = c.encode('utf-8')
>>> d
b'\xf0\x9f\x98\x80'

This prints out the byte encoding again.

>>> d.decode('utf-8')
'😀'

This prints the emoji again. I really can't figure out how to convert directly between the Unicode code point and the emoji.

'😀' is already a Unicode object. UTF-8 is not Unicode; it's a byte encoding for Unicode. To get the code point number of a Unicode character, you can use the ord function. To print it in the form you want, you can format it as hex, like this:

s = '😀'
print('U+{:X}'.format(ord(s)))

Output:

U+1F600
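To make the distinction between a code point and its byte encodings concrete, here is a small sketch that prints the same character under three standard encodings. The variable names are only illustrative. The code point stays the same; only the bytes differ.

```python
s = '😀'

# The code point is a single number, independent of any encoding.
code_point = ord(s)
print(f'U+{code_point:X}')

# Each encoding serialises that one number into different bytes.
utf8_hex = s.encode('utf-8').hex()
utf16_hex = s.encode('utf-16-be').hex()   # a surrogate pair: d83d de00
utf32_hex = s.encode('utf-32-be').hex()   # the code point, zero-padded
print(utf8_hex, utf16_hex, utf32_hex)
```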

If you have Python 3.6+, you can make the formatting even shorter (and more efficient) by using an f-string:

s = '😀'
print(f'U+{ord(s):X}')
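The question also asks for the reverse direction, from 'U+1F600' back to 😀. One way to do that, sketched here, is to parse the hex digits with int and pass the result to chr. The helper names to_codepoints and from_codepoints are made up for this example, and the parser assumes every token starts with 'U+'. Looping over the string also covers emojis made of more than one code point, where calling ord on the whole string would raise a TypeError.

```python
def to_codepoints(text: str) -> str:
    """Return 'U+XXXX' notation for every code point in text, space-separated."""
    return ' '.join(f'U+{ord(ch):04X}' for ch in text)


def from_codepoints(notation: str) -> str:
    """Parse space-separated 'U+XXXX' tokens back into a string."""
    # token[2:] drops the 'U+' prefix (assumed present on every token).
    return ''.join(chr(int(token[2:], 16)) for token in notation.split())


print(to_codepoints('😀'))
print(from_codepoints('U+1F600'))

# A thumbs-up with a skin-tone modifier is two code points.
print(to_codepoints('\U0001F44D\U0001F3FD'))
```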

BTW, if you want to create a Unicode escape sequence like '\\U0001F600', there's the 'unicode-escape' codec. However, it returns a bytes string, and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since the result is guaranteed to contain only valid ASCII.

s = '😀'
print(s.encode('unicode-escape'))
print(s.encode('unicode-escape').decode('ASCII'))

Output:

b'\\U0001f600'
\U0001f600

I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

sentence = "Head-Up Displays (HUD)💻 for #automotive🚗 sector\n \nThe #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… "
print("normal sentence - ", sentence)

uc_sentence = sentence.encode('unicode-escape')
print("\n\nunicode represented sentence - ", uc_sentence)

decoded_sentence = uc_sentence.decode('unicode-escape')
print("\n\ndecoded sentence - ", decoded_sentence)

Output:

normal sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… 


unicode represented sentence -  b'Head-Up Displays (HUD)\\U0001f4bb for #automotive\\U0001f697 sector\\n \\nThe #UK-based #startup\\U0001f680 Envisics got \\u20ac42 million #funding\\U0001f4b0 from l\\u2026 '


decoded sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… 
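One caveat: the round trip above works because the escaped bytes are pure ASCII. The 'unicode-escape' decoder reads any non-escape byte as Latin-1, so applying it to UTF-8 bytes mangles non-ASCII characters. If you start from a str that mixes literal escapes with real non-ASCII text, a possible workaround is sketched below. The helper name unescape is hypothetical, and the approach assumes the input holds only valid escape sequences.

```python
def unescape(text: str) -> str:
    """Expand literal escapes such as \\U0001f600 found inside a str."""
    # 'backslashreplace' turns characters outside Latin-1 into escapes and
    # leaves existing backslashes alone, so the decoder only sees bytes it
    # interprets correctly.
    return text.encode('latin-1', 'backslashreplace').decode('unicode-escape')


# Decoding UTF-8 bytes with 'unicode-escape' mangles non-ASCII characters.
mangled = 'é'.encode('utf-8').decode('unicode-escape')
print(mangled)

# The raw string holds a literal backslash escape next to real characters.
restored = unescape(r'\U0001f600 é €')
print(restored)
```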
