Convert escaped Unicode to emoji in Python

Question

I am trying to convert escaped Unicode into emoji.

Example:

>>> emoji = "😀"
>>> emoji_text = "\\ud83d\\ude00"
>>> print(emoji)
😀
>>> print(emoji_text)
\ud83d\ude00

Instead of "\ud83d\ude00", I want to print 😀.

I found a simple trick that works, but it is not practical:

>>> import json
>>> json.loads('"\\ud83d\\ude00"')
'😀'

Answer 1

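Apart from the double quotes that JSON requires around a string, your example is close to the string output of JSON with ensure_ascii=True. It contains Unicode-escaped high/low surrogates for Unicode characters above U+FFFF.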
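
To make the JSON connection concrete, here is a small sketch that round-trips the emoji through the standard json module with its default ensure_ascii=True. The variable names encoded and decoded are only illustrative.

```python
import json

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
# A character above U+FFFF is written as a high/low surrogate pair.
encoded = json.dumps("😀")
print(encoded)  # "\ud83d\ude00" (including the surrounding double quotes)

# json.loads reverses the escaping and recombines the surrogate pair
# into a single code point.
decoded = json.loads(encoded)
print(len(decoded), hex(ord(decoded)))  # 1 0x1f600
```

This is why wrapping the escaped text in double quotes and passing it to json.loads works for this input.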

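Note that the unicode-escape codec cannot be used on its own for the conversion. It creates a Unicode string containing surrogate code points, which is illegal: you will not be able to print the string or encode it for serialization.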

>>> s = "\\ud83d\\ude00"
>>> s = s.encode('ascii').decode('unicode-escape')
>>> s
'\ud83d\ude00'
>>> print(s)  # UnicodeEncodeError: surrogates not allowed
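
As a rough illustration of what goes wrong, the sketch below inspects the string produced by unicode-escape and tries to encode it as UTF-8. It only uses built-in codecs; the exact error message can vary by Python version, so only the reason field is printed.

```python
s = "\\ud83d\\ude00".encode("ascii").decode("unicode-escape")

# The result holds two separate surrogate code points, not one emoji.
print(len(s))                    # 2
print([hex(ord(c)) for c in s])  # ['0xd83d', '0xde00']

# Lone surrogates cannot be encoded with the default strict error handler.
try:
    s.encode("utf-8")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)
```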

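Using the surrogatepass error handler with the utf-16 codec, you can undo the surrogates and decode the string properly. Note that this will also decode non-surrogate escape codes: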

>>> s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
>>> s.encode('ascii').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
'Hello马克😀'
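
For repeated use, the one-liner could be wrapped in a small function. This is a minimal sketch, and unescape_unicode is a hypothetical name rather than anything from the standard library. It assumes the input is pure ASCII (otherwise encode('ascii') raises UnicodeEncodeError), and because it relies on unicode-escape it will also interpret other backslash escapes such as \n.

```python
def unescape_unicode(text):
    """Hypothetical helper: turn literal \\uXXXX escapes into characters.

    Assumes `text` contains only ASCII characters.
    """
    # Step 1: interpret the escapes; surrogate pairs stay as two code points.
    raw = text.encode("ascii").decode("unicode-escape")
    # Step 2: write the surrogates out as UTF-16 code units, then decode
    # them again so each pair is combined into one code point.
    return raw.encode("utf-16", "surrogatepass").decode("utf-16")


result = unescape_unicode("Hello\\u9a6c\\u514b\\ud83d\\ude00")
print(result == "Hello\u9a6c\u514b\U0001F600")  # True
```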

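Older solution:

The following code replaces Unicode surrogates with their Unicode code point. If you have other non-surrogate Unicode escapes, it replaces them with their code points as well.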

import re

def process(m):
    '''process(m) -> Unicode code point

    m is a regular expression match object that has groups below:
     1: high Unicode surrogate 4-digit hex code d800-dbff
     2: low  Unicode surrogate 4-digit hex code dc00-dfff
     3: None
    OR
     1: None
     2: None
     3: Unicode 4-digit hex code 0000-d7ff,e000-ffff
    '''
    if m.group(3) is None:
        # Construct code point from UTF-16 surrogates
        hi = int(m.group(1),16) & 0x3FF
        lo = int(m.group(2),16) & 0x3FF
        # Add (not OR) the 0x10000 offset: hi << 10 can itself set bit 16
        cp = 0x10000 + (hi << 10 | lo)
    else:
        cp = int(m.group(3),16)
    return chr(cp)

s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
s = re.sub(r'\\u(d[89ab][0-9a-f]{2})\\u(d[cdef][0-9a-f]{2})|\\u([0-9a-f]{4})',process,s)
print(s)

Output:

Hello马克😀
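
The surrogate arithmetic can also be checked in isolation. The sketch below uses a hypothetical helper, combine_surrogates, that is not part of the answer's code. It takes the low 10 bits of each surrogate and adds the 0x10000 offset, and the second call shows a pair outside plane 1, where the offset has to be added rather than OR-ed in.

```python
def combine_surrogates(high, low):
    """Hypothetical helper: combine UTF-16 surrogate values into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 payload bits; the pair encodes cp - 0x10000.
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
print(hex(combine_surrogates(0xD840, 0xDC00)))  # 0x20000
```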

Answer 1
3 accepted 2019-02-07 06:56:28
