Python: Find the equivalent surrogate pair from a non-BMP Unicode character

Question

The answer to "How to work with surrogate pairs in Python?" shows how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP Unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair for a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that.

Answer 1

You'll have to manually replace each non-BMP code point with its surrogate pair. You could do this with a regular expression:

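import re

# Matches any single code point outside the Basic Multilingual Plane.
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # A non-BMP character encodes to 4 bytes in UTF-16: two 16-bit code units.
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    # Replace every non-BMP character with its surrogate pair.
    return _nonbmp.sub(_surrogatepair, text)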

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
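
For comparison, the same pair can be computed arithmetically from the code point, following the UTF-16 definition of surrogates, rather than by round-tripping through an encoder. This is only a sketch; the helper name surrogate_pair is hypothetical and not part of the answer above.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character.

    Hypothetical helper for illustration; not from the original answer.
    """
    cp = ord(char)
    if cp <= 0xFFFF:
        raise ValueError('character is already in the BMP')
    offset = cp - 0x10000             # 20-bit value left after removing the BMP range
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low (trail) surrogate
    return chr(high) + chr(low)


# U+1F64F: offset 0xF64F, giving high 0xD83D and low 0xDE4F
assert surrogate_pair('\U0001f64f') == '\ud83d\ude4f'
```

This should give the same result as _surrogatepair for any single non-BMP character, since the UTF-16 encoder performs the same split internally.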

Answer 2

It's a little complex, but here's a one-liner to convert a single character:

>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'

To convert a string that mixes BMP and non-BMP characters, wrap that expression in another one:

>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
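
As a sanity check, the expression above can be wrapped in a function and round-tripped with the encode/decode call quoted in the question. This is a sketch; the name to_surrogates is hypothetical and does not appear in the original answer.

```python
import struct


def to_surrogates(text):
    # Hypothetical wrapper around the one-liner above: BMP characters pass
    # through unchanged, non-BMP characters become two surrogate code points.
    return ''.join(
        c if c <= '\uffff'
        else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be')))
        for c in text
    )


original = 'Here is a non-BMP character: \U0001f64f'
converted = to_surrogates(original)

# One non-BMP character became two code points, so the string grows by one.
assert len(converted) == len(original) + 1

# The forward conversion from the question restores the original string.
restored = converted.encode('utf-16', 'surrogatepass').decode('utf-16')
assert restored == original
```

The 'surrogatepass' error handler is needed on the way back because the converted string contains lone surrogate code points, which the UTF-16 codec rejects under its default strict handler.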

Answer 1: 5, accepted, 2016-10-24 16:28:32
Answer 2: 3, 2016-10-24 17:23:36
