Remove characters outside of the BMP (emojis) in Python 3
I get the following error:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk
I'm parsing data, and some emojis end up in my array.
data = "this variable contains some emoji'sツ😂"
I want: data = "this variable contains some emoji's"
How can I remove these characters from my data, or otherwise handle this situation, in Python 3?
If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:
data = "this variable contains some emoji'sツ😂"
data = ''.join(c for c in data if c <= '\uFFFF')
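As a minimal sketch of the same filter wrapped in a function (the name strip_non_bmp is hypothetical, not from the original answer): note that ツ (U+30C4) lies inside the BMP, so this filter keeps it and drops only 😂 (U+1F602). If ツ must go as well, a narrower whitelist is needed.

```python
def strip_non_bmp(text: str) -> str:
    """Drop every code point above U+FFFF (hypothetical helper name)."""
    return ''.join(c for c in text if ord(c) <= 0xFFFF)


data = "this variable contains some emoji'sツ😂"
cleaned = strip_non_bmp(data)

# ツ is U+30C4, inside the BMP, so it survives; 😂 is U+1F602 and is removed.
print(cleaned)
```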
It's possible your string is in decomposed form, so you may need to normalize it to composed form first so the non-BMP characters are identifiable:
import unicodedata
data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
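If the text is only being filtered so that Tk can display it, one alternative to deleting characters is to substitute a placeholder, so the reader can still see that something was there. The sketch below assumes U+FFFD (the Unicode replacement character) as the placeholder; the helper name replace_non_bmp and its placeholder parameter are illustrative, not part of the original answer.

```python
import re

# Matches any single code point above U+FFFF, i.e. outside the BMP.
_NON_BMP = re.compile(r'[^\u0000-\uFFFF]')


def replace_non_bmp(text: str, placeholder: str = '\uFFFD') -> str:
    """Swap each non-BMP code point for a placeholder (hypothetical helper)."""
    # A callable is used so the placeholder is never parsed as a regex template.
    return _NON_BMP.sub(lambda match: placeholder, text)


shown = replace_non_bmp("ok 😂")
print(shown)
```

Passing an empty placeholder makes this behave like plain removal.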
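Alternatively, keep only ASCII printable characters. In Python 3, filter() returns an iterator, so join the result back into a string:
>>> import string
>>> printable = set(string.printable)
>>> ''.join(filter(lambda x: x in printable, data))
"this variable contains some emoji's"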
For characters inside the BMP, see: removing emojis from a string in Python
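One trade-off of the string.printable filter is worth seeing concretely: string.printable contains only ASCII, so the filter removes every non-ASCII character, not just emojis. A short sketch (the function name keep_ascii_printable is hypothetical):

```python
import string

PRINTABLE = set(string.printable)  # ASCII letters, digits, punctuation, whitespace


def keep_ascii_printable(text: str) -> str:
    """Keep only ASCII printable characters (hypothetical helper)."""
    return ''.join(c for c in text if c in PRINTABLE)


print(keep_ascii_printable("emoji'sツ😂"))  # emoji's
print(keep_ascii_printable("café"))        # caf
```

This matches the output the question asks for, but it would also strip accented letters and any non-Latin script, so it only suits data that is expected to be ASCII.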