Remove characters outside of the BMP (emoji) in Python 3

Question

I have an error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk

I'm parsing data, and some emoji end up in the array.

data = "this variable contains some emoji'sツ😂"

I want:

data = "this variable contains some emoji's"

How can I remove these characters from my data, or otherwise handle this situation in Python 3?

Answer 1

If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:

data = "this variable contains some emoji'sツ😂"
data = ''.join(c for c in data if c <= '\uFFFF')
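The same filter can be wrapped in a small helper. This is only a sketch; the name `strip_non_bmp` is made up for illustration. Note that ツ (U+30C4) lies inside the BMP, so this approach keeps it and removes only 😂 (U+1F602).

```python
def strip_non_bmp(text: str) -> str:
    """Return text with every code point above U+FFFF removed (hypothetical helper)."""
    return ''.join(c for c in text if ord(c) <= 0xFFFF)


data = "this variable contains some emoji's\u30c4\U0001F602"  # ends with ツ and 😂
cleaned = strip_non_bmp(data)
print(cleaned)  # ツ survives because it is a BMP character
```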

It's possible your string is in decomposed form, so you may need to normalize it to composed form (NFC) first so that the non-BMP characters are identifiable:

import unicodedata

data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
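If silently dropping characters loses too much, one alternative is to substitute a placeholder instead, so the text still shows where something was removed. The sketch below uses a regular expression for this; `replace_non_bmp` and the choice of U+FFFD as the default placeholder are assumptions for illustration, not part of the answer above.

```python
import re

# Matches any code point in the supplementary planes (U+10000 to U+10FFFF).
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_non_bmp(text: str, placeholder: str = '\uFFFD') -> str:
    """Swap each non-BMP character for a placeholder (hypothetical helper)."""
    # A function is used as the replacement so that backslashes in the
    # placeholder are never interpreted as regex escape sequences.
    return NON_BMP.sub(lambda match: placeholder, text)


print(replace_non_bmp("laugh \U0001F602 done"))
```

Passing an empty string as the placeholder gives the same result as the filtering approach.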

Answer 2

>>> import string
>>> printable = set(string.printable)
>>> # In Python 3, filter() returns an iterator, so join the result back into a str
>>> ''.join(filter(lambda x: x in printable, data))
"this variable contains some emoji's"
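Filtering on `string.printable` is much stricter than filtering on the BMP: it keeps only printable ASCII, so accented letters and BMP characters such as ツ are removed along with the emoji. A rough side-by-side comparison, with both helper names invented for this sketch:

```python
import string

PRINTABLE = set(string.printable)


def keep_printable_ascii(text: str) -> str:
    """Keep only characters found in string.printable (ASCII only)."""
    return ''.join(c for c in text if c in PRINTABLE)


def strip_non_bmp(text: str) -> str:
    """Keep every character at or below U+FFFF."""
    return ''.join(c for c in text if ord(c) <= 0xFFFF)


sample = "caf\u00e9 \u30c4\U0001F602"  # "café ツ😂"
ascii_only = keep_printable_ascii(sample)  # drops é, ツ and 😂
bmp_only = strip_non_bmp(sample)           # drops only 😂
print(repr(ascii_only), repr(bmp_only))
```

Which one fits depends on whether non-ASCII text such as ツ should survive.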

For BMP, read this: removing emojis from a string in Python

2 answers

Answer 1
6, accepted, 2016-03-29 13:03:03

Answer 2
-1, 2016-03-29 13:12:00
