
Python, Remove characters, such as emoji, that cannot be handled by UTF8 MySQL DB

How can I replace characters, such as emojis 😀, that cannot be handled by a UTF8 MySQL DB?

The key is to remove ONLY those characters that cannot be handled. I got this code from the answer "removing emojis from a string in Python", but it's removing too much. (EDIT: The code below actually comes from "remove unicode emoji using re in python".)

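import re

# Pattern from the linked page: surrogate pairs for some emoji blocks,
# plus the BMP ranges U+2600-U+27BF (which include the heart symbol).
myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)

my_text = myre.sub(r'EMOJI', my_text)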

For example, this heart symbol ♥ can be saved to the DB, but is caught by the above regexp.

MySQL's utf8 (an alias for utf8mb3, which stores at most three bytes per character) encodes precisely the Basic Multilingual Plane (BMP). Rather than targeting emoji specifically, you need to exclude all code points from the supplementary planes, since in MySQL these require utf8mb4.
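
To make the BMP boundary concrete, here is a minimal sketch, assuming Python 3 (where a str is a sequence of code points). The helper name needs_utf8mb4 is hypothetical and not part of any library.

```python
def needs_utf8mb4(ch):
    """Return True if the single character ch lies outside the BMP."""
    # Code points above U+FFFF take 4 bytes in UTF-8, one more than
    # MySQL's 3-byte utf8 character set can store.
    return ord(ch) > 0xFFFF

# U+2665 is the heart suit, U+1F600 is the grinning face emoji.
for ch in ("\u2665", "\U0001F600"):
    print(hex(ord(ch)), len(ch.encode("utf-8")), needs_utf8mb4(ch))
```

The heart is reported as 3 bytes and inside the BMP, the emoji as 4 bytes and outside it.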

Since you appear to be matching against 16-bit rather than 32-bit wide strings, a code point outside the BMP is encoded as a so-called "high surrogate" in the range 0xD800..0xDBFF, followed by a "low surrogate" in the range 0xDC00..0xDFFF. The corresponding regex therefore is:

u'[\ud800-\udbff][\udc00-\udfff]'

♥ will not match this since it is u'\u2665'. I think strictly speaking it's only an emoji if followed by the variation selector U+FE0F, but either way it's safely in the BMP.
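
On Python 3 strings are not made of 16-bit units, so the surrogate ranges described above do not apply directly. A minimal sketch of the same idea, assuming Python 3, matches the supplementary planes by code point instead. The names NON_BMP and strip_non_bmp are hypothetical.

```python
import re

# Any code point in the supplementary planes (U+10000..U+10FFFF),
# i.e. everything MySQL's 3-byte utf8 cannot store.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def strip_non_bmp(text, replacement=""):
    """Replace every non-BMP character in text with replacement."""
    return NON_BMP.sub(replacement, text)

# The heart (U+2665) is kept; the emoji (U+1F600) is replaced.
print(strip_non_bmp("abcd \u2665 \U0001F600 efgh", "?"))
```

With these inputs the heart survives and only the emoji is replaced, which is the behaviour the question asks for.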

>>> u"abcd ♥ \ud83c".encode("utf-8", errors="replace").decode("utf-8")
'abcd ♥ ?'
