
Python 3.4 - Remove or ignore emoji characters when writing to file

I'm trying to parse through an XML file and write the contents to a plain text file. I have the program working so far up until it hits an emoji character, then Python throws the following error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 177-181: character maps to <undefined>

I went to the error location and found the following emojis in the XML file:

(image: emoji)

My question is: how do I either encode them to Unicode or remove/ignore them completely when writing to the file?

It outputs perfectly when I print() to the console, but throws an error when writing to file.

I have searched Google and here, but the only answers I am finding assume the characters are already encoded to Unicode. Mine, as you can see, are literals? I'm not sure if I'm saying that correctly.

Also the XML file I'm working with has the following format:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<?xml-stylesheet type="text/xsl" href="sms.xsl"?>
<smses count="1">
  <sms protocol="0" address="+00000000000" date="1346772606199" type="1" subject="null" body="Lorem ipsum dolor sit amet, consectetur adipisicing elit," toa="null" sc_toa="null" service_center="+00000000000" read="1" status="-1" locked="0" date_sent="1346772343000" readable_date="Sep 4, 2012 10:30:06 AM" contact_name="John Doe" />
</smses>

You have two options:

  1. Pick an encoding that can handle Emoji codepoints. You've opened your file for writing either with the default codec (which depends on your system), or picked an explicit encoding that doesn't support the codepoints.

    A UTF encoding would be able to handle the codepoints just fine; I'd pick UTF-8 here:

     with open(filename, 'w', encoding='utf8') as outfile:
         outfile.write(yourdata)
  2. Set an error handling mode that replaces codepoints your codec cannot handle with a replacement character or an escape sequence, or that ignores them altogether. See the errors argument of the open() function:

    errors is an optional string that specifies how encoding and decoding errors are to be handled (this cannot be used in binary mode). A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

    • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
    • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
    • 'replace' causes a replacement marker (such as '?' ) to be inserted where there is malformed data.
    • 'surrogateescape' will represent any incorrect bytes as low surrogate code units ranging from U+DC80 to U+DCFF. These surrogate code units will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
    • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn; .
    • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python's backslashed escape sequences.

    So opening the file with errors='ignore' silently drops the Emoji codepoints instead of raising an error:

     with open(filename, 'w', errors='ignore') as outfile:
         outfile.write(yourdata)
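As a rough end-to-end illustration of the two options, the sketch below writes the same string to two files in a temporary directory. The sample text and the names `utf8_path` and `ascii_path` are hypothetical, and `ascii` is used only as a stand-in for a codec that cannot represent Emoji:

```python
import os
import tempfile

# Hypothetical sample data; \U0001F44C is the OK HAND SIGN emoji.
text = 'Message body \U0001F44C'

tmpdir = tempfile.mkdtemp()
utf8_path = os.path.join(tmpdir, 'utf8.txt')
ascii_path = os.path.join(tmpdir, 'ascii.txt')

# Option 1: an encoding that can represent every codepoint.
with open(utf8_path, 'w', encoding='utf8') as outfile:
    outfile.write(text)

# Option 2: a limited encoding plus an error handler that drops
# whatever the codec cannot encode.
with open(ascii_path, 'w', encoding='ascii', errors='ignore') as outfile:
    outfile.write(text)

# Read both files back to compare what was actually stored.
with open(utf8_path, encoding='utf8') as infile:
    utf8_result = infile.read()
with open(ascii_path, encoding='ascii') as infile:
    ascii_result = infile.read()
```

With option 1 the file round-trips unchanged; with option 2 the Emoji is simply missing from the output.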

Demo:

>>> a_ok = 'The U+1F44C OK HAND SIGN codepoint: \U0001F44C'
>>> print(a_ok)
The U+1F44C OK HAND SIGN codepoint: 👌
>>> a_ok.encode('utf8')
b'The U+1F44C OK HAND SIGN codepoint: \xf0\x9f\x91\x8c'
>>> a_ok.encode('cp1251', errors='ignore')
b'The U+1F44C OK HAND SIGN codepoint: '
>>> a_ok.encode('cp1251', errors='replace')
b'The U+1F44C OK HAND SIGN codepoint: ?'
>>> a_ok.encode('cp1251', errors='xmlcharrefreplace')
b'The U+1F44C OK HAND SIGN codepoint: &#128076;'
>>> a_ok.encode('cp1251', errors='backslashreplace')
b'The U+1F44C OK HAND SIGN codepoint: \\U0001f44c'

Note that the 'surrogateescape' option has limited space and is really only useful for decoding a file with unknown encoding as best you can; it cannot handle Emoji in any case.
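If you would rather remove the Emoji from the string itself than rely on the codec, one rough approach is to drop every codepoint outside the Basic Multilingual Plane before writing. This is only a sketch under a loud assumption: many Emoji live above U+FFFF, but some (U+2764 HEAVY BLACK HEART, for example) do not, and any non-Emoji character above U+FFFF is removed as well. The helper name `strip_astral` is hypothetical:

```python
def strip_astral(text):
    """Return text without codepoints above U+FFFF (outside the BMP).

    Assumption: "Emoji" is approximated as "anything outside the BMP",
    which is neither a complete nor an exact definition of Emoji.
    """
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


sample = 'OK \U0001F44C done, caf\u00e9 stays'
cleaned = strip_astral(sample)
```

The cleaned string keeps ordinary non-ASCII text such as the accented letter, so it still needs an encoding (or error handler) that can represent those characters when written.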

(Edit: This answer is relevant to Python 2.x, not Python 3.x)

Currently you're writing unicode strings to the file with the default encoding, which doesn't support emoji (or, for that matter, a ton of characters that you probably really do want). You can instead write using the UTF-8 encoding, which supports all unicode characters.

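Instead of doing file.write(data), try file.write(data.encode("utf-8")). (On Python 3, a file opened in text mode rejects bytes with a TypeError, which is why this applies to Python 2 only.)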
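A sketch of an alternative that should behave the same on Python 2.6+ and Python 3, assuming `data` is a text (unicode) string: `io.open()` accepts an `encoding` argument and performs the encoding itself, so no manual `.encode()` call is needed. The path and sample data below are hypothetical:

```python
import io
import os
import tempfile

# Text (unicode) string; the u'' prefix is valid on Python 2 and 3.3+.
data = u'Emoji: \U0001F44C'

path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# io.open encodes the text to UTF-8 on write; pass text, not bytes.
with io.open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(data)

# Read the raw bytes back to confirm what was stored on disk.
with io.open(path, 'rb') as infile:
    raw = infile.read()
```

On Python 3, `io.open` is the same function as the built-in `open`, so this is equivalent to passing `encoding='utf-8'` to `open()` directly.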
