
Python3: print text with emojis read from text-file with non ASCII-characters (unicode_escape)

I want to read lines from a text file that contain emojis and non-ASCII characters and then print them. The problem is that I can print either the emoji glyph or the non-ASCII character (e.g. ü) correctly, but not both.

Line in text-file (with UTF-8 format):

I am tired. - Ich bin müde \U0001F4A4

Code to read:

with open(path_txt,"r", encoding="unicode_escape") as file:
    content = file.readlines()
    print(content[0])
  1. With encoding="unicode_escape" I get the sleep emoji and some cryptic characters for "ü".
  2. With encoding="utf-8" (or the default) it prints the literal escape sequence \U0001F4A4 for the emoji and the correct "ü". In the second case \U... gets double-escaped to \\U. I thought str.replace("\\U", "\U") could be a workaround, but it raises an error:

'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape

I also tried encoding="raw_unicode_escape". As a beginner I don't understand the whole unicode topic. Thanks for your help/workarounds!!

Similar/Same Problem here (04/2014): https://bugs.python.org/issue21331

It seems that the content is in some mixture of escapes (for the emoji) and UTF-8-encoded characters (for "ü").

It's not entirely clear from your post, but I assume that if you read the file in binary mode (open(path, 'rb')) and printed the first line, you would see something like this:

b'm\xc3\xbcde \\U0001f4a4'

This means that "ü" was encoded with UTF-8, but the emoji was escaped. Note: You see escape sequences for "ü" too, but that's just the representation. Try len(b'\xc3') and you'll see that this is actually a length-1 byte string. b'\\U0001f4a4' on the other hand is really an escape sequence with length 10.
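
The difference between the two kinds of bytes can be checked directly. The following is a small sketch, assuming the raw line looks like the byte string shown above; the variable names are only illustrative.

```python
# Raw bytes as they would appear when reading the file in binary mode
# (assumed content, taken from the example above).
raw = b'm\xc3\xbcde \\U0001f4a4'

utf8_part = raw[1:3]      # b'\xc3\xbc': the two UTF-8 bytes that encode "ü"
escape_part = raw[-10:]   # b'\\U0001f4a4': backslash, "U" and eight hex digits

print(len(utf8_part))              # 2 (two real bytes, shown as \x.. in the repr)
print(len(escape_part))            # 10 (ten literal ASCII characters)
print(utf8_part.decode('utf-8'))   # ü
```

So "ü" is stored as two real bytes, while the emoji is stored as ten plain ASCII characters that merely spell out an escape sequence.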

Now the "unicode-escape" codec does not expect exactly this format. It interprets unescaped non-ASCII bytes as Latin-1, which is why you see garbled characters instead of "ü" when using this codec:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('unicode-escape')
'mÃ¼de 💤'

But if "unicode-escape" wants Latin-1, we can give it Latin-1. First, we decode with UTF-8 to get "ü" right:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8')
'müde \\U0001f4a4'

This doesn't touch the emoji escape, since it's all ASCII. Characters from the ASCII range are encoded identically for Latin-1 and UTF-8 (and ASCII).

Now we encode with Latin-1:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1')
b'm\xfcde \\U0001f4a4'

and this is something the "unicode-escape" codec understands:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1').decode('unicode-escape')
'müde 💤'
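
The three steps can also be wrapped in a small helper. This is only a sketch: decode_mixed is a hypothetical name, and it assumes the input is UTF-8 bytes whose decoded characters all fit into Latin-1.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode UTF-8 bytes that also contain literal \\UXXXXXXXX escapes."""
    text = raw.decode('utf-8')               # step 1: get "ü" right
    latin1 = text.encode('latin-1')          # step 2: re-encode for the escape codec
    return latin1.decode('unicode-escape')   # step 3: expand the escapes


result = decode_mixed(b'm\xc3\xbcde \\U0001f4a4')
print(result)
```

Note that step 2 raises UnicodeEncodeError if the text contains a character outside Latin-1, such as "€" or an emoji that was stored directly rather than escaped.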

In your setup, you can leave the first decoding step to open():

with open(path_txt, "r", encoding="utf-8") as file:
    for line in file:
        line = line.encode('latin1').decode('unicode-escape')
        # do something with line
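
One caveat: line.encode('latin1') fails with UnicodeEncodeError as soon as a line contains a character outside Latin-1 (for example "€"). If that can happen in your file, an alternative is to expand only the \uXXXX and \UXXXXXXXX escapes with a regular expression and leave everything else untouched. The following is a minimal sketch of that idea, not part of the approach above; expand_escapes and _ESCAPE are hypothetical names, and other escapes such as \n or \xNN are deliberately not handled.

```python
import re

# Matches a literal backslash followed by U + 8 hex digits or u + 4 hex digits.
_ESCAPE = re.compile(r'\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})')


def expand_escapes(text: str) -> str:
    """Replace literal \\uXXXX and \\UXXXXXXXX escapes with their characters."""
    def _replace(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(_replace, text)


# A line as it would be returned by a file opened with encoding="utf-8".
line = 'Preis: 5 \u20ac - m\u00fcde \\U0001F4A4'
print(expand_escapes(line))
```

In the loop above, this would replace the encode/decode line with line = expand_escapes(line).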
