Python3: print text with emojis read from a text file with non-ASCII characters (unicode_escape)

Question

I want to read lines of a text file that include emojis and non-ASCII characters and finally print them out. The problem is that I can print either the emoji glyph correctly or the non-ASCII character (e.g. ü), but not both.

Line in the text file (UTF-8 encoded):

I am tired. - Ich bin müde \U0001F4A4

Code to read:

with open(path_txt,"r", encoding="unicode_escape") as file:
    content = file.readlines()
    print(content[0])

With encoding="unicode_escape" I get the sleep emoji and some cryptic characters for "ü".
With encoding="utf-8" (or the default) it prints the literal sequence \U0001F4A4 for the emoji and the correct "ü". In the second case \U... gets double-escaped to \\U. I thought str.replace("\\U", "\U") could be a workaround, but it raises an error:

'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape

I also tried encoding="raw_unicode_escape". As a beginner I don't understand the whole Unicode topic. Thanks for your help/workarounds!

Similar/same problem here (04/2014): https://bugs.python.org/issue21331

Answer 1

It seems that the content is a mixture of escapes (for the emoji) and UTF-8-encoded characters (for "ü").

It's not entirely clear from your post, but I assume that if you read the file in binary mode (open(path, 'rb')) and printed the first line, you would see this:

b'm\xc3\xbcde \\U0001f4a4'

This means that "ü" was encoded with UTF-8, but the emoji was escaped. Note: You see escape sequences for "ü" too, but that's just the representation. Try len(b'\xc3') and you'll see that this is actually a byte string of length 1. b'\\U0001f4a4', on the other hand, really is an escape sequence of length 10.
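A quick check of those lengths, as a minimal sketch (the variable names are only for illustration):

```python
# The UTF-8 encoding of "ü" is two real bytes; \xc3 and \xbc are only
# how repr() displays them.
u_umlaut = "ü".encode("utf-8")
print(u_umlaut)       # b'\xc3\xbc'
print(len(u_umlaut))  # 2

# The escaped emoji is plain ASCII text: a backslash, "U", and eight hex digits.
escaped_emoji = b"\\U0001f4a4"
print(len(escaped_emoji))  # 10
```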

Now the "unicode-escape" codec does not expect exactly this format. It interprets unescaped non-ASCII bytes as Latin-1, which is why you see garbled characters instead of "ü" when using this codec:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('unicode-escape')
'mÃ¼de 💤'
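The garbled pair can be reproduced without any escapes in the input. A small sketch showing that it is simply the Latin-1 reading of the two UTF-8 bytes (variable names are illustrative):

```python
utf8_bytes = "ü".encode("utf-8")          # b'\xc3\xbc'
as_latin1 = utf8_bytes.decode("latin-1")  # each byte becomes one character
print(as_latin1)                          # Ã¼

# The escape codec treats the same bytes the same way.
print(as_latin1 == utf8_bytes.decode("unicode-escape"))  # True
```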

But if "unicode-escape" wants Latin-1, we can give it that. First, we decode with UTF-8 to get "ü" right:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8')
'müde \\U0001f4a4'

This doesn't touch the emoji escape, since it's all ASCII. Characters from the ASCII range are encoded identically in Latin-1 and UTF-8 (and ASCII).

Now we encode with Latin-1:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1')
b'm\xfcde \\U0001f4a4'

And this is something the "unicode-escape" codec understands:

>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1').decode('unicode-escape')
'müde 💤'
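The three steps can be wrapped in one helper. This is only a sketch: decode_mixed is a hypothetical name, not something from the standard library. It also shows a limitation of the Latin-1 detour: text containing characters outside Latin-1 (such as "€") cannot be re-encoded that way.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode UTF-8 bytes that also contain literal \\UXXXXXXXX escapes."""
    text = raw.decode("utf-8")              # real UTF-8 characters such as "ü"
    latin1 = text.encode("latin-1")         # one byte per character again
    return latin1.decode("unicode-escape")  # now resolve the escapes

print(decode_mixed(b"m\xc3\xbcde \\U0001f4a4"))  # müde 💤

# Limitation: "€" (U+20AC) is not part of Latin-1, so the detour fails here.
try:
    decode_mixed("5 € \\U0001f4a4".encode("utf-8"))
except UnicodeEncodeError as exc:
    print("cannot encode as Latin-1:", exc.reason)
```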

In your setup, you can defer the first decode step to the internal processing of open():

with open(path_txt, "r", encoding="utf-8") as file:
    for line in file:
        line = line.encode('latin1').decode('unicode-escape')
        # do something with line
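If the file may also contain characters outside Latin-1, or backslash sequences such as \n that should stay literal, a different technique may fit better: replace only the \UXXXXXXXX escapes with a regular expression. This is not the codec-based method above but a swapped-in alternative. ESCAPE_RE, expand_escapes and the temporary file are names invented for this sketch, and it deliberately handles only the eight-digit \U form.

```python
import os
import re
import tempfile

# A literal backslash, "U", and exactly eight hex digits.
ESCAPE_RE = re.compile(r"\\U([0-9a-fA-F]{8})")

def expand_escapes(text: str) -> str:
    """Replace literal \\UXXXXXXXX sequences with the characters they name."""
    # chr() raises ValueError for values above 0x10FFFF.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Round trip through a throwaway file that mimics the one in the question.
with tempfile.TemporaryDirectory() as tmp:
    path_txt = os.path.join(tmp, "lines.txt")
    with open(path_txt, "w", encoding="utf-8") as file:
        file.write("I am tired. - Ich bin müde \\U0001F4A4\n")
    with open(path_txt, "r", encoding="utf-8") as file:
        for line in file:
            print(expand_escapes(line), end="")
```

Because the text is never re-encoded, a line containing "€" or any other non-Latin-1 character passes through unchanged.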
