
How do I convert unicode to unicode-escaped text

I'm loading a file with a bunch of unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on StackOverflow, including "Evaluate UTF-8 literal escape sequences in a string in Python3", which does almost exactly what I want, but I can't work out how to save the data.

For example: Input file:

\xe9\x87\x8b

Python Script

file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()

Output File (that I would like):

\u91cb

You need to encode it again with the unicode-escape encoding.

>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'

Modified code (uses binary mode to avoid unnecessary encodes/decodes):

with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing spaces
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
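
The same chain of steps can be wrapped in a small helper to make each stage explicit. This is only a sketch: the function name `reescape` is hypothetical, and it assumes the input contains nothing but ASCII text and literal `\xHH` escapes of UTF-8 bytes. If the input already contained `\uXXXX` escapes above U+00FF, the `latin1` step would raise `UnicodeEncodeError`.

```python
def reescape(raw: bytes) -> bytes:
    """Convert literal backslash-x escapes of UTF-8 bytes into backslash-u escapes."""
    # Step 1: interpret the backslash escapes; each \xHH becomes code point U+00HH.
    code_points = raw.decode('unicode-escape')
    # Step 2: latin1 maps U+0000..U+00FF one-to-one onto bytes,
    # which recovers the real UTF-8 byte sequence.
    utf8_bytes = code_points.encode('latin1')
    # Step 3: decode the UTF-8 bytes into actual text.
    text = utf8_bytes.decode('utf-8')
    # Step 4: re-encode as ASCII-only escapes.
    return text.encode('unicode-escape')


result = reescape(rb'\xe9\x87\x8b')
print(result)  # b'\\u91cb'
```

Plain ASCII input passes through unchanged, so the helper can be applied to a whole file that mixes ordinary text with escaped bytes.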

http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq

\xe9\x87\x8b is not a Unicode character. It looks like a representation of a bytestring that represents a Unicode character encoded using the utf-8 character encoding. \u91cb is a representation of the character in Python source code (or in JSON format). Don't confuse the text representation and the character itself:

>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
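
To make the distinction concrete, here is a small sketch (Python 3 assumed; the variable names are only illustrative) that compares the lengths of the character, its UTF-8 bytes, and its escaped spelling:

```python
ch = b"\xe9\x87\x8b".decode("utf-8")  # three UTF-8 bytes decode to one character

print(len(ch))                  # 1: a single code point
print(hex(ord(ch)))             # 0x91cb
print(len(ch.encode("utf-8")))  # 3: its UTF-8 encoding takes three bytes

literal = r"\u91cb"             # raw string: backslash, u, 9, 1, c, b
print(len(literal))             # 6: the escaped spelling is six characters

# ascii() returns the repr with non-ASCII characters escaped.
print(ascii(ch))                # '\u91cb'
```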

To read text encoded as utf-8 from a file, specify the character encoding explicitly:

with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()

It is exactly the same for saving Unicode text to a file:

with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)

If you omit the explicit encoding parameter, then locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.
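
As an illustration of how a mismatched encoding goes wrong, the sketch below decodes UTF-8 bytes as latin-1. The choice of latin-1 is an assumption made for the demo; the actual default depends on your platform and locale.

```python
import locale

# The locale-dependent default discussed above.
print(locale.getpreferredencoding(False))

utf8_bytes = "釋".encode("utf-8")  # b'\xe9\x87\x8b'

# Decoding with the wrong codec raises no error here; it silently
# yields three unrelated characters (mojibake).
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))  # 3

# The damage is reversible only if you know which wrong codec was used.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == "釋")  # True
```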

If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.
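
A quick way to see the "4 characters" point is to compare a raw bytes literal with a real byte. This is a minimal sketch and the variable names are illustrative only:

```python
literal = rb'\xe9'  # raw literal: backslash, x, e, 9 as four separate bytes
actual = b'\xe9'    # a single byte with value 0xE9

print(len(literal), len(actual))  # 4 1
print(literal == b'\\xe9')        # True: the file would hold a real backslash
```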

It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, as per your reference):

with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()

text will contain the content of the file as a str (i.e. a Unicode string). Now you can write it in Unicode-escaped form directly to a file by specifying encoding='unicode-escape':

with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)

Your file will now contain unicode-escaped literals:

$ cat output.txt
\u91cb
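
One thing to be aware of with this approach is that the unicode-escape codec also escapes control characters such as newlines. The sketch below writes to a temporary directory (an assumption for the demo, so no real files are touched) and reads the bytes back to show this:

```python
import os
import tempfile

text = "釋\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")
    # newline="" disables newline translation so the bytes match on every platform.
    with open(path, "w", encoding="unicode-escape", newline="") as f:
        f.write(text)
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'\\u91cb\\n': the newline is written as backslash + n

# Decoding the bytes with the same codec restores the original text.
restored = raw.decode("unicode-escape")
print(restored == text)  # True
```

If you want real line breaks in the output file, one option is to escape each line separately and join the results with newlines yourself.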
