Python: Bytes to string with accented characters
I have git reading the file name "ùàèòùèòùùè.txt" as a simple string of bytes, so when I ask git for a list of committed files, I'm given the following string:
r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
How can I use Python 2 to convert it back to "ùàèòùèòùùè.txt"?
If the git format contains literal \ddd octal escape sequences (so up to 4 characters per filename byte), you can use the string_escape (Python 2) or unicode_escape (Python 3) codec to have Python interpret the escape sequences.
You'll get UTF-8 data; my terminal is set to interpret UTF-8 directly:
>>> git_data = r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('string_escape')
'\xc3\xb9\xc3\xa0\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xb9\xc3\xa8.txt'
>>> print git_data.decode('string_escape')
ùàèòùèòùùè.txt
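To see why the escaped string turns into the bytes shown above, it may help to decode the first character by hand. The following is a minimal Python 3 sketch; the variable names are illustrative and not part of the original answer:

```python
# The first two escapes in the question's string are \303 and \271.
# Written as octal integer literals, they are the two bytes of one character.
first_char_bytes = bytes([0o303, 0o271])

# Octal 303 is hex C3 and octal 271 is hex B9, which matches the
# '\xc3\xb9' pair at the start of the decoded byte string.
assert first_char_bytes == b'\xc3\xb9'

# Interpreted as UTF-8, those two bytes form the single character U+00F9.
first_char = first_char_bytes.decode('utf-8')
assert first_char == '\u00f9'
assert len(first_char) == 1
```

Each accented character in the file name takes two UTF-8 bytes, which is why every one of them appears as a pair of \ddd escapes.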
You'd want to decode that as UTF-8 to get text:
>>> git_data.decode('string_escape').decode('utf8')
u'\xf9\xe0\xe8\xf2\xf9\xe8\xf2\xf9\xf9\xe8.txt'
>>> print git_data.decode('string_escape').decode('utf8')
ùàèòùèòùùè.txt
In Python 3, the unicode_escape codec gives you (Unicode) text, so an extra encode to Latin-1 is required to make it bytes again:
>>> git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('unicode_escape').encode('latin1').decode('utf8')
'ùàèòùèòùùè.txt'
Note that git_data is a bytes object before decoding.
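The Python 3 steps can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name unquote_git_path is hypothetical, and the quote stripping assumes the path may arrive wrapped in one pair of double quotes, which the question's string does not show.

```python
def unquote_git_path(raw: bytes) -> str:
    """Turn an escaped path such as rb'\303\271.txt' into text (hypothetical helper)."""
    # Assumption: the path may be wrapped in one pair of double quotes.
    if len(raw) >= 2 and raw.startswith(b'"') and raw.endswith(b'"'):
        raw = raw[1:-1]
    # unicode_escape turns each \ddd escape into the code point with that value,
    # Latin-1 maps those code points back to bytes with the same values,
    # and the final step decodes those bytes as UTF-8 text.
    return raw.decode('unicode_escape').encode('latin1').decode('utf8')


git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
file_name = unquote_git_path(git_data)
```

The Latin-1 round trip works because Latin-1 maps code points 0 to 255 one-to-one onto byte values, so no information is lost between the two decodes. Like the one-liner it wraps, the helper expects the input to be pure ASCII apart from the escapes.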