
How to decode a text in python3?

I have a text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following:

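import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")                   # fails: s is a str, not bytes
codecs.decode(bytes(s), "utf-8")            # fails: bytes() needs an encoding for a str
codecs.decode(bytes(s, "utf-8"), "utf-8")   # round-trips back to the same string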

but none of them gives the correct result, Aurélien.

How to do it correctly?

And is there no basic, authoritative page that describes all these encodings for Python?

First find the encoding of the string and then decode it. To do this you will need to make a byte string by adding the letter 'b' to the front of the original string literal.

Try this:

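import chardet  # third-party package

s = "Aur\xc3\xa9lien"
bs = b"Aur\xc3\xa9lien"

# chardet guesses the encoding from the raw bytes; short inputs can be misdetected
encoding = chardet.detect(bs)["encoding"]

# encode the str with the detected encoding, then decode the resulting bytes as UTF-8
# (named `decoded` rather than `str` to avoid shadowing the built-in)
decoded = s.encode(encoding).decode("utf-8")

print(decoded)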

If you are reading the text from a file, you can detect the encoding using the magic library; see here: https://stackoverflow.com/a/16203777/1544937
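Whichever detection library you use, it helps to read the file in binary mode so that you get the raw bytes and choose the codec yourself. A minimal sketch using only the standard library; the file name `name.txt` and the assumption that the file is UTF-8 are illustrative and not taken from the question.

```python
import tempfile
from pathlib import Path

# Create a throwaway file holding the raw UTF-8 bytes (for demonstration only).
path = Path(tempfile.mkdtemp()) / "name.txt"
path.write_bytes(b"Aur\xc3\xa9lien")

# Reading in binary mode returns bytes; nothing has been decoded yet.
raw = path.read_bytes()

# Decode explicitly with the encoding you detected or already know.
text = raw.decode("utf-8")
print(text)  # Aurélien

# Equivalent: let open() do the decoding by passing the encoding explicitly.
with open(path, encoding="utf-8") as f:
    same_text = f.read()
```

Passing `encoding=` explicitly avoids depending on the platform's default encoding, which is often where this kind of mis-decoding starts.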

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 and then decode as UTF-8.

s = "Aur\xc3\xa9lien"
s.encode('latin-1').decode('utf-8')
print(s.encode('latin-1').decode('utf-8'))

Output
Aurélien
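The round trip above can be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` is made up for illustration, and it assumes the text really is UTF-8 that was decoded as latin-1. When that assumption does not hold, it returns the input unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as latin-1.

    Hypothetical helper for illustration, not part of any library.
    """
    try:
        # latin-1 maps code points 0-255 back to the same byte values,
        # so this recovers the original raw bytes.
        raw = text.encode("latin-1")
        # Interpret those bytes as what they really were: UTF-8.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not latin-1 mojibake of valid UTF-8; leave it alone.
        return text


print(fix_mojibake("Aur\xc3\xa9lien"))  # Aurélien
print(fix_mojibake("Aurélien"))         # Aurélien (already correct, unchanged)
```

Catching both exceptions keeps the helper safe to call on text that is already correct or that contains characters outside latin-1.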

Your data is a sequence of UTF-8 bytes rather than Unicode text, so you should write the literal with a b prefix:

import codecs
b = b"Aur\xc3\xa9lien"
b.decode('utf-8')

So you get the expected result: 'Aurélien'.
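To see why the b prefix matters, it can help to compare the two literals directly. The sketch below (variable names are illustrative) shows that the un-prefixed literal already holds two separate characters where the bytes literal holds one two-byte UTF-8 sequence.

```python
s = "Aur\xc3\xa9lien"    # str: \xc3 and \xa9 are the code points U+00C3 and U+00A9
b = b"Aur\xc3\xa9lien"   # bytes: \xc3 and \xa9 are two raw byte values

print(type(s).__name__, len(s))  # str 9
print(type(b).__name__, len(b))  # bytes 9

# In the str, the escapes became two separate characters.
print([hex(ord(ch)) for ch in s[3:5]])  # ['0xc3', '0xa9']

# Decoding the bytes merges the two-byte sequence into a single character.
decoded = b.decode("utf-8")
print(len(decoded))  # 8
```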

If you want to use s, encode it with an 8-bit codec that maps each character back to the byte with the same value. latin-1 does this for all 256 byte values (a 1-to-1 mapping). Other 8-bit codecs such as mbcs or mac_roman assign some byte values to different characters, so they are only safe when every character in the string happens to map back to its original byte. Encoding gives you a bytes object, which you can then decode as in the first part of this answer.
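The 1-to-1 mapping can be checked directly. A small sketch (variable names are illustrative): latin-1 round-trips every byte value, whereas mac_roman produces different bytes for this particular string, so it is worth testing a codec before relying on it.

```python
all_bytes = bytes(range(256))

# latin-1 decodes byte n to code point n, so the round trip is lossless.
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

s = "Aur\xc3\xa9lien"
latin1_bytes = s.encode("latin-1")
mac_bytes = s.encode("mac_roman")

print(latin1_bytes)  # b'Aur\xc3\xa9lien'

# mac_roman stores U+00C3 at a different byte value,
# so the original bytes are not recovered.
print(mac_bytes == latin1_bytes)  # False
```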
