I have the text Aur\xc3\xa9lien
and want to decode it with Python 3.8.
I tried the following:
import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")
codecs.decode(bytes(s), "utf-8")
codecs.decode(bytes(s, "utf-8"), "utf-8")
but none of them gives the correct result, Aurélien.
How do I do this correctly?
And is there no basic, general, authoritative, simple page that describes all these encodings for Python?
First find the encoding of the data, then decode it. To do this you need a byte string, which you get by adding the letter 'b' in front of the original string literal.
Try this:
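import chardet
bs = b"Aur\xc3\xa9lien"  # bytes literal, not str
# chardet only guesses; on input this short the guess may be wrong or None
encoding = chardet.detect(bs)["encoding"]
decoded = bs.decode(encoding)  # decode the bytes, not the str
print(decoded)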
If you are reading the text from a file, you can detect the encoding using the magic lib, see here: https://stackoverflow.com/a/16203777/1544937
You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.
s = "Aur\xc3\xa9lien"
s.encode('latin-1').decode('utf-8')
print(s.encode('latin-1').decode('utf-8'))
Output
Aurélien
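To see where such a string comes from, here is a minimal sketch that reproduces the problem and then reverses it. It assumes the text started out as UTF-8 bytes and was decoded with latin-1 somewhere along the way; the variable names are only illustrative.

```python
# Assumption: the data was UTF-8 but was decoded with the wrong codec (latin-1).
original = "Aurélien"
utf8_bytes = original.encode("utf-8")      # b'Aur\xc3\xa9lien'
mojibake = utf8_bytes.decode("latin-1")    # same as the str "Aur\xc3\xa9lien"

# Reverse the wrong step, then apply the right one.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))
print(repaired)
```

The repair only works because latin-1 turns every byte back into exactly the byte it came from.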
Your string really holds raw UTF-8 bytes rather than the intended Unicode text, so you should prefix the literal with b:
import codecs
b = b"Aur\xc3\xa9lien"
b.decode('utf-8')
So you get the expected 'Aurélien'.
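The difference between the str from the question and the bytes literal can be checked with a short sketch. It also hints at why two of the question's attempts could not work: bytes(s) without an encoding raises TypeError for a str, and encoding then decoding with the same codec just returns the original string. The helper variable names are assumptions for illustration.

```python
s = "Aur\xc3\xa9lien"    # str: the escapes are the code points U+00C3 and U+00A9
b = b"Aur\xc3\xa9lien"   # bytes: the escapes are the raw byte values 0xC3 and 0xA9

# bytes() refuses a str unless an encoding is given.
try:
    bytes(s)
    bytes_without_encoding_ok = True
except TypeError:
    bytes_without_encoding_ok = False

# Encoding and decoding with the same codec is a no-op round trip.
round_trip = bytes(s, "utf-8").decode("utf-8")

decoded = b.decode("utf-8")
print(bytes_without_encoding_ok, round_trip == s, decoded)
```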
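If you want to use s, encode it with latin-1 first. latin-1 maps the code points 0 to 255 one-to-one onto the byte values 0 to 255, so it recovers the binary data in your string exactly. Other 8-bit encodings such as mbcs (Windows only) or mac_roman work only if they were the codec that produced the string in the first place, because they assign some characters to different byte values. Once you have the bytes, you can decode them as UTF-8 as in the first part of this answer.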
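As a rough check of the one-to-one mapping, the sketch below verifies it for latin-1 and compares the bytes that latin-1 and mac_roman produce for s. The fix_mojibake helper is hypothetical, not a standard library function, and it assumes the damage was caused by a latin-1 decode of UTF-8 data.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as latin-1.

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# latin-1 maps code points 0..255 onto the identical byte values.
all_chars = "".join(chr(i) for i in range(256))
latin1_is_identity = all_chars.encode("latin-1") == bytes(range(256))

s = "Aur\xc3\xa9lien"
via_latin1 = s.encode("latin-1")       # the original byte values
via_mac_roman = s.encode("mac_roman")  # different byte value for U+00C3

print(latin1_is_identity, via_latin1, via_mac_roman)
print(fix_mojibake(s))
```

Falling back to the unchanged text is a design choice for this sketch; stricter code might prefer to let the exception propagate.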