
How to decode text in Python 3?

I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following:

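import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")                  # TypeError: s is a str, not bytes
codecs.decode(bytes(s), "utf-8")           # TypeError: bytes() needs an encoding for a str
codecs.decode(bytes(s, "utf-8"), "utf-8")  # round trip, returns s unchanged: 'AurÃ©lien'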

but none of them gives the correct result, Aurélien.

How do I do this correctly?

And is there a basic, authoritative page that describes all these encodings for Python?

First find the encoding of the data, then decode it. To do this you need a byte string, which you get by adding the letter 'b' in front of the original string literal.

Try this:

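import chardet  # third-party package: pip install chardet

bs = b"Aur\xc3\xa9lien"  # byte string: note the b prefix

# chardet only guesses; on very short inputs the guess can be wrong or None
encoding = chardet.detect(bs)["encoding"]

text = bs.decode(encoding)  # decode the bytes, not the str (and don't shadow the built-in str)

print(text)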

If you are reading the text from a file, you can detect the encoding using the magic library; see here: https://stackoverflow.com/a/16203777/1544937

You have UTF-8 bytes that were decoded as latin-1, so the solution is to encode the string back to bytes as latin-1 and then decode those bytes as UTF-8.

s = "Aur\xc3\xa9lien"
print(s.encode('latin-1').decode('utf-8'))

Output
Aurélien
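
To see why this works, here is a minimal sketch of the full round trip, assuming the text was originally Aurélien encoded as UTF-8 and then misread as latin-1. The variable names are only for illustration.

```python
original = "Aur\u00e9lien"  # the intended text, Aurélien

# Step 1: the text is stored or sent as UTF-8 bytes.
raw = original.encode("utf-8")       # b'Aur\xc3\xa9lien'

# Step 2: something reads those bytes with the wrong codec (latin-1).
garbled = raw.decode("latin-1")      # same value as the str "Aur\xc3\xa9lien"

# Step 3: undo step 2 to get the bytes back, then decode them correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

The fix is simply step 2 run backwards, followed by the decode that should have happened in the first place.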

Your string literal holds raw UTF-8 bytes rather than Unicode text, so you should prefix it with b to make it a bytes literal:

b = b"Aur\xc3\xa9lien"
print(b.decode('utf-8'))

This gives the expected result: 'Aurélien'.
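
A small sketch of the difference between the two literals, in case it helps: in a str the escapes \xc3 and \xa9 are two separate characters (U+00C3 and U+00A9), while in a bytes literal they are two raw bytes that together form the UTF-8 encoding of é.

```python
s = "Aur\xc3\xa9lien"   # str: \xc3 and \xa9 are the characters U+00C3 and U+00A9
b = b"Aur\xc3\xa9lien"  # bytes: \xc3 and \xa9 are raw byte values

decoded = b.decode("utf-8")  # the two bytes collapse into the single character é

print(type(s).__name__, len(s))
print(type(b).__name__, len(b))
print(decoded, len(decoded))
```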

If you want to keep using s, encode it with latin-1 first. latin-1 maps the code points 0 to 255 to the same byte values one to one, so it recovers the original bytes from your string exactly. Other 8-bit codecs such as mbcs (Windows only) or mac_roman are not guaranteed to do this, because their tables can differ from latin-1 above 0x7F. Once you have the bytes object, you can decode it as in the first part of this answer.
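
As a rough check of that one-to-one mapping, the sketch below round-trips all 256 byte values through latin-1 and then tries the repair with two codecs. The repair helper is hypothetical, written only for this illustration. Note that the codec choice can affect the result: mac_roman places Ã at a different byte value than latin-1 does, so it does not reproduce the original bytes here.

```python
def repair(garbled, codec):
    """Hypothetical helper: re-encode with `codec`, then decode as UTF-8.

    Returns None if either step raises a Unicode error.
    """
    try:
        return garbled.encode(codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None

# latin-1 maps every byte value to the code point with the same number,
# so decoding and re-encoding gives back exactly the same bytes.
all_bytes = bytes(range(256))
roundtrip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

s = "Aur\xc3\xa9lien"
via_latin1 = repair(s, "latin-1")       # recovers the intended text
via_mac_roman = repair(s, "mac_roman")  # different bytes, so not the intended text

print(roundtrip_ok)
print(via_latin1)
print(repr(via_mac_roman))
```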


 