
Python encoding problem when reading but not when typing

I'm reading some strings from a text file. Some of these strings contain "strange" characters, e.g. "\xc3\xa9comiam". If I copy that string and paste it into a variable, I can convert it to readable characters:

string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam

but if I read it from the file, it doesn't work:

with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam

It seems the solution must be pretty easy, but I can't find it. What can I do?

Thanks!

Those are not Unicode escape sequences. As the name suggests, raw_unicode_escape handles Unicode escapes of the \uXXXX form, but not \xe9.

What you have is a UTF-8 encoded sequence. The way to decode that is to get it into a bytes sequence, which can then be decoded to a Unicode string.

# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))

The 'latin-1' trick is a dirty secret: it simply converts every character (code points 0 to 255) to the byte with the same numeric value.
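To make the 'latin-1' step concrete, here is a minimal sketch of the round trip. It assumes the string already holds the characters U+00C3 and U+00A9 (as the pasted literal does), not literal backslashes; the variable names `raw` and `fixed` are only for illustration.

```python
# Mojibake string: each of the first two characters stands for one byte
# of the UTF-8 encoding of "é".
s = "\xc3\xa9comiam"

# Encoding to Latin-1 turns each character into the byte with the same value.
raw = s.encode("latin-1")   # equivalent to bytes(s, "latin-1")
print(raw)

# Every character's code point equals the byte it was mapped to.
assert all(ord(ch) == b for ch, b in zip(s, raw))

# Latin-1 covers all 256 byte values one-to-one, so the round trip is lossless.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# The recovered bytes are valid UTF-8 and decode to the intended text.
fixed = raw.decode("utf-8")
print(fixed)
```

Because Latin-1 never fails on byte values and never changes them, it works as a neutral way to move between str and bytes when the str was produced by decoding with the wrong codec.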

For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
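Both options can be sketched as follows. This assumes the file really contains the raw UTF-8 bytes C3 A9 (if it contains literal backslash text such as \xc3, neither option applies as-is). The temporary file and the names `from_binary`, `repaired`, and `direct` are made up for the demonstration; the third variant, passing encoding="utf-8" to open(), is an addition here and not part of the answer above.

```python
import os
import tempfile

# Hypothetical sample file holding the UTF-8 bytes for "écomiam".
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xc3\xa9comiam\n")
    fn = tmp.name

try:
    # Option 1: open in binary mode and decode each line yourself.
    with open(fn, "rb") as f:
        from_binary = [line.decode("utf-8").rstrip("\n") for line in f]

    # Option 2: the text was read with the wrong codec (Latin-1 here),
    # so apply the same conversion to each string that was read.
    with open(fn, encoding="latin-1") as f:
        repaired = [bytes(line, "latin-1").decode("utf-8").rstrip("\n")
                    for line in f]

    # Addition: tell open() the real encoding so no repair is needed.
    with open(fn, encoding="utf-8") as f:
        direct = [line.rstrip("\n") for line in f]
finally:
    os.remove(fn)

print(from_binary, repaired, direct)
```

Option 2 only recovers the text when the wrong codec was Latin-1 (or another codec that maps those bytes to the same code points), which is why naming the encoding when opening the file is usually the more robust choice.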

Thanks everyone for your help,

I think I've found a solution (not very elegant, but it does the trick).

print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
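For anyone wondering why that chain works, here is a step-by-step sketch. It assumes each line in the file holds the literal characters backslash, x, c, 3 and so on (14 characters for this example), which would explain why the pasted literal behaved differently from the file contents. The names `step1` to `step4` are only for illustration.

```python
# Assumption: this is what one line read from the file looks like,
# with real backslash characters rather than the bytes C3 A9.
tm = "\\xc3\\xa9comiam\n"
assert len(tm.strip()) == 14

# 1. Turn the text into bytes; it is pure ASCII, so nothing changes yet.
step1 = bytes(tm.strip(), "utf-8")

# 2. unicode_escape interprets \xc3 and \xa9 as the characters U+00C3, U+00A9.
step2 = step1.decode("unicode_escape")
assert len(step2) == 8

# 3. Map those characters back to the bytes with the same values.
#    For code points below 256 this matches encoding with "latin-1".
step3 = step2.encode("raw_unicode_escape")
assert step3 == step2.encode("latin-1")

# 4. The bytes are now genuine UTF-8 and decode to the intended text.
step4 = step3.decode("utf-8")
print(step4)
```

Steps 3 and 4 are the same conversion as in the answer above; step 2 is the extra piece needed when the escapes are stored as plain text in the file.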

Thanks!
