Python encoding problem when reading but not when typing

Question

I'm reading some strings from a text file. Some of these strings contain "strange" characters, e.g. "\xc3\xa9comiam". If I copy that string and paste it into a variable, I can convert it to readable characters:

string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam

but if I read it from the file, it doesn't work:

with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam

It seems the solution must be pretty easy, but I can't find it. What can I do?

Thanks!

Answer 1

Those are not raw-unicode-escape sequences. As the name suggests, raw_unicode_escape handles Unicode escapes like \u00e9, but not byte escapes like \xe9.

What you have is a UTF-8 encoded sequence. The way to decode that is to get it into a bytes sequence, which can then be decoded to a Unicode string.

# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))

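The 'latin-1' trick is a dirty secret: Latin-1 maps every character with a code below 256 to the byte with the same value (and every byte back to the character with the same code), so encoding with it simply turns the characters into the bytes they stand for.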
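
To make the one-to-one mapping concrete, here is a minimal sketch that compares the character codes of the string with the byte values produced by the Latin-1 encoding. The variable names are just for illustration.

```python
# "\xc3" and "\xa9" are the characters U+00C3 and U+00A9, followed by "comiam".
s = "\xc3\xa9comiam"

# Latin-1 turns each character into the single byte with the same value.
raw = s.encode("latin-1")
same_values = [ord(c) for c in s] == list(raw)

# Those bytes are the UTF-8 encoding of the text we actually want.
text = raw.decode("utf-8")
print(same_values, text)
```

This only works while every character in the string has a code below 256; anything above that raises UnicodeEncodeError when encoding to Latin-1.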

For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
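
A rough sketch of the binary-mode variant, assuming the file really contains the UTF-8 bytes 0xC3 0xA9 for "é". The temporary file here is a hypothetical stand-in for your fn. If the file instead contains a literal backslash followed by "xc3" as plain text, the escapes have to be interpreted first, and decoding the bytes alone will not be enough.

```python
import os
import tempfile

# Hypothetical sample file holding the real UTF-8 bytes for "écomiam".
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("écomiam\n".encode("utf-8"))
    fn = tmp.name

# Binary mode yields bytes objects, so each line can be decoded directly.
with open(fn, "rb") as f:
    lines = [raw.decode("utf-8").rstrip("\n") for raw in f]

os.remove(fn)
print(lines)
```

In this situation, opening the file in text mode with open(fn, encoding="utf-8") should give the same result with less code.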

Answer 2

Thanks everyone for your help,

I think I've found a solution (not very elegant, but it does the trick).

print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
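
For anyone reading along, here is the same chain split into steps. It assumes tm is a line read from the file that contains the escapes as literal text (a backslash, then "x", "c", "3", and so on), which is what the output in the question suggests. The intermediate names are only for illustration.

```python
# A raw string keeps the backslashes literal, like the line read from the file.
tm = r"\xc3\xa9comiam" + "\n"

# 1. Turn the text into bytes; the backslashes are still literal here.
as_bytes = bytes(tm.strip(), "utf-8")

# 2. Interpret the \xNN escapes: each becomes the character with that code.
unescaped = as_bytes.decode("unicode_escape")

# 3. Map those characters back to single bytes with the same values.
utf8_bytes = unescaped.encode("raw_unicode_escape")

# 4. Decode the resulting bytes as UTF-8.
text = utf8_bytes.decode("utf-8")
print(text)
```

After step 2 the string is the same one you get by typing "\xc3\xa9comiam" directly into the source, which is why the pasted version worked and the file version did not.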

Thanks!

2 answers

solution1
0 2019-04-05 15:31:51

solution2
0 ACCEPTED 2019-04-08 09:32:26
