I'm reading some strings from a text file. Some of them contain "strange" escape sequences, e.g. "\\xc3\\xa9comiam". If I copy such a string and paste it into a variable, I can convert it to readable characters:
string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam
but if I read it from the file, it doesn't work:
with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam
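A quick check with repr() shows what readlines() actually returns (a diagnostic sketch, assuming the file holds the backslashes literally):

```python
pasted = "\xc3\xa9comiam"       # the parser interprets the escapes: 8 characters
from_file = "\\xc3\\xa9comiam"  # what readlines() returns: literal backslashes, 14 characters
print(repr(pasted))             # 'Ã©comiam'
print(repr(from_file))          # '\xc3\xa9comiam'
```

So in the file the backslashes are ordinary characters, which is why the same encode/decode call leaves them untouched.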
It seems the solution must be pretty easy, but I can't find it. What can I do?
Thanks!
Those are not unicode-escape sequences; as the name suggests, raw_unicode_escape handles Unicode escapes like \u00e9, but not \xe9.
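The difference between the two codecs can be seen in a minimal sketch:

```python
# raw_unicode_escape resolves \uXXXX escapes but leaves \xNN alone...
print(b"\\u00e9".decode("raw_unicode_escape"))  # é
print(b"\\xe9".decode("raw_unicode_escape"))    # \xe9 (unchanged)
# ...while unicode_escape resolves both kinds of escape:
print(b"\\xe9".decode("unicode_escape"))        # é
```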
What you have is a UTF-8 encoded sequence. The way to decode that is to get it into a bytes
sequence which can then be decoded to a Unicode string.
# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))
The 'latin-1'
trick is a bit of a dirty secret: it simply maps each character to the byte with the same code (and, when decoding, each byte back to the same character).
For your file, you could open it in binary mode so you don't have to explicitly convert the strings to bytes
, or you could simply apply the same conversion to the strings you read.
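Both options can be sketched like this (assuming a hypothetical file mojibake.txt that holds the actual UTF-8 bytes of "écomiam"; if the file holds literal \xNN text, as in the question, you need a unicode_escape step instead):

```python
# Set up the example file with raw UTF-8 bytes.
with open("mojibake.txt", "wb") as f:
    f.write("écomiam\n".encode("utf-8"))

# Option 1: binary mode -- you get bytes and decode them yourself.
with open("mojibake.txt", "rb") as f:
    for raw in f:
        print(raw.rstrip(b"\n").decode("utf-8"))  # écomiam

# Option 2: text mode read as latin-1, repaired with the same latin-1 round-trip.
with open("mojibake.txt", encoding="latin-1") as f:
    for line in f:
        print(bytes(line.rstrip("\n"), "latin-1").decode("utf-8"))  # écomiam
```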
Thanks everyone for your help,
I think I've found a solution (not very elegant, but it does the trick):
print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
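Step by step, that chain does the following (a sketch on a hypothetical line tm; note that unicode_escape treats its input as latin-1, so this can mangle lines that also contain real non-ASCII text):

```python
tm = "\\xc3\\xa9comiam\n"                   # as read from the file: literal backslashes
step1 = bytes(tm.strip(), "utf-8")          # b'\\xc3\\xa9comiam': back to bytes
step2 = step1.decode("unicode_escape")      # 'Ã©comiam': the \xNN escapes are resolved
step3 = step2.encode("raw_unicode_escape")  # b'\xc3\xa9comiam': characters to bytes, 1:1
print(step3.decode("utf-8"))                # écomiam
```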
Thanks!