Converting binary stored Unicode Chinese Characters back to Unicode using Python 3

Question

I'm working from an OpenOffice-produced .csv with mixed Roman and Chinese characters. This is an example of one row, as printed by my program:

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'b'Open heart 'b'Happy '

The following section contains two Chinese characters stored as bytes, which I would like displayed as Chinese characters on the command line by a very basic Python 3 program (see bottom). How do I do this?

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'

When I open the .csv in OpenOffice I need to select "Chinese Simplified (EUC-CN)" as the character set, if that helps. I have searched extensively, but I do not understand Unicode and the pages I found do not make sense to me.

import csv
f = open('Chinese.csv', encoding="utf-8") 
file = csv.reader(f)

for line in file:
    for word in line:
        print(word.encode('utf-8'), end='')
    print("\n")

Thank you in advance for any suggestions.

Answer 1

Thanks to a suggestion by @eryksun, I solved my issue by re-encoding the source file from ASCII to UTF-8. The question is different, but the solution is here:

http://www.stackoverflow.com/a/542899/792015
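
For readers who would rather do the re-encoding from Python than from an editor, here is a minimal sketch. The function name `reencode_to_utf8` and its parameters are made up for illustration, and it assumes you already know the file's current encoding (for example, Python's `gb2312` codec for EUC-CN). It is not necessarily what the linked answer does.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding):
    """Read src as text in src_encoding and write the same text to dst as UTF-8.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # Decode the original bytes into str using the encoding the file is really in.
    text = Path(src).read_text(encoding=src_encoding)
    # Encode that str back to bytes, this time as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
```

Calling something like `reencode_to_utf8('Chinese.csv', 'Chinese_utf8.csv', 'gb2312')` would leave the original untouched and write a UTF-8 copy. If the guessed source encoding is wrong, the read step will either raise `UnicodeDecodeError` or silently produce the wrong characters, so it is worth checking the result.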

Alternatively, if you are using Eclipse, you can paste a non-Roman character (such as the Chinese character 大) into your source code and save the file. If the source is not already UTF-8, Eclipse will offer to change it for you.

Thank you for all your suggestions and my apologies for answering my own question.

Footnote: If anyone knows why changing the source file's encoding affects the compiled program, I would love to know. According to https://docs.python.org/3/tutorial/interpreter.html, the interpreter treats source files as UTF-8 by default.
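
As a side note on the output shown in the question: `csv.reader` already returns each cell as a `str` (the file was opened with `encoding="utf-8"`), so the `b'...'` literals come from the `.encode('utf-8')` call inside `print`. The sketch below uses the two byte strings from the question to show the round trip; the variable names are only illustrative.

```python
# The two byte strings from the question's output.
raw_hanzi = b'\xe5\xbc\x80\xe5\xbf\x83'
raw_pinyin = b'K\xc4\x81i x\xc4\xabn'

# bytes.decode() turns UTF-8 bytes back into str.
hanzi = raw_hanzi.decode('utf-8')
pinyin = raw_pinyin.decode('utf-8')

# str.encode() is the inverse, which is why print(word.encode('utf-8'))
# shows b'...' literals instead of the characters themselves.
assert hanzi.encode('utf-8') == raw_hanzi
assert pinyin.encode('utf-8') == raw_pinyin
```

Here `hanzi` is the two-character string 开心 and `pinyin` is "Kāi xīn". Printing them directly, or printing `word` rather than `word.encode('utf-8')` in the original loop, should display the characters, provided the terminal's own encoding and font can represent them.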
