Change a file from GBK to UTF-8 and display it in the console

Question

My system is Python 3.3 on Windows 7.
The file c:\test_before is encoded in GBK. To test, you can download it from here and save it as c:\test_before:
http://pan.baidu.com/s/1i3DSuKd
Every line is printed correctly when I set chcp 936.

cname="c:\\test_before"
dat=open(cname,"r")
for line in dat.readlines():
    print(line)

Now I convert the file to UTF-8 with Python.

cname="c:\\test_before"
dat=open(cname,"rb")
new=open("c:\\test_utf-8","wb")
for line in dat.readlines():
    line=line.decode("gbk").encode("utf-8")
    new.write(line)

new.close()

Then I set chcp 65001 and run this:

new=open("c:\\test_utf-8","r")
for line in new.readlines():
    print(line)

Why do I get this error instead of the file's contents?
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 370: illegal multibyte sequence.

Answer 1

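Python most likely does not pick up the temporary code page change made with the chcp command, so open() does not use the encoding you expect. When no encoding is given, open() falls back to the locale's preferred encoding, not the console code page. On your machine that is presumably GBK (code page 936), which matches the 'gbk' codec named in your traceback. You can check which encoding was chosen yourself: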

>>> fd = open('/tmp/somefile.txt', 'r')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>
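
To see why chcp has no effect on open(), it can help to print both encodings side by side. The following is only a sketch: describe_encodings is a hypothetical helper name, and the values it reports depend entirely on the machine and Python version it runs on.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python picks by default (machine dependent)."""
    return {
        # What open() uses in text mode when encoding= is omitted
        # (this is the documented behaviour for Python 3.3).
        "open_default": locale.getpreferredencoding(False),
        # What print() uses when writing to the console or stream.
        "stdout": sys.stdout.encoding,
    }


info = describe_encodings()
for name, value in info.items():
    print(name, "=", value)
```

If the two values differ after running chcp 65001, that would support the explanation above: the console changed, but the default used by open() did not.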

You can override this in Python 3 by passing the encoding explicitly:

>>> fd = open('/tmp/somefile.txt', 'r', encoding='UTF-8')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>

Making the encoding parameter more explicit is probably what you want.
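
The error in the question can be reproduced in memory, without any files, by decoding UTF-8 bytes with the GBK codec. This is a small illustration using a made-up sample string, not the contents of the original file.

```python
text = "abcd話"
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong codec: the three UTF-8 bytes of the last
# character cannot be split into complete two-byte GBK sequences.
try:
    utf8_bytes.decode("gbk")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

# Decoding with the codec that was used for encoding works.
roundtrip = utf8_bytes.decode("utf-8")

print(mismatch_failed, roundtrip == text)
```

This is what happens implicitly when a UTF-8 file is opened without encoding= on a system whose default is GBK.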

You can also open the write side without binary mode (I saw you specified 'wb'). Just use 'w' and be explicit about the target encoding when you are converting between encodings.

>>> fd2 = open('/tmp/write.txt', 'w', encoding='UTF-8')
>>> fd2.write(u'abcd話')
5

Note that write() returns the number of characters written, not the number of bytes.
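
The difference between characters and bytes can be checked with an in-memory buffer. This is a sketch only; count_written is a hypothetical helper, and io.BytesIO stands in for a real file.

```python
import io


def count_written(text, encoding):
    """Write text through a text-mode wrapper; return (characters, bytes)."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding=encoding)
    chars = wrapper.write(text)  # the value a text-mode write() returns
    wrapper.flush()              # push the encoded bytes into the buffer
    size = len(raw.getvalue())   # bytes that would end up on disk
    return chars, size


utf8_counts = count_written("abcd話", "utf-8")
gbk_counts = count_written("abcd話", "gbk")
print(utf8_counts, gbk_counts)
```

The character count is 5 in both cases, while the byte count depends on the encoding: the last character takes three bytes in UTF-8 and two in GBK.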

To complete the conversion, you can do something like:

cname = "c:\\test_before"
dat = open(cname, "r", encoding="gbk")
new = open("c:\\test_utf-8", "w", encoding="utf-8")
for line in dat.readlines():
    new.write(line)

dat.close()
new.close()

Finally, you should use the file objects as context managers. That is more consistent and saves you from having to close the files yourself. For this simple use case, your code would look something like this:

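def gbk_to_utf8(source, target):
    # Decode as GBK on read, encode as UTF-8 on write; both files
    # are closed automatically when the with blocks exit.
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)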
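
A possible way to try the function end to end, including the read-back step from the question, is sketched below. It assumes a temporary directory and made-up sample text instead of the real c:\test_before, and it repeats the function so the snippet runs on its own.

```python
import os
import tempfile


def gbk_to_utf8(source, target):
    # Same approach as above: decode as GBK, re-encode as UTF-8.
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)


sample = "abcd話\nsecond line\n"  # made-up test data

with tempfile.TemporaryDirectory() as tmp:
    before = os.path.join(tmp, "test_before")
    after = os.path.join(tmp, "test_utf-8")

    # Create a GBK-encoded input file.
    with open(before, "w", encoding="gbk") as f:
        f.write(sample)

    gbk_to_utf8(before, after)

    # Read the result back with an explicit encoding, so the
    # platform default (GBK on the asker's machine) is never used.
    with open(after, "r", encoding="utf-8") as f:
        converted = f.read()

print(converted == sample)
```

The key point for the original problem is the last open() call: passing encoding="utf-8" when reading the converted file avoids the UnicodeDecodeError regardless of the chcp setting.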

solution1
0 ACCEPTED 2014-03-19 07:56:55
