
Convert a file from GBK to UTF-8 and display it in the console

My system is Python 3.3 on Windows 7.
The file c:\test_before is encoded in GBK. To test, you can download it from the link below and save it as c:\test_before.
http://pan.baidu.com/s/1i3DSuKd
When I set chcp 936, every line is printed correctly.

cname="c:\\test_before"
dat=open(cname,"r")
for line in dat.readlines():
    print(line)

Now I convert the file to UTF-8 with Python.

cname="c:\\test_before"
dat=open(cname,"rb")
new=open("c:\\test_utf-8","wb")
for line in dat.readlines():
    line=line.decode("gbk").encode("utf-8")
    new.write(line)

new.close()

When I set chcp 65001 and run this:

new=open("c:\\test_utf-8","r")
for line in new.readlines():
    print(line)

Why do I get an error instead of the expected output?
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 370: illegal multibyte sequence.

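It's quite possible that Python does not pick up the temporary code page change made with the chcp command. chcp only changes the console code page, while open takes its default encoding from the system locale, so it would not use the encoding you expect. The 'gbk' codec named in your traceback points the same way. You can check this yourself by inspecting the file object (the output below is from my machine; on yours it will most likely report cp936, Python's alias for GBK):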

>>> fd = open('/tmp/somefile.txt', 'r')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>
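Another way to see which encodings are in play, without opening a file at all, is to ask Python directly. This is only a sketch: to my knowledge locale.getpreferredencoding(False) is what open falls back to in text mode, and sys.stdout.encoding is what print uses for the console, but the values reported depend entirely on your system. The variable names are just for illustration.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when no encoding= is given.
default_encoding = locale.getpreferredencoding(False)

# Encoding used by print() when writing to the console; this can differ
# from the default above (it may be None if stdout is not a text stream).
console_encoding = sys.stdout.encoding

print("open() default:", default_encoding)
print("console output:", console_encoding)
```

If the first value is still a GBK code page after chcp 65001, that would explain why the UTF-8 file is being decoded as GBK.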

You can of course override this in Python 3 by passing the encoding explicitly:

>>> fd = open('/tmp/somefile.txt', 'r', encoding='UTF-8')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>

Passing the encoding parameter explicitly is probably what you want.
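To make the difference concrete, here is a small sketch that writes a short UTF-8 file and reads it back twice, once with the wrong codec and once with the right one. The temporary file is a hypothetical stand-in for c:\test_utf-8, and the sample text is my own; your real file may fail at a different position or, for some byte sequences, decode silently into garbage instead of raising.

```python
import os
import tempfile

text = "中\n"  # sample text chosen for illustration, not taken from your file

# Write the sample as UTF-8 into a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

# Reading UTF-8 bytes with the gbk codec: this sample cannot be decoded.
try:
    with open(path, "r", encoding="gbk") as f:
        wrong = f.read()
except UnicodeDecodeError as exc:
    wrong = None
    print("gbk failed:", exc.reason)

# Reading with the encoding the file was actually written in.
with open(path, "r", encoding="utf-8") as f:
    right = f.read()

os.remove(path)
print(right == text)
```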

Also, you can open the write side without binary mode (I saw you specified 'wb'). Just use 'w' and be explicit about the target encoding when you are converting between encodings.

>>> fd2 = open('/tmp/write.txt', 'w', encoding='UTF-8')
>>> fd2.write(u'abcd話')
5

Note that in text mode write returns the number of characters written, not the number of bytes.
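The two counts differ as soon as the text contains non-ASCII characters. A quick sketch, using the same sample string as above (the variable names are mine):

```python
text = "abcd話"

# Number of characters: this is what write() reports in text mode.
char_count = len(text)

# Number of bytes the same text occupies once encoded as UTF-8.
byte_count = len(text.encode("utf-8"))

print(char_count, byte_count)  # 5 7
```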

To complete your conversion, you can do something like this:

cname = "c:\\test_before"
dat = open(cname, "r", encoding="gbk")
new = open("c:\\test_utf-8", "w", encoding="utf-8")
for line in dat.readlines():
    new.write(line)

dat.close()
new.close()

Finally, you should use the file object's context manager, which is more consistent and saves you from having to close the files yourself. In this simple case your code would look something like this:

def gbk_to_utf8(source, target):
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)
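If you want to check the conversion without touching your real files, a round trip through a temporary directory might look like the sketch below. The sample text, the temporary paths, and the newline="" arguments are my additions: newline="" keeps the original line endings byte-for-byte instead of letting Python translate them, which may or may not be what you want.

```python
import os
import tempfile


def gbk_to_utf8(source, target):
    # newline="" disables newline translation on both sides, so line
    # endings in the source are carried over unchanged.
    with open(source, "r", encoding="gbk", newline="") as src, \
         open(target, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical sample data standing in for the real c:\test_before.
sample = "你好世界\r\n第二行\r\n"

tmpdir = tempfile.mkdtemp()
source_path = os.path.join(tmpdir, "test_before")
target_path = os.path.join(tmpdir, "test_utf-8")

# Create a GBK-encoded source file.
with open(source_path, "wb") as f:
    f.write(sample.encode("gbk"))

gbk_to_utf8(source_path, target_path)

# Read the result as raw bytes and decode it as UTF-8 to verify.
with open(target_path, "rb") as f:
    converted = f.read()

print(converted.decode("utf-8") == sample)
```

To display the converted file, open it with encoding="utf-8" as well; whether the characters render correctly then depends on the console's own code page and font.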
