My system is Python 3.3 on Windows 7.
The file c:\test_before is encoded in GBK. To test, you can download it from the link below and save it as c:\test_before.
http://pan.baidu.com/s/1i3DSuKd
Every line is printed correctly when I set chcp 936:
cname="c:\\test_before"
dat=open(cname,"r")
for line in dat.readlines():
    print(line)
Now I convert the file to UTF-8 with Python:
cname="c:\\test_before"
dat=open(cname,"rb")
new=open("c:\\test_utf-8","wb")
for line in dat.readlines():
    line=line.decode("gbk").encode("utf-8")
    new.write(line)
new.close()
Then I set chcp 65001 and run this:
new=open("c:\\test_utf-8","r")
for line in new.readlines():
    print(line)
Why do I get the wrong output?
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 370: illegal multibyte sequence.
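Python most likely does not pick up the temporary code page change made with the chcp command when you call open. Without an encoding argument, open falls back to locale.getpreferredencoding(False), which on Windows is the system ANSI code page rather than the console code page; the 'gbk' codec named in your traceback suggests that is GBK on your machine. You can check which encoding was picked by inspecting the file object (the output below is from my system):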
>>> fd = open('/tmp/somefile.txt', 'r')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>
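To see which encoding open would choose without opening a file at all, a minimal sketch along these lines may help. The function name default_text_encoding is only illustrative, and the remark about sys.stdout.encoding following the console is my assumption about a typical Windows console session:

```python
import locale
import sys

def default_text_encoding():
    # open() in text mode falls back to this value when no
    # encoding argument is given.
    return locale.getpreferredencoding(False)

# Encoding used by open(path, "r") when encoding is omitted.
print("open() default:", default_text_encoding())

# Encoding of the console stream; on a Windows console this is
# usually the value that reflects chcp (assumption, may be None
# when stdout is replaced by something unusual).
print("stdout encoding:", sys.stdout.encoding)
```

If the two values differ after running chcp 65001, that would explain why print works with the console code page while open still decodes with GBK.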
You can override this in Python 3 by passing the encoding explicitly:
>>> fd = open('/tmp/somefile.txt', 'r', encoding='UTF-8')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>
Making the encoding parameter explicit is probably what you want.
You can also open the write side without binary mode (I saw you specified 'wb'). Just use 'w' and be explicit about the target encoding when you are converting between encodings.
>>> fd2 = open('/tmp/write.txt', 'w', encoding='UTF-8')
>>> fd2.write(u'abcd話')
5
Note that in text mode write returns the number of characters written, not the number of bytes.
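As a small illustration of the difference between characters and bytes, the same string can be measured both ways. The variable names here are only for the example:

```python
text = "abcd話"

# Number of characters, which is what a text-mode write() reports.
char_count = len(text)

# Number of bytes the same text occupies once encoded as UTF-8:
# four ASCII letters take one byte each, the CJK character takes three.
utf8_byte_count = len(text.encode("utf-8"))

print(char_count, utf8_byte_count)  # prints: 5 7
```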
To complete your conversion, you can do something like:
cname = "c:\\test_before"
dat = open(cname, "r", encoding="gbk")
new = open("c:\\test_utf-8", "w", encoding="utf-8")
for line in dat.readlines():
    new.write(line)
new.close()
Finally, you should open the files with a context manager (the with statement), which is more consistent and saves you from having to close them yourself, even in this trivial use case. Your code would look something like this:
def gbk_to_utf8(source, target):
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src.readlines():
                dst.write(line)
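A self-contained way to try this out without the original file is sketched below. It repeats the conversion function and runs it on a temporary GBK file. The sample text and the roundtrip_demo helper are made up for the example, and passing newline="" is my addition so that line endings are copied unchanged on Windows:

```python
import os
import tempfile

def gbk_to_utf8(source, target):
    # Decode the source as GBK and re-encode it as UTF-8, line by line.
    # newline="" (an addition for this sketch) disables newline translation.
    with open(source, "r", encoding="gbk", newline="") as src:
        with open(target, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)

def roundtrip_demo():
    # Hypothetical sample text; any GBK-encodable string would do.
    sample = "第一行\n第二行\n"
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "test_before")
        target = os.path.join(tmp, "test_utf-8")
        # Write the sample as raw GBK bytes, like the original file.
        with open(source, "wb") as f:
            f.write(sample.encode("gbk"))
        gbk_to_utf8(source, target)
        # Read the result back as raw bytes and decode as UTF-8.
        with open(target, "rb") as f:
            return f.read().decode("utf-8")

print(roundtrip_demo() == "第一行\n第二行\n")
```

When reading the converted file back in text mode, pass encoding="utf-8" to open as well; otherwise the default encoding is used again and the same UnicodeDecodeError can occur.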