
'utf-8' codec can't decode byte when reading a file in Python 3.4, but not in Python 2.7

I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I run the same program in Python 3.4, this error appears:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

and the code of my program is:

def lectdict(filename, colkey, colvalue):
    f = open(filename, 'r')
    D = dict()

    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey], []) + [line.split(';')[colvalue]]

    f.close()  # the call needs parentheses; a bare f.close does nothing
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.

In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects. These are bytes decoded according to some encoding. When no encoding is given, open() in Python3 uses the platform's preferred encoding (locale.getpreferredencoding()), which is utf-8 on many Linux and macOS systems but typically a code page such as cp1252 on Windows. That would explain why the error does not appear on Windows.
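The encoding that open() falls back to when none is given comes from the locale, so it can differ between machines. A minimal check, using only the standard library, might look like this (the printed value depends on the platform and locale, so none is shown here):

```python
import locale

# Encoding used by open() in text mode when no encoding= argument is given
# (for Python versions such as 3.4; the value depends on platform and locale).
default_enc = locale.getpreferredencoding(False)
print(default_enc)
```

Running this on both the Linux and the Windows machine is a quick way to confirm whether the two environments decode the file differently.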

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
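The same kind of error can be reproduced without any file. The sketch below uses a made-up byte string (not the asker's data) in which 0xf2 is followed by an ordinary ASCII letter. In utf-8, 0xf2 announces a multi-byte sequence, so the decoder expects continuation bytes after it and fails when it finds none:

```python
# Hypothetical sample: a word stored in a single-byte encoding, containing 0xf2.
raw = b'Hist\xf2ria'

try:
    raw.decode('utf-8')
    message = None
except UnicodeDecodeError as exc:
    # 0xf2 starts a multi-byte utf-8 sequence, but b'r' is not a continuation byte.
    message = str(exc)
    print(message)
```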

To fix the problem you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:
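Applied to the function from the question, this could look roughly like the sketch below. The `encoding` parameter, the temporary sample file and the choice of cp1252 are assumptions made for illustration, not the asker's real data. The sketch also strips the trailing newline from each line, which the original function did not do.

```python
import os
import tempfile
from collections import defaultdict

def lectdict(filename, colkey, colvalue, encoding):
    """Map column `colkey` to the list of `colvalue` entries, decoding with `encoding`."""
    result = defaultdict(list)
    with open(filename, encoding=encoding) as f:  # file is closed automatically
        for line in f:
            if line == '\n':
                continue
            fields = line.rstrip('\n').split(';')  # split once per line
            result[fields[colkey]].append(fields[colvalue])
    return dict(result)

# Hypothetical sample file written as cp1252 (an assumed encoding).
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='cp1252',
                                 delete=False) as tmp:
    tmp.write('Codi;Codi_lloc_anonim;Nom\n1;A1;Hist\u00f2ria\n')
    path = tmp.name

traduccio = lectdict(path, 1, 2, encoding='cp1252')
os.remove(path)
print(sorted(traduccio))
```

Passing the encoding as an argument keeps the function usable for files from different sources, and the `with` block removes the need for an explicit `f.close()`.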

If you do not know the correct encoding, you could run this program to simply try all the encodings known to Python. If you are lucky there will be an encoding which turns the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
# use a separate name so the `encodings` module is not shadowed
candidates = all_encodings()
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
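Decoding without an exception does not mean the encoding is right, because single-byte encodings accept almost any byte. As a small illustration with a made-up byte string (not the asker's file), the byte 0xf2 decodes cleanly under several encodings but yields different characters:

```python
import unicodedata

raw = b'Hist\xf2ria'  # hypothetical bytes containing 0xf2 at index 4

# Decode the same bytes with a few single-byte encodings; none of them raises.
decoded = {enc: raw.decode(enc) for enc in ('cp1250', 'cp1252', 'latin-1')}

for enc, text in decoded.items():
    # Print the Unicode name of the character that 0xf2 turned into.
    print(enc, unicodedata.name(text[4]))
```

Here cp1250 produces "n with caron" while cp1252 and latin-1 produce "o with grave", so only knowledge of the expected language can tell which result is the intended text.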

OK, I did what @unutbu told me. The result was a lot of encodings, one of which is cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my files.

In my case I can't change the encoding, because my file really is UTF-8 encoded. But some rows are corrupted and cause the same error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte

My solution is to open the file in binary mode:

open(filename, 'rb')
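The answer does not say what happens after the file is opened in binary mode. One possible continuation, offered only as a sketch, is to decode each row separately and set aside the rows that fail. The helper name `read_lines_lenient` and the sample data are hypothetical:

```python
import os
import tempfile

def read_lines_lenient(filename, encoding='utf-8'):
    """Return (good_lines, bad_line_numbers); rows that fail to decode are skipped."""
    good, bad = [], []
    with open(filename, 'rb') as f:  # binary mode: no implicit decoding
        for number, raw in enumerate(f, start=1):
            try:
                good.append(raw.decode(encoding).rstrip('\r\n'))
            except UnicodeDecodeError:
                bad.append(number)  # remember which row was corrupted
    return good, bad

# Hypothetical sample: valid UTF-8 rows around one corrupted row.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'a;b\n' + b'c;\xd0d\n' + b'e;f\n')
    path = tmp.name

good, bad = read_lines_lenient(path)
os.remove(path)
print(good, bad)
```

If losing whole rows is not acceptable, passing `errors='replace'` to `open()` in text mode is an alternative that keeps every row and substitutes U+FFFD for the undecodable bytes.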


