'utf-8' codec can't decode byte when reading a file in Python3.4, but not in Python2.7

Question

I was trying to read a file in Python2.7, and it was read perfectly. The problem is that when I run the same program in Python3.4, this error appears:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program on Windows (with Python3.4), the error doesn't appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

and the code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
       if line == '\n': continue
       D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

Answer 1

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.

In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects. These are bytes decoded according to some encoding. When no encoding is given, open() uses the platform's preferred locale encoding, which is utf-8 on most Linux systems but usually a legacy code page on Windows. That would also explain why the error does not appear on Windows.
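To see which encoding open() falls back to on a given machine, the locale setting can be queried directly. This is only a quick check, and the variable name default_encoding is made up for illustration:

```python
import locale

# Encoding that open() uses in text mode when no encoding= argument is passed.
# The result depends on the operating system and locale settings.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Running this on both the Linux and the Windows machine should show whether the two systems decode the file differently.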

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows that Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
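The same kind of failure can be reproduced without a file. The sketch below uses made-up sample bytes (not taken from the asker's file) in which the byte 0xf2 stands for a single accented letter, as it would in a single-byte encoding:

```python
# Hypothetical sample: "Colònia;Nom" saved as Latin-1, where ò is the byte 0xf2.
raw = b"Col\xf2nia;Nom\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # In utf-8, 0xf2 starts a multi-byte sequence, but the next byte ("n")
    # is not a continuation byte, so decoding fails.
    error = exc
    print(exc)

# Single-byte encodings accept every byte, but they disagree on its meaning.
as_latin1 = raw.decode("latin-1")
as_cp1250 = raw.decode("cp1250")
print(as_latin1, as_cp1250)
```

Both single-byte decodings succeed, yet they produce different characters for 0xf2, so a decode that does not raise is not proof that the encoding is the right one.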

To fix the problem you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:
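Applied to the function from the question, this could look roughly like the sketch below. The extra encoding parameter, the stripping of the trailing newline, and the demo rows are assumptions added for illustration, not part of the original program:

```python
import os
import tempfile

def lectdict(filename, colkey, colvalue, encoding):
    """Map each value in column colkey to the list of values in column colvalue."""
    result = {}
    # The with block closes the file even if a line cannot be decoded.
    with open(filename, encoding=encoding) as f:
        for line in f:
            if line == '\n':
                continue
            # Unlike the original, strip the newline so the last column is clean.
            fields = line.rstrip('\n').split(';')
            result.setdefault(fields[colkey], []).append(fields[colvalue])
    return result

# Demo on a throwaway file with made-up rows, written as Latin-1.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write("Codi;Codi_lloc_anonim;Nom\n1;A1;Colònia\n\n2;A1;Escola\n".encode('latin-1'))
table = lectdict(tmp.name, 1, 2, encoding='latin-1')
os.remove(tmp.name)
print(table)
```

Passing the encoding explicitly makes the function behave the same on every platform instead of depending on the locale default.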

If you do not know the correct encoding, you could run the program below, which simply tries all the encodings known to Python. If you are lucky, there will be an encoding which turns the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
# use a separate name so the imported encodings module is not shadowed
candidates = all_encodings()
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass

Answer 2

OK, I did what @unutbu told me. The result was a long list of encodings, one of which is cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my files.

Answer 3

In my case I can't change the encoding because my file really is UTF-8 encoded. But some rows are corrupted and cause the same error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte

My solution is to open the file in binary mode:

open(filename, 'rb')
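Binary mode returns bytes, so each line still has to be decoded before it can be treated as text. One possible way to do that while tolerating corrupted rows is sketched below; the helper name read_lines_lenient and the sample data are hypothetical, and replacing bad bytes is only one of several policies:

```python
import os
import tempfile

def read_lines_lenient(filename):
    """Yield decoded lines; undecodable bytes become U+FFFD instead of raising."""
    with open(filename, 'rb') as f:
        for raw_line in f:
            yield raw_line.decode('utf-8', errors='replace')

# Demo: a mostly valid UTF-8 file with one stray 0xd0 byte in the second row.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b"ok;row\n" + b"bad\xd0;row\n" + "fin\u00e9\n".encode('utf-8'))
lines = list(read_lines_lenient(tmp.name))
os.remove(tmp.name)
print(lines)
```

If replacing bad bytes is acceptable for the whole file, open(filename, encoding='utf-8', errors='replace') gives a similar result in text mode without the manual decode step.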

Answer 1: score 16, accepted, 2015-03-05 12:38:25
Answer 2: score 0, 2015-03-06 08:21:47
Answer 3: score 0, 2021-11-23 14:46:32
