
Why does csv.reader fail when the file size is larger than 40K bytes?

I have the following code:

import csv

with open(filename, 'rt') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)

Whenever the file size is less than 40K bytes, the program works great. When the file size crosses 40K, I get this error while trying to read the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte

The actual file content doesn't seem to be a problem, only the size of the file itself (40K bytes is really tiny).

When the file size is greater than 40K bytes, the error always happens on the line that contains the 32K-th byte.

I have a feeling that Python fails to read a file of more than 40K bytes without raising an exception, and just silently truncates it around the 32K-th byte. Is that correct? Where is this limit defined?

You have invalid UTF-8 data in your file. This has nothing to do with the csv module, nor with the size of the file; your larger file has invalid data in it, your smaller file does not. Simply doing:

with open(filename) as f:
    f.read()

should trigger the same error, and it's purely a matter of encountering an invalid UTF-8 byte, which indicates your file either wasn't UTF-8 to start with, or has been corrupted in some way. (Note that the position in the error, 7206, is an offset into the chunk of bytes Python's buffered text reader was decoding at that moment, not into the whole file, which is why the failure looks tied to file size rather than content.)
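The point above can be demonstrated without any file or csv involvement at all; a minimal sketch, using a made-up byte string as a stand-in for the file's contents (0xa0 is the byte from the traceback, which is never a valid UTF-8 start byte):

```python
# Decoding the raw bytes directly reproduces the error; e.start gives the
# offset of the first undecodable byte within the bytes being decoded.
raw = b'name,city\nJos\xa0,Paris\n'  # hypothetical file contents

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.start, hex(raw[e.start]))  # -> 13 0xa0
```

Running the same decode on your actual file's bytes (opened with 'rb') would locate the real offending offset.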

If your file is actually in a different encoding (e.g. latin-1, cp1252, etc.; the file command line utility might help with identification, but for many ASCII superset encodings you just have to know), pass that as the encoding argument to open instead of the locale default (utf-8 in this case), so you can decode the bytes properly, e.g.:

    # Also add newline='' to defer newline processing to csv module, where it's part
    # of the CSV dialect
    with open(filename, encoding='latin-1', newline='') as csvfile:
        csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        for row in csvDictReader:
            print(row)
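When you don't know the encoding, one rough way to narrow it down is to try a few candidates against the raw bytes. A hedged sketch (the candidate list is an assumption, and a clean decode does not prove the encoding is right, since latin-1 accepts every possible byte, so you still need to eyeball the decoded text):

```python
# Try candidate encodings on a byte string containing the 0xa0 byte from
# the traceback and report which ones decode without error.
raw = b'caf\xa0'  # hypothetical stand-in for the file's bytes

ok = []
for enc in ('utf-8', 'latin-1', 'cp1252'):
    try:
        text = raw.decode(enc)
    except UnicodeDecodeError:
        continue
    ok.append(enc)
    print(enc, ascii(text))

# utf-8 fails; latin-1 and cp1252 both succeed (0xa0 is a no-break space
# in both), which is exactly why a successful decode alone proves nothing.
```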

The file size is not the real problem; see the exception:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte

You should handle the encoding problem first:

with open(filename, 'rt', encoding='utf-8', errors='ignore') as csvfile:

which will ignore the encoding errors. Be aware that this silently drops the undecodable bytes, so data can disappear from the affected fields; fixing the encoding, as in the other answer, is usually preferable.
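A minimal sketch of what errors='ignore' does at the bytes level, using a made-up byte string (the same 0xa0 byte as in the traceback); errors='replace' is shown for contrast, since it keeps a visible U+FFFD marker instead of losing data silently:

```python
# errors='ignore' drops the undecodable byte entirely;
# errors='replace' substitutes U+FFFD (the replacement character).
raw = b'Jos\xa0,Paris'

print(raw.decode('utf-8', errors='ignore'))         # -> Jos,Paris
print(ascii(raw.decode('utf-8', errors='replace'))) # -> 'Jos\ufffd,Paris'
```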
