Reading CSV lines from a gzip-compressed file in Python

Question

I am trying to parse a gzipped CSV file (where the fields are separated by | characters), to test whether reading the file directly in Python will be faster than zcat file.gz | python in parsing the contents.

I have the following code:

#!/usr/bin/python3

import gzip

if __name__ == "__main__": 
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        split_line = line.split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)

But I get the following error:

$ ./PyZip.py 
Traceback (most recent call last):
  File "./PyZip.py", line 11, in <module>
    split_line = line.split('|')
TypeError: a bytes-like object is required, not 'str'

How can I modify this to read the line and split it properly?

I'm interested mainly in just the 52nd field as delimited by |. The lines in my input file are like the following:

Is there a faster way than what I have for summing all the values in the 52nd field?

Thanks!

Answer 1

You should decode each line before splitting, because gzip.open in mode 'r' reads the decompressed data as bytes, and a bytes object cannot be split with a str separator:

split_line = line.decode('utf-8').split('|')

The code you have for summing all the values in the 52nd field is fine. There's no way to make it fundamentally faster, because every line has to be read and split in order to identify the 52nd field of each line.
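Every line does have to be read, but the per-line work can be trimmed a little. The following is only a sketch, not a benchmarked result: it skips the decode step by splitting the bytes directly with a bytes separator, and it passes maxsplit so that splitting stops once the wanted field is isolated. The function name sum_field_bytes and its parameters are made up for illustration. Note that index 52 (as in the original code) is zero-based, so it actually selects the 53rd field.

```python
import gzip


def sum_field_bytes(path, index=52, sep=b"|"):
    """Sum one integer column of a gzipped, delimiter-separated file without decoding.

    Hypothetical helper for illustration; `index` is zero-based.
    """
    total = 0
    count = 0
    # "rb" yields bytes lines; iterating the file object streams it line by line
    # instead of loading everything into memory as readlines() does.
    with gzip.open(path, "rb") as f:
        for line in f:
            # maxsplit=index + 1 stops splitting once the wanted field is isolated
            fields = line.split(sep, index + 1)
            # int() accepts ASCII digits given as bytes and ignores a trailing newline
            total += int(fields[index])
            count += 1
    return count, total
```

Calling sum_field_bytes('SmallData.DAT.gz') would return the same (count, total) pair that the script prints. Whether this is measurably faster than decoding first depends on the data and is worth timing against zcat file.gz | python.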

Answer 2

Just try decoding the bytes object to a string, i.e.:

line.decode('utf-8')

Updated script:

#!/usr/bin/python3
import gzip

if __name__ == "__main__":
    total = 0
    count = 0

    # Mode 'r' is binary for gzip.open, so each line arrives as bytes
    with gzip.open('SmallData.DAT.gz', 'r') as f:
        for line in f:
            # Decode to str before splitting on the '|' delimiter
            split_line = line.decode("utf-8").split('|')
            total += int(split_line[52])
            count += 1

    print(count, " :: ", total)
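As an alternative to decoding each line by hand, gzip.open can be opened in text mode ('rt'), in which case it decodes the data itself and the lines arrive as str. The sketch below combines that with the standard csv module using '|' as the delimiter. The function name sum_field_text is hypothetical, and UTF-8 is assumed only because both answers decode with it; the real file may use another encoding.

```python
import csv
import gzip


def sum_field_text(path, index=52, delimiter="|", encoding="utf-8"):
    """Sum one integer column of a gzipped, delimiter-separated text file.

    Hypothetical helper for illustration; `index` is zero-based.
    """
    total = 0
    count = 0
    # "rt" opens the gzip stream in text mode, so no explicit .decode() is needed.
    # newline="" is what the csv module recommends for files it reads.
    with gzip.open(path, "rt", encoding=encoding, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            total += int(row[index])
            count += 1
    return count, total


if __name__ == "__main__":
    import sys

    # Usage: pass the .gz path as the first argument, e.g. SmallData.DAT.gz
    if len(sys.argv) > 1:
        count, total = sum_field_text(sys.argv[1])
        print(count, " :: ", total)
```

The csv module also handles quoted fields that contain the delimiter, which a plain split('|') would not; if the data never quotes fields, a plain split is likely simpler and no slower.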

Answer 1
1 accepted 2018-08-01 03:05:57

Answer 2
1 2018-08-01 03:26:25
