
Python read csv line from gzipped file

I am trying to parse a gzipped csv file (where the fields are separated by | characters), to test whether reading the file directly in Python is faster than zcat file.gz | python when parsing the contents.

I have the following code:

#!/usr/bin/python3

import gzip

if __name__ == "__main__": 
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        split_line = line.split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)

But I get the following error:

$ ./PyZip.py 
Traceback (most recent call last):
  File "./PyZip.py", line 11, in <module>
    split_line = line.split('|')
TypeError: a bytes-like object is required, not 'str'

How can I modify this to read each line and split it properly?

I'm interested mainly in just the 52nd field as delimited by |. The lines in my input file are like the following:

field1|field2|field3|...field52|field53

Is there a faster way than what I have to sum all the values in the 52nd field?

Thanks!

You should decode each line before splitting, because gzip.open with mode 'r' reads in binary mode and returns bytes rather than str:

split_line = line.decode('utf-8').split('|')
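As an alternative to decoding every line by hand, gzip.open can be asked for text mode with 'rt', in which case each line already arrives as a str. The following is a minimal sketch under that assumption; the function name sum_field_text_mode and its parameters are hypothetical and not part of the original question.

```python
import gzip


def sum_field_text_mode(path, index, delimiter="|", encoding="utf-8"):
    """Return (line_count, total) for the integer field at 0-based `index`.

    Hypothetical helper for illustration only.
    """
    total = 0
    count = 0
    # 'rt' opens the gzip stream in text mode, so lines are str, not bytes
    with gzip.open(path, "rt", encoding=encoding) as f:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            total += int(fields[index])
            count += 1
    return count, total
```

Something like sum_field_text_mode('SmallData.DAT.gz', 52) would then correspond to the loop in the question, with index 52 kept exactly as the question wrote it.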

The code you have for summing all the values in the 52nd field is fine. There's no fundamentally faster way, because every line has to be read and split in order to find the 52nd field.
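Every line does have to be read, but some per-line overhead may still be avoidable. The sketch below iterates over the file lazily instead of calling readlines(), stays in bytes to skip the decode step, and passes a maxsplit argument so splitting stops once the wanted field is isolated. It assumes the field holds plain ASCII digits; the name sum_field_streaming is hypothetical, and whether this is measurably faster on real data would need to be timed.

```python
import gzip


def sum_field_streaming(path, index, delimiter=b"|"):
    """Return (line_count, total) for the integer field at 0-based `index`.

    Hypothetical helper; works on bytes and assumes ASCII digits.
    """
    total = 0
    count = 0
    with gzip.open(path, "rb") as f:
        # Iterating the file object yields one line at a time,
        # so the whole file is never held in memory as a list
        for line in f:
            # Split at most index + 1 times; the rest of the line stays unsplit
            fields = line.split(delimiter, index + 1)
            # int() accepts bytes and ignores surrounding whitespace
            total += int(fields[index])
            count += 1
    return count, total
```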

Just try decoding the bytes object to a string, i.e.,

line.decode('utf-8')

Updated script:

#!/usr/bin/python3
import gzip

if __name__ == "__main__": 
    total=0
    count=0

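    # Mode 'r' is binary for gzip.open, so each line is a bytes object
    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        # Decode to str before splitting on the '|' delimiter
        split_line = line.decode("utf-8").split('|')
        # Index 52 is the 53rd field (0-based); use 51 for the 52nd
        total += int(split_line[52])
        count += 1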

    print(count, " :: ", total)
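Since the file is described as a csv with | as the separator, the standard csv module is another option. This is a sketch only, assuming the data may contain quoted fields that a plain split would mishandle; the name sum_field_csv is hypothetical, and csv.reader may well be slower than str.split on simple input.

```python
import csv
import gzip


def sum_field_csv(path, index, delimiter="|"):
    """Return (row_count, total) for the integer field at 0-based `index`.

    Hypothetical helper for illustration only.
    """
    total = 0
    count = 0
    # newline="" lets the csv module handle line endings itself
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            total += int(row[index])
            count += 1
    return count, total
```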
