
recommendation for python to read 100GB .csv.gz file

I have a ".csv.gz" file that is 100GB in size on a remote Linux machine. I definitely do not want to unzip it, because the uncompressed size would reach about 1TB.

I have been looking online for ways to read such a file, and I saw a suggestion here:

python: read lines from compressed text files python:从压缩的文本文件中读取行

gzip? pandas? iterator?

My mentor suggested piping the data after unzipping it.

I also need to consider memory usage, so readlines() is definitely not an option.

I wonder if anyone has an optimal solution for this, because the file is really large and just doing anything with it takes a lot of time.
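
For reference, the pandas route mentioned above can stream the compressed CSV in chunks without unzipping it to disk. A minimal sketch, assuming the file is named 100GB.csv.gz and is an ordinary comma-separated file (adjust the path, chunk size, and parsing options to the real data):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file at once; compression is inferred from the .gz suffix
for chunk in pd.read_csv("100GB.csv.gz", chunksize=100_000):
    # each chunk is a DataFrame of up to 100,000 rows
    process(chunk)  # hypothetical placeholder for your own processing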

You can pipe the file into Python in chunks and process it line by line with for line in sys.stdin: ...

zcat 100GB.csv.gz | python <my-app>

Read the lines one by one by doing:

import sys

# lines arrive on stdin already decompressed by zcat in the pipe
for line in sys.stdin:
    do_sth_with_the_line(line)  # placeholder for your per-line processing

You call this Python script with:

zcat 100GB.csv.gz | python python_script.py
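
If you would rather not rely on a shell pipe, the standard-library gzip module gives the same line-by-line streaming from within Python itself. A minimal sketch under the same file-name assumption:

import gzip

# "rt" opens the compressed file in text mode; decompression happens
# on the fly, so only the current line is held in memory
with gzip.open("100GB.csv.gz", "rt") as f:
    for line in f:
        do_sth_with_the_line(line)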
