
recommendation for python to read 100GB .csv.gz file

I have a ".csv.gz" file that is 100GB in size on a remote Linux machine. I definitely do not want to unzip it, because the uncompressed size would reach about 1TB.

I have been looking online for ways to read such a file, and I saw a suggestion here:

python: read lines from compressed text files python:从压缩的文本文件中读取行

gzip? pandas? iterator?

My mentor suggested piping the data after unzipping it.

I also need to consider memory usage, so readlines() is definitely not an option.

I wonder if anyone has an optimal solution for this, because the file is really large and just doing anything with it takes a lot of time.
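
For reference, the pandas route mentioned above can stream the compressed CSV in chunks without unzipping it to disk. A minimal sketch, assuming the file is named 100GB.csv.gz and is an ordinary comma-separated file (adjust the path, chunk size, and parsing options to the real data):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file at once; compression is inferred from the .gz suffix
for chunk in pd.read_csv("100GB.csv.gz", chunksize=100_000):
    # each chunk is a DataFrame of up to 100,000 rows
    process(chunk)  # hypothetical placeholder for your own processing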

You can pipe the file into Python in chunks and process it line by line with for line in sys.stdin: ...

zcat 100GB.csv.gz | python <my-app>

Read the lines one by one by doing:

import sys

# lines arrive on stdin already decompressed by zcat in the pipe
for line in sys.stdin:
    do_sth_with_the_line(line)  # placeholder for your per-line processing

You call this Python script with:

zcat 100GB.csv.gz | python python_script.py
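
If you would rather not rely on a shell pipe, the standard-library gzip module gives the same line-by-line streaming from within Python itself. A minimal sketch under the same file-name assumption:

import gzip

# "rt" opens the compressed file in text mode; decompression happens
# on the fly, so only the current line is held in memory
with gzip.open("100GB.csv.gz", "rt") as f:
    for line in f:
        do_sth_with_the_line(line)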
