Recommendation for Python to read a 100GB .csv.gz file
I have a ".csv.gz" file that is 100GB large in a remote linux. 我在远程Linux中有一个100GB大的“ .csv.gz”文件。 I definitely do not want to unzip it because the size would reach to 1T. 我绝对不想解压缩它,因为它的大小可以达到1T。
I have been looking online for ways to read the file, and saw a suggestion here:
python: read lines from compressed text files
gzip? pandas? iterator?
My mentor suggested piping the data after unzipping it.
I would also need to consider memory, so readlines() is definitely not an option.
I wonder if anyone has an optimal solution for this, because the file is really large and just doing anything with it takes a lot of time.
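For reference, a minimal sketch of the gzip and pandas approaches I am considering (the path big.csv.gz and the chunk size are placeholders, not from the original post):

import gzip

# gzip.open in text mode decompresses lazily, so iterating the file
# object yields one line at a time with constant memory use.
with gzip.open("big.csv.gz", "rt") as f:
    for line in f:
        pass  # per-line processing goes here

import pandas as pd

# pandas can also decompress on the fly and parse in fixed-size chunks;
# each chunk is an ordinary DataFrame.
for chunk in pd.read_csv("big.csv.gz", compression="gzip", chunksize=100_000):
    pass  # per-chunk processing goes here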
You can pipe the file into Python and handle it one line at a time (for line in sys.stdin: ...):
zcat 100GB.csv.gz | python <my-app>
Inside the script, read the lines one by one by doing:
import sys

for line in sys.stdin:
    # placeholder from the original answer: replace with your per-line processing
    do_sth_with_the_line(line)
You call this Python script with:
zcat 100GB.csv.gz | python python_script.py
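Since each line is a CSV record, a small variation of this sketch (assuming comma-delimited input) wraps stdin in csv.reader to get parsed fields instead of raw strings:

import csv
import sys

# csv.reader consumes sys.stdin lazily, one row at a time, and correctly
# handles quoted fields that a naive line.split(",") would break on.
for row in csv.reader(sys.stdin):
    pass  # row is a list of column strings; process it here

It is invoked the same way (zcat 100GB.csv.gz | python python_script.py), so nothing is ever decompressed to disk.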