
Finding #ROWS in a large CSV File

The aim is to find the total number of rows in a large CSV file. I'm using Python Dask to find it for now, but as the file size is around 45 GB it takes quite some time. Unix cat piped to wc -l seems to perform better.

So the question is: are there any tweaks for dask/pandas read_csv that would make it find the total number of rows faster?
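For reference, the dataframe-based approach being timed is presumably something like the sketch below (the exact code isn't shown in the question, and file.csv stands in for the real path):

import dask.dataframe as dd

# Counting rows through a full dataframe parse: every field is
# tokenized and converted to a typed column before len() can run
ddf = dd.read_csv("file.csv")
print(len(ddf))  # triggers a complete read of the 45 GB file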

Dask dataframe will spend 90% of its time parsing your text into various numerical types like int, float, etc. You don't need any of this, so it's best not to build anything like a dataframe.

You could use dask.bag, which would be faster and simpler:

import dask.bag as db

# Each bag element is one line of text; count() tallies them
db.read_text("...").count().compute()
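Note that read_text yields one element per line, so the header line is included in the count; subtract one if you only want data rows.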

But in truth wc -l is going to be about as fast as anything else. You should be entirely bound by your disk speed here, not by compute power. Dask helps you leverage multiple cores on your CPU, but those aren't the bottleneck in this case, so Dask isn't the right tool; wc is.

You can try subprocess in Python code:

import subprocess

file_name = "file.csv"
# wc -l prints "<count> <filename>"; keep the first field
row_count = int(subprocess.check_output(["wc", "-l", file_name]).split()[0])
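If shelling out isn't an option (for example on Windows, where wc isn't available), a buffered pure-Python newline count is similarly I/O-bound; a minimal sketch, with count_lines as a hypothetical helper name:

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by scanning the file in 1 MiB binary chunks."""
    count = 0
    with open(path, "rb") as f:  # hypothetical helper, not from the original answer
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count

print(count_lines("file.csv"))

As with wc -l, this counts physical lines: a CSV with quoted fields that contain embedded newlines will report more lines than logical rows.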
