
How can I read a large CSV file into Python with speed?

I'm trying to load a ~67 GB dataframe (6,000,000 features by 2,300 rows) into Dask for machine learning. I'm using a 96-core machine on AWS that I wish to utilize for the actual machine learning bit. However, Dask loads CSVs in a single thread. It has already taken a full 24 hours and it hasn't loaded.

# I tried to display a progress bar, but it doesn't show anything for dask's read_csv
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()

df = dd.read_csv('../Larger_than_the_average_CSV.csv')

Is there a faster way to load this into Dask and make it persistent? Should I switch to a different technology (Spark on Scala, or PySpark)?

Dask is probably still loading it, as I can see a steady 100% CPU utilization in top.

The code you show in the question probably takes no time at all, because you are not actually loading anything, just setting up the job prescription. How long the real loading takes will depend on the chunksize you specify.
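Not from the original answer, just a minimal sketch of that point: read_csv only builds the task graph, and the blocksize keyword (the partition size, assumed here to be 256 MB) controls how the file will later be split for parallel parsing.

import dask.dataframe as dd

df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize='256MB')
print(df.npartitions)  # number of lazy partitions; no data has been read yet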

There are two main bottlenecks to consider for actual loading:

  • getting the data from disc into memory, raw data transfer over a single disc interface,
  • parsing that data into in-memory stuff.

There is not much you can do about the former if you are on a local disc, and you would expect it to be a small fraction of the total time.

The latter may suffer from the GIL, even though dask will execute in multiple threads by default (which is why it may appear that only one thread is being used). You would do well to read the dask documentation about the different schedulers, and should try using the distributed scheduler, even though you are on a single machine, with a mix of threads and processes.
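A hedged sketch of that suggestion, assuming a 96-core box and an illustrative (untuned) worker/thread split; the distributed scheduler's Client starts a local cluster of worker processes, which sidesteps the GIL during CSV parsing.

from dask.distributed import Client
import dask.dataframe as dd

# Local cluster of worker processes, each with a few threads (numbers are illustrative)
client = Client(n_workers=24, threads_per_worker=4)

df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize='256MB')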

Finally, you probably don't want to "load" the data at all, but process it. Yes, you can persist into memory with Dask if you wish (dask.persist, funnily), but please do not use many workers to load the data just so you then make it into a Pandas dataframe in your client process memory.
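Continuing the sketch above (not code from the answer; the column name 'feature_0' is hypothetical): persist keeps the parsed partitions spread across the workers' memory, and further computations run against them there rather than pulling everything back into one Pandas frame on the client.

# Keep the parsed partitions in distributed worker memory
df = df.persist()

# Downstream work operates on the persisted partitions in parallel
mean_value = df['feature_0'].mean().compute()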
