
Comparison between Modin | Dask | Data.table | Pandas for parallel processing and out-of-memory csv files

What are the fundamental differences and primary use-cases for Dask | Modin | Data.table?

I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.

I'm trying to decide which of the three tools to learn for parallel / out-of-memory computing: dask , modin or datatable ( pandas is not a parallel tool, nor is it aimed at out-of-memory computing).

I didn't find any out-of-memory tools in the datatable documentation (discussed here ), hence I'm only focusing on modin and dask .

In short, modin is trying to be a drop-in replacement for the pandas API, while dask is lazily evaluated. modin is a column store, while dask partitions data frames by rows. The distribution engine behind dask is centralized, while that of modin (called ray ) is not. Edit: modin now supports dask as a calculation engine too.

dask came first, has a large ecosystem, and looks really well documented, discussed in forums, and demonstrated in videos. modin ( ray ) has some design choices which allow it to be more flexible in terms of resilience to hardware errors and high-performance serialization. ray aims at being most useful in AI research, but modin itself is of general use. ray also targets real-time applications, to better support real-time reinforcement learning.

More details here and here .

I have a task of dealing with daily stock trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all 3 libraries on read_csv and a groupby mean . Based on this little test, my choice is dask . Below is a comparison of the 3:

| library      | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin        | 175s            | 150s           |
| dask         | 0s (lazy load)  | 27s            |
| dask persist | 26s             | 1s             |
| datatable    | 8s              | 6s             |

It seems that modin is not as efficient as dask at the moment, at least for my data. dask persist tells dask that your data could fit into memory, so it takes some time up front to load everything instead of lazily. datatable holds all data in memory from the start and is super fast on both read_csv and groupby. However, given its incompatibility with pandas, it seems better to use dask . Actually I came from R and was very familiar with R's data.table, so I have no problem applying its syntax in Python. If datatable in Python could connect seamlessly to pandas (like data.frame does in R), then it would have been my choice.
