Comparison between Modin | Dask | Data.table | Pandas for parallel processing and out-of-memory CSV files
What are the fundamental differences and primary use cases for Dask | Modin | Data.table?
I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.
I'm trying to decide which of the three tools to learn for parallel / out-of-memory computing: `dask`, `modin`, or `datatable` (`pandas` is not a parallel tool, nor is it aimed at out-of-memory computing).
I didn't find any out-of-memory tools in the `datatable` documentation (discussed here), hence I'm only focusing on `modin` and `dask`.
In short, `modin` is trying to be a drop-in replacement for the `pandas` API, while `dask` is lazily evaluated (nothing is computed until a result is requested). `modin` is a column store, while `dask` partitions data frames by rows. The distribution engine behind `dask` is centralized, while that of `modin` (called `ray`) is not. Edit: `modin` now supports `dask` as a calculation engine too.
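To make the row-partitioning idea concrete, here is a minimal pure-Python sketch of how a group-wise mean can be computed one partition at a time, so that the full data set never has to sit in memory. It is not how `dask` is actually implemented; the function names (`iter_partitions`, `partial_sums`, `groupby_mean`) and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_partitions(rows, size):
    """Yield successive lists of at most `size` rows (row-wise partitions)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def partial_sums(partition, key, value):
    """Aggregate one partition into {group: [sum, count]}."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in partition:
        entry = acc[row[key]]
        entry[0] += float(row[value])
        entry[1] += 1
    return acc


def groupby_mean(rows, key, value, partition_size=2):
    """Combine per-partition sums and counts into one mean per group."""
    total = defaultdict(lambda: [0.0, 0])
    for partition in iter_partitions(rows, partition_size):
        for group, (s, n) in partial_sums(partition, key, value).items():
            total[group][0] += s
            total[group][1] += n
    return {group: s / n for group, (s, n) in total.items()}


# Hypothetical sample data standing in for a large CSV file.
SAMPLE_CSV = "ticker,price\nAAA,10\nBBB,20\nAAA,30\nBBB,60\nAAA,50\n"

# csv.DictReader yields rows one by one, so only one partition is held at a time.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
means = groupby_mean(reader, key="ticker", value="price")
print(means)
```

The key point is that a mean cannot be averaged across partitions directly; each partition contributes a sum and a count, and only the combined totals are divided at the end.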
`dask` came first, has a large ecosystem, and looks really well documented, discussed in forums, and demonstrated in videos. `modin` (`ray`) has some design choices that allow it to be more flexible in terms of resilience to hardware errors and high-performance serialization. `ray` aims at being most useful in AI research, but `modin` itself is of general use. `ray` also aims at real-time applications, to better support real-time reinforcement learning.
I have a task of dealing with daily stock trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all 3 libraries on `read_csv` and `groupby` mean. Based on this little test, my choice is `dask`. Below is a comparison of the 3:
| library | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin | 175s | 150s |
| dask | 0s (lazy load) | 27s |
| dask persist | 26s | 1s |
| datatable | 8s | 6s |
It seems that `modin` is not as efficient as `dask` at the moment, at least for my data. `dask persist` tells `dask` that your data can fit into memory, so it takes some time for dask to load everything up front instead of loading lazily. `datatable` holds all data in memory from the start and is super fast in both read_csv and groupby. However, given its incompatibility with pandas, it seems better to use `dask`. Actually, I came from R and am very familiar with R's data.table, so I have no problem applying its syntax in Python. If `datatable` in Python could be seamlessly connected to pandas (like it is with data.frame in R), then it would have been my choice.