
R equivalent of Python's dask

Is there an equivalent package in R to Python's dask? Specifically, for running machine learning algorithms on larger-than-memory datasets on a single machine.

Link to Python's Dask page: https://dask.pydata.org/en/latest/

From the Dask website:

Dask natively scales Python

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.

But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.

I am developing a simple library called disk.frame that has the potential to take on dask one day. It uses the fst file format and data.table to manipulate large amounts of data on disk. As of now, it doesn't have a cluster module but given that it uses future in the background and future can have cluster back-ends, it is a possibility in the future.
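For illustration, here is a minimal sketch of how disk.frame can be used, assuming its current API (setup_disk.frame(), csv_to_disk.frame() plus dplyr verbs); the file name and column names are placeholders:

library(disk.frame)
library(dplyr)

## use several local workers (the number is just an example)
setup_disk.frame(workers = 4)

## convert a large CSV into a chunked, on-disk disk.frame
## ("flights.csv" and the columns below are placeholders)
flights.df <- csv_to_disk.frame("flights.csv", outdir = "flights.df")

## familiar dplyr verbs run chunk by chunk; collect() brings the result into RAM
flights.df %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()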

There is also multidplyr in the works by Hadley and co.
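A rough sketch of the kind of workflow multidplyr is aiming for (the API may still change while it is in development):

library(multidplyr)
library(dplyr)

## start 4 local R worker processes
cluster <- new_cluster(4)

## shard the groups across the workers, summarise in parallel,
## then gather the results back into the main session
mtcars %>%
  group_by(cyl) %>%
  partition(cluster) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()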

Currently, I have used disk.frame successfully to manipulate datasets with hundreds of millions of rows and hundreds of columns.

If you are willing to look beyond R, then JuliaDB.jl in the Julia ecosystem is something to look out for.

As a general matter, R, in its native use, operates on data in RAM. Depending on your operating system, when R requires more than the available memory, portions are swapped out to disk. The normal result is thrashing that will bring your machine to a halt. In Windows, you can watch the Task Manager and cry.

There are a few packages that promise to manage this process. RevoScaleR from Microsoft is one. It is not open source and is not available from CRAN. I am as skeptical of software add-ons to R as I am of bolt-on gadgets that promise better fuel economy in your car. There are always trade-offs.

The simple answer is that there is no free lunch in R. A download will not be as effective as some new DIMMs for your machine. You are better off looking at your code first. If that doesn't work, then hire a properly-sized configuration in the cloud.

I think it is worth taking a look at the Apache Arrow project ( https://arrow.apache.org/ ) and its integration with several languages, R among them ( https://arrow.apache.org/docs/r/ ).

I have tested the example on 112 million rows, and it works amazingly fast!
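As a hedged sketch of the kind of code involved (the directory name and columns are placeholders), arrow exposes a dplyr interface over datasets that stay on disk until collect() is called:

library(arrow)
library(dplyr)

## open a directory of Parquet files lazily, without loading it into RAM
ds <- open_dataset("taxi_parquet/", format = "parquet")

## filter/group_by/summarise are pushed down to Arrow;
## only the aggregated result is materialised in R
ds %>%
  filter(passenger_count > 1) %>%
  group_by(vendor_id) %>%
  summarise(mean_fare = mean(fare_amount)) %>%
  collect()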

As I came across this question, I felt it was important to address this issue, since Dask does not exist in R and it is difficult to find an alternative to it. However, R also provides good solutions. Below I address some of them:

  1. Always read data frames using the data.table::fread() function. It reads the data far more efficiently, in both speed and memory, than base R (see the sketch after this list).
  2. Use the ff or bigmemory package if your data frame is not extremely big; otherwise, use the matter package. Below is a link to matter:

https://bioconductor.org/packages/release/bioc/manuals/matter/man/matter.pdf
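To illustrate point 1, here is a minimal fread() sketch (the file and column names are placeholders); the select and nrows arguments help keep memory usage down:

library(data.table)

## read only the columns (and, if needed, rows) you actually need
dt <- fread("big_file.csv",
            select = c("id", "value"),  # subset of columns
            nrows  = 1e6)               # optional cap on rows

## data.table then modifies by reference, avoiding extra copies
dt[, .(mean_value = mean(value)), by = id]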

What is nice about the matter package is that it allows chunking data and delayed pre-processing, which is exactly what Dask does. You can use the verbose = TRUE argument to see the chunks as they are processed; I would say this is quite similar to Dask's client dashboard. To get the best out of this, set BPPARAM = bpparam() when calling chunk_apply(). An example would look like this:

## Operate on elements/rows/columns
chunk_apply(X, MARGIN, FUN, ...,
            simplify = FALSE, chunks = 20, outpath = NULL,
            verbose = TRUE, BPPARAM = bpparam())

You can also check the registered back-ends using BiocParallel::registered(), or register one yourself, for example MulticoreParam() if you are on Linux or macOS.

As for the FUN argument, you can write a custom function and plug it in, say function(x) { SomeTasksOn(x) }.
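To make this concrete, here is a minimal sketch assuming the chunk_apply() signature shown above and a BiocParallel back-end; a plain in-memory matrix is used purely for illustration, whereas in practice X would be a large on-disk matter matrix:

library(matter)
library(BiocParallel)

## register a parallel back-end (MulticoreParam works on Linux/macOS)
register(MulticoreParam(workers = 4))

## in practice X would be created with matter_mat() and live on disk;
## a small in-memory matrix is used here only to keep the example runnable
X <- matrix(rnorm(1e4), nrow = 100, ncol = 100)

## apply a custom FUN to each row, 20 chunks at a time, in parallel
row_means <- chunk_apply(X, MARGIN = 1,
                         FUN = function(x) mean(x),
                         chunks = 20, verbose = TRUE,
                         BPPARAM = bpparam())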

This method has been used extensively in the Cardinal package to process huge biological datasets that are far too large to fit in memory. An example workflow:

http://bioconductor.org/packages/release/data/experiment/vignettes/CardinalWorkflows/inst/doc/MSI-classification.html#pre-processing

They also tried to do this using regularized regression with neighborhood convolution, then extracted the relevant features per class of interest. It works pretty well and fast.

To get a view of why and how it does this well, I would suggest this article (Link: https://academic.oup.com/bioinformatics/article/33/19/3142/3868724 ), where they compare matter to R matrices, bigmemory, and ff for tasks like glm and lm.


So I would suggest giving it a try and seeing if it helps with processing big data.
