Is there an equivalent package in R to Python's Dask? Specifically, for running machine learning algorithms on larger-than-memory data sets on a single machine.

Link to Python's Dask page: https://dask.pydata.org/en/latest/
From the Dask website:
Dask natively scales Python
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.
But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.
I am developing a simple library called disk.frame that has the potential to take on dask one day. It uses the fst file format and data.table to manipulate large amounts of data on disk. As of now it doesn't have a cluster module, but given that it uses future in the background, and future can have cluster back-ends, it is a possibility in the future.
There is also multidplyr in the works by Hadley and co.
Currently, I have used disk.frame successfully to manipulate datasets with hundreds of millions of rows and hundreds of columns.
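A minimal sketch of what using disk.frame looks like (the CSV file name, output directory, and column name are placeholders for this example, not from the original answer):

```r
library(disk.frame)
library(dplyr)

# Use multiple workers (via the future package) for parallel chunk processing
setup_disk.frame(workers = 4)

# Convert a larger-than-memory CSV into a disk.frame, stored as fst chunks on disk
df <- csv_to_disk.frame("big.csv", outdir = "big.df")

# dplyr verbs run chunk-by-chunk on disk; collect() brings the result into RAM
result <- df %>%
  group_by(some_group) %>%
  summarise(n = n()) %>%
  collect()
```

Only the final `collect()` materialises data in memory, so the grouped summary can be computed on data that never fits in RAM at once.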
If you are willing to look beyond R, then JuliaDB.jl in the Julia ecosystem is something to look out for.
As a general matter, R, in its native use, operates on data in RAM. Depending on your operating system, when R requires more than the available memory, portions are swapped out to disk. The normal result is thrashing that will bring your machine to a halt. In Windows, you can watch the Task Manager and cry.
There are a few packages that promise to manage this process. RevoScaleR from Microsoft is one. It is not open source and is not available from CRAN. I am as skeptical of software add-ons to R as bolt-on gadgets that promise better fuel economy in your car. There are always trade-offs.
The simple answer is that there is no free lunch in R. A download will not be as effective as some new DIMMs for your machine. You are better off looking at your code first. If that doesn't work, then hire a properly-sized configuration in the cloud.
I think it is worth taking a look at the Apache Arrow project (https://arrow.apache.org/) and its integration with several languages, among them R (https://arrow.apache.org/docs/r/).
I have tested the example on 112 million rows, and it works amazingly fast!
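A rough sketch of the larger-than-memory workflow with the arrow package (the dataset path and column names below are placeholders, not from the original answer):

```r
library(arrow)
library(dplyr)

# Open a directory of Parquet files as a Dataset without loading it into RAM
ds <- open_dataset("path/to/parquet_dir")

# dplyr verbs are translated to Arrow compute expressions and evaluated lazily;
# only collect() materialises the (hopefully small) result in memory
res <- ds %>%
  filter(year == 2020) %>%
  group_by(category) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()
```

Because the filter and aggregation are pushed down to Arrow's scanner, only the matching row groups are read from disk.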
As I came across this question, I felt it was important to address it, since Dask does not exist in R and it is difficult to find a direct alternative. However, R also provides good solutions. Below I address some:
- The data.table::fread() function. This reads data very fast by memory-mapping the file (note that the resulting data.table still lives in RAM).
- The ff or bigmemory packages, if your data frame is not extremely big. Otherwise, use the matter package. Here is a link to the matter manual: https://bioconductor.org/packages/release/bioc/manuals/matter/man/matter.pdf
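As a minimal illustration of the file-backed approach these packages share (shown here with bigmemory; the file names are placeholders I chose for the example):

```r
library(bigmemory)

# Create a file-backed matrix: the data lives on disk, not in RAM,
# and is memory-mapped on demand
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "big.bin",
                           descriptorfile = "big.desc")

# Indexing works like an ordinary R matrix, but only the touched
# pages are pulled into memory
x[1, ] <- c(1, 2, 3)
x[1, 2]
```

The descriptor file lets other R sessions attach the same on-disk matrix, which is handy for parallel workers.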
What is nice about the matter package is that it allows chunking data and delayed pre-processing, which is exactly what Dask does. You can use the verbose = TRUE argument to watch the chunks as they are processed; I would say this is quite similar to Dask's client dashboard. To get the best out of this, set BPPARAM = bpparam() when calling chunkApply().
An example would look like this:

```r
## Operate on elements/rows/columns
chunk_apply(X, MARGIN, FUN, ...,
            simplify = FALSE, chunks = 20, outpath = NULL,
            verbose = TRUE, BPPARAM = bpparam())
```
You can also check the registered parallel back-ends using BiocParallel::registered(), or register one yourself, for example MulticoreParam() if you are on Linux or macOS.
As for the FUN argument, you can write a custom function and plug it in, say function(x) { someTasksOn(x) }.
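Putting the pieces together, here is a hedged sketch of chunked processing with a registered back-end (the matrix dimensions and the per-column function someTaskOn() are placeholders I made up for illustration):

```r
library(BiocParallel)
library(matter)

# Register a multicore back-end (Linux/macOS); on Windows use SnowParam() instead
register(MulticoreParam(workers = 4))

# A file-backed matrix that is never fully loaded into RAM
X <- matter_mat(nrow = 1e5, ncol = 10)

# A custom per-column function, plugged in via the FUN argument
someTaskOn <- function(x) sum(x^2)

# Process the matrix column-by-column in 20 chunks, in parallel
res <- chunk_apply(X, FUN = someTaskOn, MARGIN = 2,
                   chunks = 20, verbose = TRUE, BPPARAM = bpparam())
```

With verbose = TRUE you can watch each chunk being read, processed, and released, so peak memory stays bounded by the chunk size rather than the data size.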
This method has been used extensively in the Cardinal package to process huge biological datasets that are far too large to fit in memory. In one example workflow, the authors applied regularized regression with neighborhood convolution and then extracted the relevant features per class of interest. It works well and fast.
To get a view of why and how it does this well, I would suggest this article (https://academic.oup.com/bioinformatics/article/33/19/3142/3868724), where the authors compare matter to in-memory R matrices, bigmemory, and ff on tasks like glm and lm.
So, I would suggest giving it a try and seeing if it helps with processing big data.