
How to cluster big data using Python or R without memory error?

I am trying to cluster a data set with about 1,100,000 observations, each with three values.

The code is pretty simple in R:

df11.dist <- dist(df11cl), where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in this data frame are standardized.

The error I get is: Error: cannot allocate vector of size 4439.0 Gb

Recommendations on similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and 171 GB of virtual memory, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunking the data in hierarchical cluster analysis yields different results, so using a sample of the data seems out of the question.

I have also found this solution, but the answers actually alter the question: they recommend k-means instead. K-means could work if one knows the number of clusters beforehand, but I do not. That said, I ran k-means with different numbers of clusters, but now I don't know how to justify choosing one number over another. Is there any test that can help?

Can you recommend anything in either R or Python?

For obvious reasons, the function dist needs quadratic memory: it stores the distance between every pair of points.

So if you have 1 million (10^6) points, the distance matrix has on the order of 10^12 entries. With double precision, each entry needs 8 bytes. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes, just to store this matrix. Even if you stored it on SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
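The arithmetic is easy to check for the asker's actual size of roughly 1.1 million rows, which lands in the same ballpark as the reported 4439.0 Gb error:

```python
n = 1_100_000                        # approximate number of observations
entries = n * (n - 1) // 2           # lower triangle of the pairwise distance matrix
size_bytes = entries * 8             # 8 bytes per double-precision entry
print(f"{size_bytes / 1024**3:,.0f} GiB")  # roughly 4,508 GiB
```

The exact figure in the error message differs slightly because the real row count is only "about" 1,100,000, but the order of magnitude is the same: several terabytes for the matrix alone.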

And 1 million is still pretty small, isn't it?

Using dist on big data is impossible. End of story.

For larger data sets, you'll need to

  • use methods such as k-means that do not use pairwise distances
  • use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
  • subsample your data to make it smaller

In particular, that last option is a good idea if you don't have a working solution yet: there is no use struggling with the scalability of a method that does not work.
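Subsampling also gives one common answer to the "is there any test?" part of the question: compute the silhouette score on a manageable subsample for several candidate values of k and pick the k that scores highest. A sketch with scikit-learn, on synthetic stand-in blobs (the real data would replace X):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic blobs standing in for the real standardized data
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(2_000, 3)) for c in (0.0, 3.0, 6.0)])

# score candidate cluster counts on a subsample; silhouette lies in [-1, 1],
# and higher means tighter, better-separated clusters
sample = X[rng.choice(len(X), 1_500, replace=False)]
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
    scores[k] = silhouette_score(sample, labels)
best_k = max(scores, key=scores.get)
```

Scoring on a subsample matters here because silhouette_score itself is quadratic in the number of points it is given, so running it on all 1.1 million rows would hit the same wall as dist.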
