
How to cluster big data using Python or R without memory error?

I am trying to cluster a data set with about 1,100,000 observations, each with three values.

The code is pretty simple in R:

df11.dist <- dist(df11cl), where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in this data frame are standardized.

The error I get is: Error: cannot allocate vector of size 4439.0 Gb

Recommendations on similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and 171 GB of virtual memory, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunking the data in hierarchical cluster analysis yields different results, so using a sample of the data seems out of the question.

I have also found this solution, but the answers actually alter the question: they recommend k-means instead. K-means could work if one knows the number of clusters beforehand, but I do not. That said, I ran k-means with different numbers of clusters, but now I don't know how to justify choosing one number over another. Is there any test that can help?

Can you recommend anything in either R or Python?

For obvious reasons, the function dist needs quadratic memory: it stores the distance between every pair of points.

So if you have 1 million (10^6) points, the distance matrix has on the order of 10^12 entries. With double precision, each entry needs 8 bytes. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes, just to store this matrix. Even if you stored it on SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
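The arithmetic is easy to check for the asker's actual size of roughly 1.1 million rows, which lands in the same ballpark as the reported 4439.0 Gb error:

```python
n = 1_100_000                        # approximate number of observations
entries = n * (n - 1) // 2           # lower triangle of the pairwise distance matrix
size_bytes = entries * 8             # 8 bytes per double-precision entry
print(f"{size_bytes / 1024**3:,.0f} GiB")  # roughly 4,508 GiB
```

The exact figure in the error message differs slightly because the real row count is only "about" 1,100,000, but the order of magnitude is the same: several terabytes for the matrix alone.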

And 1 million is still pretty small, isn't it?

Using dist on big data is impossible. End of story.

For larger data sets, you'll need to

  • use methods such as k-means that do not use pairwise distances
  • use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
  • subsample your data to make it smaller

In particular, that last option is a good idea if you don't have a working solution yet: there is no use struggling with the scalability of a method that does not work.
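Subsampling also gives one common answer to the "is there any test?" part of the question: compute the silhouette score on a manageable subsample for several candidate values of k and pick the k that scores highest. A sketch with scikit-learn, on synthetic stand-in blobs (the real data would replace X):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic blobs standing in for the real standardized data
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(2_000, 3)) for c in (0.0, 3.0, 6.0)])

# score candidate cluster counts on a subsample; silhouette lies in [-1, 1],
# and higher means tighter, better-separated clusters
sample = X[rng.choice(len(X), 1_500, replace=False)]
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
    scores[k] = silhouette_score(sample, labels)
best_k = max(scores, key=scores.get)
```

Scoring on a subsample matters here because silhouette_score itself is quadratic in the number of points it is given, so running it on all 1.1 million rows would hit the same wall as dist.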
