
ValueError: Maximum allowed dimension exceeded, AgglomerativeClustering fit_predict

I'm trying to fit hierarchical clustering on a 23-dimensional dataset of 100,000 objects. How can I solve the following error?

>>>ac = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')
>>>k=hf.features_itter(hf.file)
>>>k


array([[49,  0,  3, ...,  0,  0,  3],
       [39,  1,  4, ...,  0,  0,  3],
       [25,  0,  3, ...,  0,  0,  1],
       ...,
       [21,  0,  6, ...,  0,  0,  1],
       [47,  0,  8, ...,  0,  0,  2],
       [28,  1,  2, ...,  0,  1,  3]], dtype=uint8)

>>>res = ac.fit_predict(k)

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    hierarchical()
  File "C:\Users\Tolis\Downloads\WPy-3670\notebooks\ergasia\clustering.py", line 39, in hierarchical
    ac.fit_predict(k)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\base.py", line 355, in fit_predict
    self.fit(X)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 830, in fit
    **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\externals\joblib\memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 584, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 470, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
    y = distance.pdist(y, metric)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
    dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
ValueError: Maximum allowed dimension exceeded


I guess there's no elegant solution to this issue with agglomerative clustering, because of the properties of the algorithm itself. It measures the distances between all pairs of objects when the function

y = distance.pdist(y, metric)

is invoked inside AgglomerativeClustering.

So, the AgglomerativeClustering algorithm does not scale well to large or even medium-sized datasets:

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n^3) and requires O(n^2) memory, which makes it too slow for even medium data sets.

- because it's slow, and it also needs O(n^2) memory. Even if the algorithm used RAM optimally, the matrix of pairwise distances would consume ~1e10 * 4 bytes (~40 GB) of memory, because each float32 value takes 4 bytes and there are 100,000 * 100,000 such measurements. There's probably not enough memory for that.
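
As a back-of-the-envelope check (a minimal sketch; n matches the dataset size from the question, and the allocation mirrors the np.empty call in the traceback, which stores one float64 per pair):

n = 100_000                   # number of objects in the dataset
n_pairs = n * (n - 1) // 2    # condensed distance matrix: one entry per pair
bytes_needed = n_pairs * 8    # np.double (float64) is 8 bytes per value
print(f"{n_pairs:,} pairs -> {bytes_needed / 1e9:.1f} GB")
# 4,999,950,000 pairs -> 40.0 GB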

(I've tested pairwise distances on 100,000 random points on a machine with ~100 GB of RAM: it took a very long time to compute, although it didn't fail.)

Also, it will run for a very long time because of its O(n^3) time complexity.

I suggest trying sklearn.cluster.DBSCAN instead - it behaves similarly on some data (see the sklearn examples), and it runs much faster and consumes far less memory:

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

Memory consumption:

This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(nd) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n). It may attract a higher memory complexity when querying these nearest neighborhoods, depending on the algorithm.

Time complexity: O(n log n) on average, though it depends on the implementation, with a worst case of O(n^2) - far better than the O(n^3) of agglomerative clustering.

Check out this clustering algorithm; it will probably give nice results. The main caveat is that DBSCAN determines the number of clusters automatically, so you can't set it to 2.
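
Here is a minimal sketch of what that could look like on the feature matrix k from the question. The eps and min_samples values are placeholder guesses, not tuned values; they need to be chosen for your data (e.g. with a k-distance plot):

from sklearn.cluster import DBSCAN

# eps and min_samples below are illustrative assumptions, tune per dataset
db = DBSCAN(eps=3.0, min_samples=10)
labels = db.fit_predict(k)    # k is the 100,000 x 23 uint8 array above

# DBSCAN picks the number of clusters itself; label -1 marks noise points
print(set(labels))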

Thanks for the answer! I had to use hierarchical clustering because that was the case being studied, so I followed the solution described at the link.
