
Out of memory error while using clusterdata in MATLAB

I am trying to cluster a matrix (size: 20057x2):

T = clusterdata(X,cutoff);

but I get this error:

??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.

Error in ==> pdist at 211
    Y = pdistmex(X',dist,additionalArg);

Error in ==> linkage at 139
       Z = linkagemex(Y,method,pdistArg);

Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);

Error in ==> kmeansTest at 2
T = clusterdata(X,1);

Can someone help me? I have 4GB of RAM, but I think the problem comes from somewhere else.

As others have mentioned, hierarchical clustering needs to compute the pairwise distance matrix, which is too big to fit in memory in your case.

Try using the K-Means algorithm instead:

numClusters = 4;
T = kmeans(X, numClusters);
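
For reference, a slightly fuller sketch of the same call. It assumes the Statistics Toolbox `kmeans`, uses random placeholder data of the same size as in the question, and keeps the guess of 4 clusters, which you would need to adjust for your data. The `'Replicates'` option reruns the algorithm from several random starts and keeps the best solution, which usually helps because k-means can get stuck in a poor local minimum.

```matlab
%# placeholder data with the same size as in the question (assumption)
X = rand(20057, 2);

%# assumed number of clusters; pick a value that suits your data
numClusters = 4;

%# T holds one cluster label per row, centers one centroid per cluster
[T, centers] = kmeans(X, numClusters, 'Replicates', 5);

%# number of points assigned to each cluster
counts = accumarray(T, 1);
```

Unlike `clusterdata`, this never builds the full pairwise distance vector, so memory use grows with the number of rows rather than with its square.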

Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, simply compute its distance to each of the centroids and assign it to the closest one.

Here's some sample code to illustrate the idea above:

%# random data
X = rand(25000, 2);

%# pick a subset
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);

%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3));      %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) );       %# number of clusters found

%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])

%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight

%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end

%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);

%#clustIDX( ind(1:SUBSET_SIZE) ) = C;

%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight

[Figure: dendrogram] [Figure: clusters]

X is too big to cluster this way on a 32-bit machine. pdist (which clusterdata calls) is trying to build a row vector of 201,131,596 doubles, which would use about 1609MB (a double is 8 bytes). If you run it under Windows with the /3GB switch, you're limited to a maximum matrix size of 1536MB (see here).

You're going to need to divide up the data somehow instead of clustering all of it in one go.

PDIST calculates the distances between all possible pairs of rows. If your data contain N=20057 rows, then the number of pairs is N*(N-1)/2, which is 201131596 in your case. That might be too much for your machine.
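
The arithmetic can be checked before calling `pdist`. The following is only a rough sketch: it counts the output vector alone and ignores any temporary copies MATLAB may make, so the real requirement is likely higher.

```matlab
%# number of rows in X (from the question)
N = 20057;

%# pdist stores one distance per unordered pair of rows
numPairs = N*(N-1)/2;

%# each distance is a double, i.e. 8 bytes
bytesNeeded = numPairs * 8;

fprintf('pairs:  %d\n', numPairs);
fprintf('memory: %.0f MB\n', bytesNeeded/1e6);
```

Because the count grows with N^2, halving the number of rows cuts the memory needed to roughly a quarter, which is why clustering a subset is practical.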
