Out of memory error when using clusterdata in MATLAB

Question

I am trying to cluster a matrix (size: 20057x2):

T = clusterdata(X,cutoff);

but I get this error:

??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.

Error in ==> pdist at 211
    Y = pdistmex(X',dist,additionalArg);

Error in ==> linkage at 139
       Z = linkagemex(Y,method,pdistArg);

Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);

Error in ==> kmeansTest at 2
T = clusterdata(X,1);

Can someone help me? I have 4GB of RAM, but I think the problem comes from somewhere else.

Answer 1

As mentioned by others, hierarchical clustering needs to calculate the pairwise distance matrix, which is too big to fit in memory in your case.

Try using the K-Means algorithm instead:

numClusters = 4;
T = kmeans(X, numClusters);
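
As a slightly fuller sketch of the K-Means suggestion, the call below also returns the centroids and restarts from several random initializations. The data is random stand-in data, and the number of clusters and replicates are arbitrary assumptions, not values taken from the question.

```matlab
% Stand-in data with the same shape as in the question (assumption).
X = rand(20057, 2);

numClusters = 4;   % arbitrary choice; tune for your data

% 'Replicates' reruns k-means from different starting points and keeps
% the solution with the lowest total within-cluster distance.
[T, centers] = kmeans(X, numClusters, 'Replicates', 5);

% T is an N-by-1 vector of cluster indices; centers is numClusters-by-2.
disp(size(T));
disp(size(centers));
```

Unlike clusterdata, this never forms the full vector of pairwise distances; the working memory grows with the number of rows times the number of clusters.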

Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, simply compute its distance to each of the centroids and assign it to the closest one.

Here's some sample code to illustrate the idea above:

%# random data
X = rand(25000, 2);

%# pick a subset
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);

%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3));      %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) );       %# number of clusters found

%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])

%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight

%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end

%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);

%#clustIDX( ind(1:SUBSET_SIZE) ) = C;

%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight
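
The assignment loop above can also be written with pdist2, which computes distances between two different sets of rows. This is only a sketch, assuming a MATLAB release in which pdist2 is available in the Statistics Toolbox; X and centers here are stand-in values, not the ones computed above.

```matlab
% Stand-in data and centroids (assumptions for illustration only).
X = rand(25000, 2);
centers = [0.25 0.25; 0.75 0.25; 0.5 0.75];

% N-by-K matrix of Euclidean distances from every point to every centroid.
% This needs N*K doubles, far fewer than the N*(N-1)/2 pairs pdist computes.
D = pdist2(X, centers);

% Index of the nearest centroid for each row of X.
[~, clustIDX] = min(D, [], 2);
```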

Answer 2

X is too big to cluster this way on a 32-bit machine. pdist is trying to make a 201,131,596-element row vector of doubles (clusterdata uses pdist), which would use up about 1609MB (a double is 8 bytes). If you run it under Windows with the /3GB switch, you're limited to a maximum matrix size of 1536MB (see here).

You're going to need to divide up the data somehow instead of directly clustering all of it in one go.

Answer 3

PDIST calculates distances between all possible pairs of rows. If your data contain N=20057 rows, then the number of pairs will be N*(N-1)/2, which is 201131596 in your case. That might be too much for your machine.
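
The pair count above, and the memory it implies, can be checked with a few lines of arithmetic. The only assumption is that each distance is stored as an 8-byte double, as noted in Answer 2.

```matlab
N = 20057;                      % number of rows in X
numPairs = N*(N-1)/2;           % entries in the pdist output vector
bytesNeeded = numPairs * 8;     % 8 bytes per double

fprintf('pairs: %d\n', numPairs);
fprintf('memory: %.0f MB\n', bytesNeeded / 1e6);   % decimal megabytes
```

This reproduces the 201131596 pairs and the roughly 1609MB quoted in the answers, and that is for the distance vector alone, before linkage allocates anything.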


Answer 1: score 13, accepted, 2010-05-31 22:40:26
Answer 2: score 2, 2010-05-31 21:56:41
Answer 3: score 1, 2010-05-31 22:04:42
