Efficient way to generate a histogram from a very large dataset in MATLAB?

Question

I have two 2D arrays of size up to 35,000*35,000 each: indices and dotPs. From these, I want to create two 1D arrays such that pop contains the number of times each number appears in indices, and nn contains the sum of the elements in dotPs that correspond to those numbers. I have come up with the following (really dumb) way:

dotPs = [81.4285    9.2648   46.3184    5.7974    4.5016    2.6779   16.0092   41.1426;
      9.2648   24.3525   11.4308   14.6598   17.9558   23.4246   19.4837   14.1173;
     46.3184   11.4308   92.9264    9.2036    2.9957    0.1164   26.5770   26.0243;
      5.7974   14.6598    9.2036   34.9984   16.2352   19.4568   31.8712    5.0732;
      4.5016   17.9558    2.9957   16.2352   19.6595   16.0678    3.5750   16.7702;
      2.6779   23.4246    0.1164   19.4568   16.0678   25.1084    6.6237   15.6188;
     16.0092   19.4837   26.5770   31.8712    3.5750    6.6237   61.6045   16.6102;
     41.1426   14.1173   26.0243    5.0732   16.7702   15.6188   16.6102   47.3289];

indices = [3     2     1     1     2     1     2     1;
           2     2     1     2     2     1     2     2;
           1     1     3     3     2     2     2     2;
           1     2     3     4     3     3     4     2;
           2     2     2     3     3     1     3     2;
           1     1     2     3     1     8     2     2;
           2     2     2     4     3     2     4     2;
           1     2     2     2     2     2     2     2];


nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
    idx = find(indices==j);   % linear indices of the entries equal to j
    if j == 0 || numel(idx) == 0
        continue
    end

    pop(j) = pop(j) + numel(idx);
    nn(j) = nn(j) + sum(dotPs(idx));   % dotPs(I,J) would select a whole submatrix instead
end

Because of the find function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?

Edit: added small dummy matrices for testing the code.

Answer 1

Both tasks can be done with the accumarray function:

pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';

This assumes that indices only contains positive integers.
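If indices can also contain zeros (the loop in the question skips j == 0), one option is to filter them out before calling accumarray. The following is a minimal sketch under that assumption; the small input matrices and the variable names valid, subs, and vals are made up for illustration:

```matlab
% Hypothetical input: 0 marks entries that should be ignored (an assumption).
indices = [1 0 2; 2 2 0; 3 1 2];
dotPs   = [0.5 9 1.0; 2.0 3.0 9; 4.0 1.5 0.5];

valid = indices > 0;        % logical mask of the usable entries
subs  = indices(valid);     % column vector of positive subscripts
vals  = dotPs(valid);       % matching values, in the same order as subs
n     = max(subs);

pop = accumarray(subs, 1,    [n 1]).';   % occurrences of each value
nn  = accumarray(subs, vals, [n 1]).';   % sum of dotPs per value
```

Logical indexing returns the selected elements in column-major order for both arrays, so subs and vals stay aligned.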

EDIT:

From comments, only the lower part of the indices matrix, excluding the diagonal, should be used, and it is guaranteed to contain positive integers. In that case:

mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask); 
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';
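For arrays close to the 35,000*35,000 size mentioned in the question, the masked copies may not fit comfortably in memory. One possible workaround is to accumulate the result over blocks of columns. This is only a sketch: it uses the full matrix rather than the lower triangle, the block width blockCols is an arbitrary choice, and it assumes the largest label K is known in advance. Whether it helps depends on the available memory.

```matlab
% Small stand-in data; the real arrays would be far larger.
rng(0);
N = 6;                      % matrix size (assumption for the demo)
K = 4;                      % largest label, assumed known beforehand
indices = randi(K, N, N);
dotPs   = rand(N, N);

blockCols = 2;              % hypothetical block width, tune to available memory
pop = zeros(1, K);
nn  = zeros(1, K);
for c0 = 1:blockCols:N
    cols = c0:min(c0 + blockCols - 1, N);
    idx  = indices(:, cols);            % one block of labels
    vals = dotPs(:, cols);              % the matching block of values
    pop  = pop + accumarray(idx(:), 1,       [K 1]).';
    nn   = nn  + accumarray(idx(:), vals(:), [K 1]).';
end
```

Because the output size is fixed to [K 1] in every call, the per-block results can simply be added together.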

Answer 2

First of all, note that the dimension of indices does not matter (e.g. if both indices and dotPs were 1D arrays or 3D arrays, the result would be the same).

pop can be calculated with the histcounts function, but since you also need the sum of the corresponding elements of the dotPs array, the problem becomes harder.
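As a rough illustration of the counting part, histcounts can be given explicit bin edges at the half-integers so that bin j counts exactly the value j. The tiny input here is made up, and the sketch assumes indices contains only positive integers:

```matlab
% Hypothetical small input with positive integer labels.
indices = [3 2 1; 1 2 2];

% Edges at 0.5, 1.5, ..., max + 0.5 put each integer in its own bin.
edges = 0.5:1:(max(indices(:)) + 0.5);
pop = histcounts(indices(:), edges);
```

This covers pop only; the sums in nn still need a separate step.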

Here is a possible solution with a for loop. The advantage of this solution is that it does not call the find function inside the loop, so it should be faster:

%Example input
indices=randi(5,3,3);
dotPs=rand(3,3);

%Solution
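[C,ia,ic]=unique(indices);   % ic(i) is the position in C of indices(i)
nn=zeros(size(C));           % nn(k): sum of dotPs where indices == C(k)
pop=zeros(size(C));          % pop(k): number of occurrences of C(k)
for i=1:numel(indices)
    pop(ic(i))=pop(ic(i))+1;
    nn(ic(i))=nn(ic(i))+dotPs(i);
end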

This solution uses the vector ic to categorize each of the input values. After that, I go through each element and update nn(ic) and pop(ic).
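The same ic vector can also be passed to accumarray, which removes the explicit loop and works for labels that are not contiguous or not positive. This is a sketch that combines the two answers, not something stated in either; the input values are made up:

```matlab
% Hypothetical labels that are neither contiguous nor all positive.
indices = [10 -2; 10 7];
dotPs   = [1 2; 3 4];

[C, ~, ic] = unique(indices);     % C: sorted unique labels, ic: group number per element
pop = accumarray(ic, 1);          % pop(k): number of occurrences of C(k)
nn  = accumarray(ic, dotPs(:));   % nn(k): sum of dotPs where indices == C(k)
```

Note that the results are ordered like C, so pop(k) and nn(k) refer to the label C(k) rather than to the label k.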

Answer 3

For computing pop, you can use hist. For computing nn, I couldn't find a smart solution (but I found a solution without using find):

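% Bin centers 1:max so that bin j counts the value j even if min(indices) > 1
pop = hist(indices(:), 1:max(indices(:)));

nn = zeros(1, max(indices(:)));
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
    nn(j) = sum(dotPs(indices == j));   % logical mask instead of find
end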

There must be a better solution for computing nn.

I found a smarter solution that applies sorting.

I am not sure it's faster, because sorting 35,000*35,000 elements might take a long time.

Sort indices, only to get the permutation needed to sort dotPs by indices.
Sort dotPs according to the permutation returned by the previous sort.
cumsumPop = cumulative sum of pop (the cumulative sum of the histogram of indices).
cumsumPs = cumulative sum of the sorted dotPs.
Now the values of cumsumPop can be used as indices into cumsumPs.
Because cumsumPs is a cumulative sum, we need to use diff to get the solution.

Here is the "smart" solution:

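% Bin centers 1:max so that bin j counts the value j
pop = hist(indices(:), 1:max(indices(:)));

% I is the permutation that groups equal indices together
[sortedIndices, I] = sort(indices(:));
sortedDotPs = dotPs(I);

cumsumPop = cumsum(pop);         % position of the last element of each group
cumsumPs = cumsum(sortedDotPs);

% Assumes the value 1 occurs in indices; otherwise cumsumPop(1) is 0, an invalid index
nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';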
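Since it is unclear whether sorting is faster, a quick measurement on a smaller array can give a first impression. The sketch below times the sort-based approach against a single accumarray call using tic and toc. The sizes N and K are arbitrary stand-ins, and the timings will vary by machine and MATLAB version, so no numbers are claimed here.

```matlab
rng(1);
N = 1000;                   % stand-in size, far below 35,000
K = 50;                     % number of distinct labels (assumption)
indices = randi(K, N, N);
dotPs   = rand(N, N);

% Reference: one accumarray call
tic;
nnAccum = accumarray(indices(:), dotPs(:), [K 1]).';
tAccum = toc;

% Sort-based approach following the steps above
tic;
pop = accumarray(indices(:), 1, [K 1]).';
[~, I] = sort(indices(:));
cumsumPs = cumsum(dotPs(I));
groupEnds = cumsum(pop);    % last position of each group in the sorted order
nnSort = diff([0; cumsumPs(groupEnds(:))]).';
tSort = toc;

fprintf('accumarray: %.4f s, sort + cumsum: %.4f s\n', tAccum, tSort);
```

The two results should agree up to floating-point rounding; differencing a long cumulative sum can lose a few digits of precision compared with summing each group directly.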

Answer 1
3 2019-10-09 22:13:04

Answer 2
1 2019-10-09 20:39:29

Answer 3
1 2019-10-09 21:00:52
