
Efficient way to generate histogram from very large dataset in MATLAB?

I have two 2D arrays, indices and dotPs, each of size up to 35,000*35,000. From these, I want to create two 1D arrays: pop, which contains the number of times each number appears in indices, and nn, which contains the sum of the elements of dotPs at the positions where that number appears in indices. I have come up with the following (really dumb) way:

dotPs = [81.4285    9.2648   46.3184    5.7974    4.5016    2.6779   16.0092   41.1426;
      9.2648   24.3525   11.4308   14.6598   17.9558   23.4246   19.4837   14.1173;
     46.3184   11.4308   92.9264    9.2036    2.9957    0.1164   26.5770   26.0243;
      5.7974   14.6598    9.2036   34.9984   16.2352   19.4568   31.8712    5.0732;
      4.5016   17.9558    2.9957   16.2352   19.6595   16.0678    3.5750   16.7702;
      2.6779   23.4246    0.1164   19.4568   16.0678   25.1084    6.6237   15.6188;
     16.0092   19.4837   26.5770   31.8712    3.5750    6.6237   61.6045   16.6102;
     41.1426   14.1173   26.0243    5.0732   16.7702   15.6188   16.6102   47.3289];

indices = [3     2     1     1     2     1     2     1;
           2     2     1     2     2     1     2     2;
           1     1     3     3     2     2     2     2;
           1     2     3     4     3     3     4     2;
           2     2     2     3     3     1     3     2;
           1     1     2     3     1     8     2     2;
           2     2     2     4     3     2     4     2;
           1     2     2     2     2     2     2     2];


nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
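    idx = find(indices == j);   % linear indices of every element equal to j
    if j == 0 || isempty(idx)
        continue
    end

    pop(j) = pop(j) + numel(idx);
    % dotPs(I,J) would select a whole row-by-column submatrix, so index linearly instead
    nn(j) = nn(j) + sum(dotPs(idx));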
end

Because of the find function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?

Edit: added small dummy matrices for testing the code.

Both tasks can be done with the accumarray function:

pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';

This assumes that indices only contains positive integers.
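As a small sanity check of what the two accumarray calls compute, here is a minimal sketch on a made-up 2-by-3 input (the matrices and the names n and nnRef are illustrative and not part of the original answer). It compares the result with the straightforward per-value masked sum.

```matlab
% Tiny illustrative input (not the data from the question)
indices = [1 2 2; 3 1 2];
dotPs   = [10 20 30; 40 50 60];

n = max(indices(:));                                  % number of output bins

% One pass over the data: count and sum per index value
pop = accumarray(indices(:), 1,        [n 1]).';      % -> [2 3 1]
nn  = accumarray(indices(:), dotPs(:), [n 1]).';      % -> [60 110 40]

% Reference result from the slow per-value masked sum
nnRef = zeros(1, n);
for j = 1:n
    nnRef(j) = sum(dotPs(indices == j));
end
assert(isequal(nn, nnRef));
```

Passing the size [n 1] explicitly keeps the output length fixed at max(indices(:)), so values that never occur simply get a zero.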


EDIT:

From the comments, only the strictly lower triangular part of the indices matrix (below the diagonal) should be used, and it is guaranteed to contain positive integers. In that case:

mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask); 
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';

First of all, note that the dimensionality of indices does not matter (e.g., if both indices and dotPs were 1D or 3D arrays, the result would be the same).

pop can be calculated with the histcounts function, but since you also need the sum of the corresponding elements of the dotPs array, the problem becomes harder.
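For the counting half only, a minimal sketch with histcounts could look like the following. The tiny input and the name edges are illustrative assumptions; the bin edges are placed at half-integers so that each positive integer gets its own bin.

```matlab
% Tiny illustrative input (not the data from the question)
indices = [1 2 2; 3 1 2];

% Unit-width bins centred on 1, 2, ..., max(indices(:))
edges = 0.5 : 1 : max(indices(:)) + 0.5;

pop = histcounts(indices(:), edges);   % -> [2 3 1]
```

This only yields the counts; the sums in nn still need a separate step.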

Here is a possible solution with a for loop. Its advantage is that it does not call the find function inside the loop, so it should be faster:

%Example input
indices=randi(5,3,3);
dotPs=rand(3,3);

%Solution
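[C,~,ic]=unique(indices);   % ic(i) is the position of indices(i) within C
nn=zeros(size(C));
pop=zeros(size(C));
for i=1:numel(indices)
    pop(ic(i))=pop(ic(i))+1;         % pop: number of occurrences of C(ic(i))
    nn(ic(i))=nn(ic(i))+dotPs(i);    % nn: sum of the matching dotPs elements
end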

This solution uses the vector ic to categorize each of the input values. After that, I go through each element and update pop(ic) and nn(ic). Note that the k-th entry of each result refers to the k-th unique value C(k), not to the value k itself.
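The same ic vector can also be fed to accumarray, which removes the explicit loop. The sketch below is an assumption-labelled combination of the two ideas rather than part of the original answer; the input values are made up, and they are deliberately not positive integers to show that this variant does not need that assumption.

```matlab
% Illustrative input with arbitrary (even negative) labels
indices = [10 -2 10; 7 7 -2];
dotPs   = [ 1  2  3; 4 5  6];

% C holds the sorted unique labels, ic maps every element to its slot in C
[C, ~, ic] = unique(indices);
ic = ic(:);                            % force column orientation for accumarray

pop = accumarray(ic, 1);               % occurrences of each C(k)      -> [2; 2; 2]
nn  = accumarray(ic, dotPs(:));        % sum of dotPs for each C(k)    -> [8; 9; 4]
```

Here C is [-2; 7; 10], so nn(1) is the sum belonging to the label -2, and so on.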

For computing pop, you can use hist. For computing nn, I couldn't find a smart solution (but I found one that does not use find):

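% Pass explicit bin centers 1..max so that bin k always counts the value k,
% even when the smallest value in indices is not 1
pop = hist(indices(:), 1:max(indices(:)));

nn = zeros(1, max(indices(:)));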
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
    nn(j) = sum(dotPs(indices == j));
end

There must be a better solution for computing nn.


I found a smarter solution that uses sorting.

I am not sure it is faster, because sorting 35,000*35,000 elements might take a long time.

  1. Sort indices, only to obtain the permutation needed to order dotPs by indices.
  2. Reorder dotPs according to the permutation returned by the previous sort.
  3. cumsumPop = cumulative sum of pop (the cumulative sum of the histogram of indices).
  4. cumsumPs = cumulative sum of the sorted dotPs.
  5. The values of cumsumPop can now be used as indices into cumsumPs.
     Because cumsumPs is a cumulative sum, we need to use diff to get the solution.

Here is the "smart" solution:

% Explicit bin centers 1..max keep bin k aligned with the value k
pop = hist(indices(:), 1:max(indices(:)));

[sortedIndices, I] = sort(indices(:));
sortedDotPs = dotPs(I);

cumsumPop = cumsum(pop);
cumsumPs = cumsum(sortedDotPs);

nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';
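Whether the sorting route is actually faster is best measured rather than guessed. The following is a rough sketch of such a comparison on random data; the sizes n and K, the variable names, and the use of accumarray for the counts are assumptions made for illustration. It also reports how far the two results differ, since taking differences of a long cumulative sum can lose a little floating-point precision. It assumes every value 1..K occurs at least once, because a leading zero count would produce an index of 0.

```matlab
n = 500;  K = 50;                       % illustrative sizes, far below 35,000
indices = randi(K, n, n);               % random labels in 1..K
dotPs   = rand(n, n);

% Variant 1: direct accumulation
tic;
nnAcc = accumarray(indices(:), dotPs(:), [K 1]).';
tAcc = toc;

% Variant 2: sort, cumulative sum, then difference at the group boundaries
tic;
popCol    = accumarray(indices(:), 1, [K 1]);    % counts per value (column)
[~, order] = sort(indices(:));                   % permutation that groups equal labels
cumsumPs  = cumsum(dotPs(order));                % running sum over the grouped data
nnSort    = diff([0; cumsumPs(cumsum(popCol))]).';
tSort = toc;

relErr = max(abs(nnAcc - nnSort)) / max(abs(nnAcc));
fprintf('accumarray: %.4f s, sort-based: %.4f s, relative difference: %.2e\n', ...
        tAcc, tSort, relErr);
```

Timings from a single tic/toc run are noisy, so repeating the measurement a few times at a size closer to the real data gives a more reliable picture.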

