Efficient way to generate histogram from very large dataset in MATLAB?
I have two 2D arrays, indices and dotPs, each up to 35,000*35,000 in size. From these, I want to create two 1D arrays such that pop contains the number of times each number appears in indices, and nn contains the sum of the elements in dotPs that correspond to those numbers. I have come up with the following (really dumb) way:
dotPs = [81.4285 9.2648 46.3184 5.7974 4.5016 2.6779 16.0092 41.1426;
9.2648 24.3525 11.4308 14.6598 17.9558 23.4246 19.4837 14.1173;
46.3184 11.4308 92.9264 9.2036 2.9957 0.1164 26.5770 26.0243;
5.7974 14.6598 9.2036 34.9984 16.2352 19.4568 31.8712 5.0732;
4.5016 17.9558 2.9957 16.2352 19.6595 16.0678 3.5750 16.7702;
2.6779 23.4246 0.1164 19.4568 16.0678 25.1084 6.6237 15.6188;
16.0092 19.4837 26.5770 31.8712 3.5750 6.6237 61.6045 16.6102;
41.1426 14.1173 26.0243 5.0732 16.7702 15.6188 16.6102 47.3289];
indices = [3 2 1 1 2 1 2 1;
2 2 1 2 2 1 2 2;
1 1 3 3 2 2 2 2;
1 2 3 4 3 3 4 2;
2 2 2 3 3 1 3 2;
1 1 2 3 1 8 2 2;
2 2 2 4 3 2 4 2;
1 2 2 2 2 2 2 2];
nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
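for k = 1:numel(uniqueInd)
    j = uniqueInd(k);
    idx = find(indices == j);   % linear indices of the elements equal to j
    if j == 0 || isempty(idx)
        continue
    end
    pop(j) = pop(j) + numel(idx);
    % dotPs(I,J) with row/column subscripts would select a whole submatrix,
    % so the matching elements are picked with linear indices instead
    nn(j) = nn(j) + sum(dotPs(idx));
end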
Because of the find function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?
Edit: added small dummy matrices for testing the code.
Both tasks can be done with the accumarray function:
pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';
This assumes that indices only contains positive integers.
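To see what the two accumarray calls compute, here is a minimal sketch on a tiny made-up input. The names subs, vals, popToy and nnToy and their values are illustrative only and do not come from the question. Each entry of subs names an output bin; the second argument is either a scalar 1, which counts occurrences, or a vector of values, which are summed per bin.

```matlab
% Hypothetical toy input: subs plays the role of indices(:), vals of dotPs(:)
subs = [1; 2; 1; 3];
vals = [10; 20; 30; 40];
n = max(subs);

% A scalar 1 is accumulated once per subscript, giving the occurrence count
popToy = accumarray(subs, 1, [n 1]).';     % [2 1 1]

% Values sharing a subscript are summed (the default accumulation function)
nnToy = accumarray(subs, vals, [n 1]).';   % [40 20 40]
```

Passing the size [n 1] explicitly keeps the output length fixed at the largest label, so labels that never occur simply get a zero.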
EDIT:
From comments, only the lower part of the indices matrix without the diagonal should be used, and it is guaranteed to contain positive integers. In that case:
mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask);
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';
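As a quick check of which elements the mask keeps, this sketch applies the same tril call to a small made-up 3-by-3 input (all Toy names and values are illustrative only). Logical indexing returns the selected elements in column-major order, and because the same mask is applied to both arrays, the pairing between indices and dotPs is preserved.

```matlab
% Hypothetical 3-by-3 input, values chosen only for illustration
indicesToy = [1 2 3;
              2 1 1;
              3 3 2];
dotPsToy   = [10 20 30;
              40 50 60;
              70 80 90];

% Strictly lower triangle: elements (2,1), (3,1) and (3,2)
maskToy = tril(true(size(indicesToy)), -1);

idxMasked = indicesToy(maskToy);   % [2; 3; 3]
valMasked = dotPsToy(maskToy);     % [40; 70; 80]

popToy = accumarray(idxMasked, 1, [max(idxMasked) 1]).';          % [0 1 2]
nnToy  = accumarray(idxMasked, valMasked, [max(idxMasked) 1]).';  % [0 40 150]
```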
First of all, note that the dimension of indices does not matter (e.g. if both indices and dotPs were 1D arrays or 3D arrays, the result would be the same).
pop can be calculated with the histcounts function, but since you also need to calculate the sum of the corresponding elements of the dotPs array, the problem becomes harder.
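For reference, the counting part with histcounts could look like the sketch below. It assumes indices holds positive integers and places the bin edges halfway between consecutive integers, so that each integer gets its own bin. The edges variable and the toy data are my own illustration, not part of the original answer.

```matlab
% Hypothetical toy input with positive integer labels
indicesToy = [3 1 1;
              2 1 3];

% Edges at 0.5, 1.5, ..., max+0.5 give one bin per integer 1..max
edges = 0.5 : 1 : (max(indicesToy(:)) + 0.5);
popToy = histcounts(indicesToy(:), edges);   % [3 1 2]
```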
Here is a possible solution with a for loop. The advantage of this solution is that I am not calling the find function in a loop, so it should be faster:
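% Example input
indices = randi(5,3,3);
dotPs = rand(3,3);
% Solution
[C,~,ic] = unique(indices);   % ic(i) is the position of indices(i) in C
nn = zeros(size(C));
pop = zeros(size(C));
for i = 1:numel(indices)
    pop(ic(i)) = pop(ic(i)) + 1;         % count occurrences
    nn(ic(i)) = nn(ic(i)) + dotPs(i);    % sum the matching dotPs values
end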
This solution uses the vector ic to categorize each of the input values. After that, I go through each element and update nn(ic(i)) and pop(ic(i)). Note that entry k of each result corresponds to the value C(k), not to the value k itself.
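The same grouping by ic can also be fed to accumarray instead of an explicit loop. This is a sketch that combines unique with the accumarray approach, not something taken from the answer above. One side benefit is that the labels no longer need to be positive integers, because ic always is. The toy values are illustrative only.

```matlab
% Hypothetical toy input; labels are not restricted to 1..N here
indicesToy = [7 0 7;
              5 0 7];
dotPsToy   = [1 2 3;
              4 5 6];

% C holds the sorted distinct labels; ic maps every element to its slot in C
[C, ~, ic] = unique(indicesToy);           % C = [0; 5; 7]

popToy = accumarray(ic(:), 1);             % counts per label in C: [2; 1; 3]
nnToy  = accumarray(ic(:), dotPsToy(:));   % sums per label in C:   [7; 4; 10]
```

As with the loop version, row k of popToy and nnToy belongs to the label C(k).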
For computing pop, you can use hist; for computing nn, I couldn't find a smart solution (but I found a solution without using find):
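% Explicit bin centers 1..max keep the counts right even if min(indices(:)) > 1
pop = hist(indices(:), 1:max(indices(:)));
nn = zeros(1, max(indices(:)));
uniqueInd = unique(indices);
for k = 1:numel(uniqueInd)
    j = uniqueInd(k);
    nn(j) = sum(dotPs(indices == j));   % logical indexing, no find needed
end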
There must be a better solution for computing nn.
I found a smarter solution applying sorting.
I am not sure it's faster, because sorting 35,000*35,000 elements might take a long time.
- Sort indices, just to get the permutation needed for sorting dotPs by indices.
- Sort dotPs according to the permutation returned by the previous sort.
- cumsumPop = cumulative sum of pop (the cumulative sum of the histogram of indices).
- cumsumPs = cumulative sum of the sorted dotPs.
Now the values of cumsumPop can be used as indices into cumsumPs.
Because cumsumPs is a cumulative sum, we need to use diff to get the solution.
Here is the "smart" solution:
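pop = hist(indices(:), 1:max(indices(:)));   % counts per value 1..max
[sortedIndices, I] = sort(indices(:));       % I is the permutation that sorts indices
sortedDotPs = dotPs(I);                      % dotPs grouped by index value
cumsumPop = cumsum(pop);                     % last position of each group
cumsumPs = cumsum(sortedDotPs);
% Assumes the value 1 occurs in indices, so that cumsumPop(1) > 0
nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';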
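To make the sort and cumsum steps concrete, here is a small worked sketch on a made-up 2-by-2 input. All names and values are illustrative; it uses accumarray for the counts instead of hist, and it assumes every label from 1 to max(indices(:)) that comes before the first occurring one is present, since otherwise groupEnds would start with 0, which is not a valid index.

```matlab
% Hypothetical 2-by-2 input, values chosen only for illustration
indicesToy = [2 1;
              1 3];
dotPsToy   = [10 20;
              30 40];

popToy = accumarray(indicesToy(:), 1).';      % [2 1 1]
[~, order] = sort(indicesToy(:));             % sort is stable: order = [2; 3; 1; 4]
sortedVals = dotPsToy(order);                 % [30; 20; 10; 40]

groupEnds = cumsum(popToy);                   % [2 3 4], last position of each group
runningSum = cumsum(sortedVals);              % [30; 50; 60; 100]
nnToy = diff([0; runningSum(groupEnds)]).';   % [50 10 40]
```

Each entry of nnToy is the running total at the end of a group minus the running total at the end of the previous group, which is exactly the sum of that group's dotPs values.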