
What is the fastest way to calculate the frequency distribution of an array in C#?

I am wondering what the best approach is for this calculation. Assume I have an input array of values and an array of boundaries. I want to calculate (bucketize) the frequency distribution for each segment in the boundaries array.

Is it a good idea to use bucket search for that?

I did find the question "Calculating frequency distribution of a collection with .Net/C#".

But I do not understand how to use buckets for that purpose, because the size of each bucket can be different in my situation.

EDIT: After all the discussion I have an inner/outer loop solution, but I still want to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, I need to hash input values into a bucket index, so we need some sort of hash function with O(1) complexity. Any ideas how to do it?

Bucket Sort is already O(n^2) worst case, so I would just do a simple inner/outer loop here. Since your bucket array is normally shorter than your input array, keep it on the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.

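// buckets holds the ascending bucket edges; bucket i covers [buckets[i], buckets[i+1])
int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    // scan the edges until the bucket containing d is found
    for (int i = 0; i < buckets.Length - 1; i++)
    {
        if (d >= buckets[i] && d < buckets[i + 1])
        {
            freq[i]++;
            break;
        }
    }
}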

It's also O(n^2) worst case (one pass over the buckets for every input value), but you can't beat the code simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But since frequency distributions typically have fewer than 100 buckets, I doubt you'd see much real-world performance benefit.
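As a rough illustration of the binary search idea, here is a minimal sketch using Array.BinarySearch. The class name Histogram and the method Count are made up for this example. It assumes the edges array is sorted ascending, has no duplicates, and contains at least two entries. Values outside the edge range are ignored.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class Histogram
{
    // edges: ascending, no duplicates; bucket i covers [edges[i], edges[i+1]).
    public static int[] Count(int[] input, int[] edges)
    {
        int[] freq = new int[edges.Length - 1];
        foreach (int d in input)
        {
            // Exact hit: index of the matching edge.
            // Miss: bitwise complement of the index of the first larger edge.
            int idx = Array.BinarySearch(edges, d);
            int bucket = idx >= 0 ? idx : ~idx - 1;

            // Skip values below the first edge or at/above the last edge.
            if (bucket >= 0 && bucket < freq.Length)
            {
                freq[bucket]++;
            }
        }
        return freq;
    }
}
```

For example, Histogram.Count(new[] { 5, 10, 19, 20, 49, 50, -1, 7 }, new[] { 0, 10, 20, 50 }) returns { 2, 2, 2 }: the values -1 and 50 fall outside the edges and are not counted. Each lookup costs O(log m) for m edges, so the whole pass is O(n log m) instead of O(n * m).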

If your input array represents real-world data (with its patterns) and the array of boundaries is too large to iterate over again and again in an inner loop, you can consider the following approach:

  • First, sort your input array. If you work with real-world data, I would recommend considering Timsort (see Wikipedia) for this. It provides very good performance guarantees for the patterns that can be seen in real-world data.

  • Traverse the sorted array and compare each value with the current value in the array of boundaries, starting from the first one:

    • If the value in the input array is less than the boundary, increment the frequency counter for this boundary.
    • If the value in the input array is not less than the boundary, move forward in the array of boundaries until it is, then increment the counter for that new boundary.

In code it can look like this:

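Timsort(myArray); // any sort works here, e.g. Array.Sort(myArray)
int[] boundaries = GetBoundaries(); // assume ascending upper bounds (exclusive), one per bucket
int[] counts = new int[boundaries.Length]; // counts[k] = values below boundaries[k] and not below boundaries[k-1]
int boundPos = 0;

for (int i = 0; i < myArray.Length; i++) {
  // skip every boundary that the current value has already reached
  while (boundPos < boundaries.Length && myArray[i] >= boundaries[boundPos]) {
    boundPos++;
  }
  if (boundPos == boundaries.Length) {
    break; // the remaining values lie at or above the last boundary
  }
  counts[boundPos]++;
}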
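For comparison, the same sort-then-sweep idea can be wrapped in a self-contained method. This is only a sketch under a few assumptions: the names SortedSweep and Count are invented here, Array.Sort stands in for Timsort, and the edges follow the convention of the first answer (ascending, at least two entries, bucket i covers [edges[i], edges[i+1])).

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class SortedSweep
{
    public static int[] Count(int[] input, int[] edges)
    {
        int[] sorted = (int[])input.Clone(); // leave the caller's array untouched
        Array.Sort(sorted);                  // stand-in for Timsort

        int[] freq = new int[edges.Length - 1];
        int bucket = 0;
        foreach (int d in sorted)
        {
            if (d < edges[0])
            {
                continue; // below the first edge
            }
            // Advance until d is below the current bucket's upper edge.
            while (bucket < freq.Length && d >= edges[bucket + 1])
            {
                bucket++;
            }
            if (bucket == freq.Length)
            {
                break; // every remaining value is at or above the last edge
            }
            freq[bucket]++;
        }
        return freq;
    }
}
```

SortedSweep.Count(new[] { 5, 10, 19, 20, 49, 50, -1, 7 }, new[] { 0, 10, 20, 50 }) returns { 2, 2, 2 }. The sweep itself touches each value and each edge at most once, but the sort costs O(n log n), so this is mainly attractive when the input is already sorted or nearly sorted.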
