
How can I search a C# dictionary using a range of keys?

I have a dictionary with data resembling this (the dictionary will have about 100k entries):

[1] -> 5
[7] -> 50
[30] -> 3
[1000] -> 1
[100000] -> 35

I also have a list of ranges (about 1,000):

MyRanges
    Range
        LowerBoundInclusive -> 0
        UpperBoundExclusive -> 10
        Total
    Range
        LowerBoundInclusive -> 10
        UpperBoundExclusive -> 50
        Total
    Range
        LowerBoundInclusive -> 100
        UpperBoundExclusive -> 1000
        Total
    Range
        LowerBoundInclusive -> 1000
        UpperBoundExclusive -> 10000
        Total
    Range (the "other" range)
        LowerBoundInclusive -> null
        UpperBoundExclusive -> null
        Total
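A range as described above might be modeled with a small class like this sketch (the class name, the nullable bounds, and the `Contains` helper are assumptions based on the fields listed; `null` bounds mark the catch-all "other" range):

```csharp
public class Range
{
    public int? LowerBoundInclusive { get; set; }
    public int? UpperBoundExclusive { get; set; }
    public long Total { get; set; }

    // True when the key falls in [LowerBoundInclusive, UpperBoundExclusive).
    // The "other" range (null bounds) matches nothing here; its total is
    // computed from the leftovers instead.
    public bool Contains(int key) =>
        LowerBoundInclusive.HasValue && UpperBoundExclusive.HasValue &&
        key >= LowerBoundInclusive.Value && key < UpperBoundExclusive.Value;
}
```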

I need to calculate the total present in the dictionary for each of these ranges. For example, the total for the range 0-10 would be 55. These ranges can get really large, so I know it doesn't make sense to just search the dictionary for every value between the two bounds. My hunch is that I should get a list of keys from the dictionary, sort it, then loop through my ranges and do some sort of search to find all the keys within each range. Is this the correct way to do this? Is there an easy way to do it?

Edit: Thanks for the responses. Real clever stuff. I forgot one pretty important caveat, though. There is no guarantee that the ranges are contiguous, and the final range is everything not in the other ranges.

You could do something like this:

// Associate each value with the range of its key
var lookup = dictionary.ToLookup(
    kvp => ranges.FirstOrDefault(r => r.LowerBoundInclusive <= kvp.Key
                              && r.UpperBoundExclusive > kvp.Key),
    kvp => kvp.Value);

// Compute the total of values for each range
foreach (var r in ranges)
{
    r.Total = lookup[r].Sum();
}

(Note: this solution doesn't take your edit into account; it doesn't handle non-contiguous ranges or the "others" range.)

However, it's not very efficient if you have many ranges, since they are enumerated for each entry in the dictionary. You can get better results if you sort the dictionary by key first.

Here's a possible implementation:

// We're going to need finer control over the enumeration than foreach,
// so we manipulate the enumerator directly instead.
using (var dictEnumerator = dictionary.OrderBy(e => e.Key).GetEnumerator())
{
    // No point in going any further if the dictionary is empty
    if (dictEnumerator.MoveNext())
    {
        long othersTotal = 0; // total for items that don't fall in any range

        // The ranges need to be in ascending order
        // We want the "others" range at the end
        foreach (var range in ranges.OrderBy(r => r.LowerBoundInclusive ?? int.MaxValue))
        {
            if (range.LowerBoundInclusive == null && range.UpperBoundExclusive == null)
            {
                // this is the "others" range: use the precalculated total
                // of previous items that didn't fall in any other range
                range.Total = othersTotal;
            }
            else
            {
                range.Total = 0;
            }

            int lower = range.LowerBoundInclusive ?? int.MinValue;
            int upper = range.UpperBoundExclusive ?? int.MaxValue;

            bool endOfDict = false;
            var entry = dictEnumerator.Current;


            // keys that are below the current range don't belong to any range
            // (or they would have been included in the previous range)
            while (!endOfDict && entry.Key < lower)
            {
                othersTotal += entry.Value;
                endOfDict = !dictEnumerator.MoveNext();
                if (!endOfDict)
                    entry = dictEnumerator.Current;
            }

            // while the key is in the range, keep adding the values
            while (!endOfDict && lower <= entry.Key && upper > entry.Key)
            {
                range.Total += entry.Value;
                endOfDict = !dictEnumerator.MoveNext();
                if (!endOfDict)
                    entry = dictEnumerator.Current;
            }

            if (endOfDict) // No more entries in the dictionary, no need to go further
                break;

            // the value of the current entry is now outside the range,
            // so carry on to the next range
        }
    }
}

(Updated to take your edit into account: it works with non-contiguous ranges, and adds items that don't fall in any range to the "others" range.)

I didn't run any benchmarks, but it's probably pretty fast, since the dictionary and the ranges are each enumerated only once.

Obviously, if the ranges are already sorted you don't need the OrderBy on ranges.

Consider using a sorted List<T> and its BinarySearch method. If you have many queries, then each of them can be answered in O(log n), giving O(q log n) total time complexity, where n is the number of entries and q the number of queries:

// data is a sorted List<int> of the dictionary's keys

foreach (var range in ranges)                             // O(q)
{
    int lowerBoundIndex = data.BinarySearch(range.Start); // O(logn)
    lowerBoundIndex = lowerBoundIndex < 0
        ? ~lowerBoundIndex
        : lowerBoundIndex;

    int upperBoundIndex = data.BinarySearch(range.End);   // O(logn)
    upperBoundIndex = upperBoundIndex < 0
        ? ~upperBoundIndex - 1
        : upperBoundIndex;

    var count = (upperBoundIndex >= lowerBoundIndex)
        ? (upperBoundIndex - lowerBoundIndex + 1)
        : 0;

    // print/store count for range
}

For the dictionary case, the complexity is on average O(q*l), where q is the number of queries (as above) and l is the average length of the queried range. So the sorted-list approach will be better if the ranges are large.

Anyway, for 100k entries you should use a database, as suggested by pswg in the comments.

You are absolutely right: the dictionary is not the right data structure for this task.

Your idea about what to do is also right. You can improve it with some preprocessing to get the execution time down to O((N + Q) log N), where N is the number of items in the original dictionary and Q is the number of queries you need to run.

Here is the idea: get the items from your dictionary into a flat list and sort it. Then preprocess the list by storing the running total in the corresponding node. Your list would end up looking like this:

  • 0 -> 0 (implicit sentinel value)
  • 1 -> 5 -- 5
  • 7 -> 55 -- 50 + 5
  • 30 -> 58 -- 3 + 50 + 5
  • 1000 -> 59 -- 1 + 3 + 50 + 5
  • 100000 -> 94 -- 35 + 1 + 3 + 50 + 5

With the preprocessed list in hand, you can run two binary searches on the key list (i.e. the {1, 7, 30, 1000, 100000} list) for the two ends of the query; take the total at the found point if there is an exact match, or at the point just before it if there isn't; subtract the sum at the lower point from the sum at the upper point; and use that as the answer to your query.

For example, if you see the query {0, 10} you process it like this:

  • Binary search on 0, get the sentinel value of 0
  • Binary search on 10, get the value of 55 at key 7 (no exact match on 10)
  • Subtract 0 from 55 for an answer of 55.

For the query {11, 1000} you do this:

  • Search 11, get key 7 with the value of 55
  • Search 1000, get key 1000 with the value of 59
  • Subtract: 59 - 55 = 4 for the answer to the query.
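The preprocessing and query steps above can be sketched like this (`Build` and `RangeSum` are hypothetical helper names; bounds are treated as [lower, upper) to match the question's inclusive/exclusive convention):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PrefixSums
{
    // Sorts the keys and stores running totals: sums[i] is the total of
    // all values whose key is <= keys[i].
    public static (int[] Keys, long[] Sums) Build(Dictionary<int, int> dict)
    {
        var keys = dict.Keys.OrderBy(k => k).ToArray();
        var sums = new long[keys.Length];
        long running = 0;
        for (int i = 0; i < keys.Length; i++)
        {
            running += dict[keys[i]];
            sums[i] = running;
        }
        return (keys, sums);
    }

    // Total of all values whose key is in [lower, upper),
    // answered with two binary searches.
    public static long RangeSum(int[] keys, long[] sums, int lower, int upper)
    {
        // Sum of all values whose key is strictly below 'bound'
        long SumBelow(int bound)
        {
            int idx = Array.BinarySearch(keys, bound);
            if (idx < 0) idx = ~idx;          // index of first key >= bound
            return idx == 0 ? 0 : sums[idx - 1];
        }
        return SumBelow(upper) - SumBelow(lower);
    }
}
```

With the example dictionary, `RangeSum(keys, sums, 0, 10)` yields 55, matching the worked example above.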

The low-tech approach might be the better approach here. I'm going to make a possibly invalid assumption that your dictionary doesn't change very often; basically, that queries are much more frequent than dictionary or range modifications. So you can create and cache a list of the dictionary's keys, refreshing it as required if the dictionary is modified. So, given:

List<KeyType> keys = dict.Keys.OrderBy(k => k).ToList();
List<RangeType> ranges = rangeList.OrderBy(r => r.LowerBound).ToList();

var iKey = 0;
var iRange = 0;
var count = 0;
var notFoundKeys = new List<KeyType>(); // keys that don't fall in any range
// do a merge
while (iKey < keys.Count && iRange < ranges.Count)
{
    if (keys[iKey] < ranges[iRange].LowerBound)
    {
        // key is smaller than current range's lower bound
        // move to next key

        // here you could add this key to the list of keys not found in any range
        ++iKey;
    }
    else if (keys[iKey] > ranges[iRange].UpperBound)
    {
        // key is larger than current range's upper bound
        // move to next range
        ++iRange;
    }
    else
    {
        // key is within this range
        ++count;
        // add key to list of keys in this range
        ++iKey;
    }
}
// If there are leftover keys, then add them to the list of keys not found in a range
while (iKey < keys.Count)
{
    notFoundKeys.Add(keys[iKey]);
    ++iKey;
}

Note that this assumes non-overlapping ranges.

This algorithm is O(n), where n is the number of keys in the dictionary.

That might seem expensive, but we're only talking about 100,000 comparisons, which will be very fast on modern hardware. The beauty of this approach is that it's dead simple to implement, and it could very well be fast enough for your purposes. It's worth trying; if it's too slow, then you can look at optimization.

An obvious optimization is to binary search for the lower and upper bounds to get the indexes of the items that fit the range. That algorithm's complexity is O(q log n), where q is the number of queries. log2(100000) is approximately 16.6. It takes two binary searches per query, so looking up 1,000 ranges will require about 33,200 key comparisons: one-third as many as with the sequential algorithm presented above.

That algorithm would look something like this:

foreach (var range in ranges)
{
    int firstIndex = keys.BinarySearch(range.LowerBound);

    // See explanation below
    if (firstIndex < 0) firstIndex = ~firstIndex;

    int lastIndex = keys.BinarySearch(range.UpperBound);
    if (lastIndex < 0) lastIndex = ~lastIndex-1;

    // Guard: firstIndex may be past the end, and lastIndex may be -1,
    // when the range lies entirely outside the keys
    if (firstIndex < keys.Count && lastIndex >= firstIndex)
        count += 1 + (lastIndex - firstIndex);
}

List<T>.BinarySearch returns the bitwise complement of the index of the next larger element when the item isn't found. The code above adjusts the returned indexes accordingly, so they bracket the items that are within range.
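As a small illustration of that convention, using the example dictionary's keys:

```csharp
using System.Collections.Generic;

var keys = new List<int> { 1, 7, 30, 1000, 100000 };

int found = keys.BinarySearch(30);    // exact match: returns index 2
int missing = keys.BinarySearch(10);  // not found: returns ~2 (i.e. -3),
                                      // the complement of the index of 30,
                                      // the next larger element
int firstAtLeast10 = ~missing;        // 2: first index whose key is >= 10
```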

Adding the keys not found to a list would involve keeping track of the last key found for each range, then adding everything between it and the first key found for the next range to the list of not-found keys. It's a fairly simple modification of the code above.

A possible optimization to this algorithm is to use the BinarySearch overload that lets you specify the starting index. After all, if you've already determined that the range 0-50 ends at index 27, there's no point searching below index 27 for the range 51-100. That simple optimization could negate the advantage of the sequential search discussed below.
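That optimization might look like the following sketch (`CountInRanges` is a hypothetical helper; it assumes sorted keys, sorted non-overlapping ranges, and an inclusive UpperBound, matching the code above):

```csharp
using System;
using System.Collections.Generic;

static class RangeCounter
{
    // Counts keys falling in each [Lower, Upper] range, reusing the end of
    // the previous range as the floor for the next search so each
    // BinarySearch only scans the remaining tail of the list.
    public static long CountInRanges(List<int> keys, List<(int Lower, int Upper)> ranges)
    {
        long count = 0;
        int searchStart = 0;
        foreach (var (lower, upper) in ranges)
        {
            int firstIndex = keys.BinarySearch(searchStart, keys.Count - searchStart, lower, null);
            if (firstIndex < 0) firstIndex = ~firstIndex;

            int lastIndex = keys.BinarySearch(firstIndex, keys.Count - firstIndex, upper, null);
            if (lastIndex < 0) lastIndex = ~lastIndex - 1;

            if (firstIndex < keys.Count && lastIndex >= firstIndex)
                count += lastIndex - firstIndex + 1;

            // Never search below the end of this range again
            searchStart = Math.Max(searchStart, lastIndex + 1);
        }
        return count;
    }
}
```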

Although algorithm analysis says that this should be faster, it doesn't take into account the overhead involved in setting up each binary search, or the non-sequential memory access that can be a performance killer due to cache misses. My experiments comparing binary search to sequential search in C# (using List<T>.BinarySearch) show that sequential search is faster when the list size is less than about 10 items, although that depends somewhat on how expensive key comparisons are. On average, though, I found binary search overhead to cost me 5 to 10 key comparisons. You have to take that into account when considering which algorithm will be faster.

If the number of ranges is small, the binary search algorithm will be the clear winner, but it becomes more expensive as the number of ranges grows. At some point the sequential search algorithm, whose running time is nearly constant regardless of the number of ranges, will be faster than the binary search algorithm. Where that point is, exactly, is unclear. We know that it's somewhere below 3,000 ranges, because n/(2*log2(n)) is equal to 3,012.

Again, since you're talking about relatively small numbers, either algorithm will likely perform quite well for you. If you're hitting this thing hundreds or thousands of times per second, you'll want to do a detailed analysis and time execution with representative data and varying numbers of ranges. If you're hitting it infrequently, just put in something that works and worry about optimization if it becomes a performance problem.
