
Data structure to build and look up a set of integer ranges

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable order.

I need to:

  1. Compress this set into ranges to achieve a space-efficient representation. I have already implemented this using a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation the number of resulting ranges is typically within 5,000-10,000; many of them are single-item, of course.

  2. Test membership of some integer; information about the specific range in the set is not required. This one must be very fast: O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, have O(log n) complexity, and the worst thing about them is that the implementation makes many conditional jumps that the processor cannot predict well, giving poor performance.
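For reference, the "trivial algorithm" of step 1 might look like the following minimal sketch. The function name `compress_to_ranges` and the inclusive `(lo, hi)` tuple representation are assumptions made for illustration; the question does not show its actual implementation.

```python
def compress_to_ranges(values):
    """Collapse an iterable of ints into sorted, disjoint, inclusive (lo, hi) ranges."""
    ranges = []
    for v in sorted(set(values)):  # the input order is unpredictable, so sort first
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)  # extend the current run by one
        else:
            ranges.append((v, v))  # start a new run (possibly a single-item range)
    return ranges


print(compress_to_ranges([7, 3, 1, 2, 10, 8, 9, 20]))
# prints [(1, 3), (7, 10), (20, 20)]
```

The sort dominates the cost at O(n log n), which is acceptable here because the ranges are computed only once.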

Is there any data structure or algorithm specialized in integer ranges to solve this task?

Regarding the second issue:

You could look up Bloom filters. Bloom filters are specifically designed to answer the membership question in O(1), though the response is either "no" or "maybe" (which is not as clear-cut as a yes/no :p).

In the "maybe" case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom filter may act as a gatekeeper and reject most of the queries outright.

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

  • single elements may be best stored in a hash table
  • actual ranges can be stored in a sorted array

This reduces the number of elements stored in the sorted array, and thus the cost of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 real ranges (i.e., an order of magnitude fewer), and log2(1000) ~ 10.

I would therefore suggest the following steps:

  • Bloom filter: if no, stop
  • Sorted array of real ranges: if yes, stop
  • Hash table of single elements

The sorted array test is performed before the hash table lookup because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
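The three steps above could be sketched as follows. This is only an illustration under stated assumptions: the class names `BloomFilter` and `RangeSet`, the filter size, the number of hash functions, and the use of `hashlib.blake2b` with double hashing are all choices made for this example, not something the answer prescribes.

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from the two halves of one digest (double hashing).
        digest = hashlib.blake2b(value.to_bytes(4, "little"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, value):
        # False means "definitely absent"; True only means "maybe present".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))


class RangeSet:
    def __init__(self, ranges, num_bits=1 << 16, num_hashes=4):
        # ranges: disjoint inclusive (lo, hi) pairs of uint32 values
        multi = sorted(r for r in ranges if r[0] != r[1])
        self.starts = [lo for lo, _ in multi]   # sorted array of real ranges
        self.ends = [hi for _, hi in multi]
        self.singles = {lo for lo, hi in ranges if lo == hi}  # hash table of singles
        self.bloom = BloomFilter(num_bits, num_hashes)
        for lo, hi in ranges:                   # every member goes into the filter
            for v in range(lo, hi + 1):
                self.bloom.add(v)

    def __contains__(self, value):
        if not self.bloom.might_contain(value):
            return False                        # step 1: Bloom filter says no, stop
        i = bisect.bisect_right(self.starts, value) - 1
        if i >= 0 and value <= self.ends[i]:
            return True                         # step 2: found in a real range, stop
        return value in self.singles            # step 3: hash table of single elements


rs = RangeSet([(1, 3), (7, 10), (20, 20)])
print(2 in rs, 20 in rs, 5 in rs)  # prints True True False
```

A false positive from the filter never produces a wrong answer here, because a "maybe" always falls through to the exact structures. Whether the filter pays for itself depends on how many queries are for absent values.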

One last note: beware of O(1). While it may seem appealing, you are not in an asymptotic case here. A mere 5,000-10,000 ranges is few, as log2(10000) is something like 13. So don't pessimize your implementation by getting an O(1) solution with such a high constant factor that it actually runs slower than an O(log N) solution :)

If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.

The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] < [c, d] iff b < c. This is a strict total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the last slot. This construction takes O(n lg n) time.

To check whether some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
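The search just described can be written as a short loop. This is a minimal sketch that assumes the ranges are already sorted, disjoint, and stored as inclusive `(start, end)` tuples; the function name `contains` is hypothetical.

```python
def contains(ranges, value):
    """Binary search over sorted, disjoint, inclusive (start, end) ranges."""
    lo, hi = 0, len(ranges) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = ranges[mid]
        if value < start:
            hi = mid - 1   # value lies to the left of the middle interval
        elif value > end:
            lo = mid + 1   # value lies to the right of the middle interval
        else:
            return True    # start <= value <= end
    return False


ranges = [(1, 3), (7, 10), (20, 20)]
print(contains(ranges, 8), contains(ranges, 5))  # prints True False
```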

Hope this helps!

AFAIK there is no algorithm that searches an integer list in O(1).

One can only do an O(1) search with a vast amount of memory.

So trying to find an O(1) search algorithm over a list of integer ranges is not very promising.

On the other hand, you could try a time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).

You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).

The easiest of these to implement is probably y-fast trees. They are probably faster than doing binary search, though they require roughly lg w = lg 32 = 5 hash table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, so binary search may be faster.

Rather than 'comparison' based storage/retrieval (which will always be O(log(n))), you need to work on 'radix' based storage/retrieval.

In other words: extract nibbles from the uint32, and make a trie.
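As a rough illustration of the nibble trie idea, here is a minimal sketch that stores individual uint32 values in eight levels of 16-way nodes. The names `new_node`, `trie_insert`, and `trie_contains` are hypothetical, and the sketch does not collapse fully covered subtrees, which a range-aware version would probably want to do.

```python
def new_node():
    return [None] * 16  # one slot per possible nibble value


def trie_insert(root, value):
    node = root
    for shift in range(28, 0, -4):        # the top seven nibbles select inner nodes
        nib = (value >> shift) & 0xF
        if node[nib] is None:
            node[nib] = new_node()
        node = node[nib]
    node[value & 0xF] = True              # the last nibble marks membership in the leaf


def trie_contains(root, value):
    node = root
    for shift in range(28, 0, -4):
        node = node[(value >> shift) & 0xF]
        if node is None:
            return False                  # no stored value shares this prefix
    return node[value & 0xF] is True


root = new_node()
for v in (0, 0x12345678, 0xFFFFFFFF):
    trie_insert(root, v)
print(trie_contains(root, 0x12345678), trie_contains(root, 0x12345679))
# prints True False
```

A lookup always touches at most eight nodes regardless of how many values are stored, so the cost depends on the word size rather than on n.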

Keep your ranges in a sorted array and use binary search for lookups.

It's easy to implement, O(log N), and uses less memory and needs fewer memory accesses than any tree-based approach, so it will probably also be much faster.

From the description of your problem, it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.

Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible objects:

  • a NONE object means no elements beginning with those 16 bits are in the set
  • an ALL object means all elements beginning with those 16 bits are in the set
  • a RANGE object means all items with the final 16 bits between a lower and an upper bound are in the set
  • a SINGLE object means just one element beginning with those 16 bits is in the set
  • a BITSET object handles all remaining cases with a 65536-bit bitset

Of course, you don't need to split at 16 bits; you can adjust the split to reflect the statistics of your set. In fact, you don't need to use consecutive bits, but doing so speeds up the bit twiddling, and if many of your elements are consecutive, as you claim, it will give good properties.
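A minimal sketch of this two-level table is shown below. Tagged tuples stand in for the five object types; the names `build_table` and `table_contains` are hypothetical, and building from a flat collection of values (rather than from precomputed ranges) is an assumption made to keep the example short.

```python
def build_table(values):
    """Map each high 16-bit prefix to a NONE/ALL/RANGE/SINGLE/BITSET entry."""
    buckets = {}
    for v in values:
        buckets.setdefault(v >> 16, set()).add(v & 0xFFFF)
    table = [("NONE",)] * 65536                  # default: nothing under this prefix
    for high, lows in buckets.items():
        lo, hi = min(lows), max(lows)
        if len(lows) == 65536:
            table[high] = ("ALL",)
        elif len(lows) == 1:
            table[high] = ("SINGLE", lo)
        elif hi - lo + 1 == len(lows):           # the low halves form one contiguous run
            table[high] = ("RANGE", lo, hi)
        else:
            bits = bytearray(8192)               # 65536 bits
            for low in lows:
                bits[low >> 3] |= 1 << (low & 7)
            table[high] = ("BITSET", bits)
    return table


def table_contains(table, value):
    entry = table[value >> 16]                   # one array index on the high 16 bits
    kind, low = entry[0], value & 0xFFFF
    if kind == "NONE":
        return False
    if kind == "ALL":
        return True
    if kind == "SINGLE":
        return low == entry[1]
    if kind == "RANGE":
        return entry[1] <= low <= entry[2]
    return bool(entry[1][low >> 3] & (1 << (low & 7)))  # BITSET


table = build_table([5, 0x20010, 0x20011, 0x20012, 0x30001, 0x30005])
print(table[0][0], table[2][0], table[3][0])  # prints SINGLE RANGE BITSET
```

Every lookup is one array index plus at most one comparison pair or bit test, which is O(1). The memory cost is driven by how many prefixes end up needing an 8 KiB bitset.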

Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a time/space tradeoff. If you need to save memory, then make the tree deeper, with a corresponding slight increase in lookup time.

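Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.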
