Binary search for multiple distinct numbers in a large array in minimum number of comparisons

I have a large array of size n (say n = 1000000) with values monotonically non-decreasing. I have a set of k key values (say k = {1, 23, 39, 55, ..}). Assume the key values are sorted. I have to find the index of each of these key values in the large array using the minimum number of comparisons. How do I use binary search to search for multiple unique values? Doing it separately for each key value takes a lot of comparisons. Can I somehow reuse knowledge I gained in one search when I search for another element in the same big array?

  1. Sort the needles (the values you will search for).
  2. Create an array of the same length as the needles, with each element being a pair of indexes. Initialize each pair with {0, len(haystack)}. These pairs represent all the knowledge we have of the possible locations of the needles.
  3. Look at the middle value in the haystack. Now do a binary search for that value in your needles. For all lesser needles, set the upper bound (in the array from step 2) to the current haystack index. For all greater needles, set the lower bound.
  4. While you are doing step 3, keep track of which needle now has the largest range remaining. Bisect that range and use its midpoint as your new middle value to repeat step 3. If the largest remaining range has shrunk to a single position, you're done: all needles have been found (or, if not found, their prospective location in the haystack is now known).

There may be some slight complication here when you have duplicate values in the haystack, but I think once you have the rest sorted out this should not be too difficult.
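The steps above could be sketched roughly as follows in JavaScript. This is only a minimal sketch under my own assumptions: the needles are already sorted, "index" means the lower bound (the first haystack position whose value is >= the needle), and the name `multiLowerBound` and the `lo`/`hi` arrays are illustrative, not from any library. It rescans all needles each round to find the widest range, so it targets few haystack probes rather than low bookkeeping cost.

```javascript
// Sketch: narrow a [lo, hi] window for every needle using shared haystack probes.
// Assumes `haystack` and `needles` are both sorted ascending.
function multiLowerBound(haystack, needles) {
  const k = needles.length;
  const lo = new Array(k).fill(0);               // smallest possible answer per needle
  const hi = new Array(k).fill(haystack.length); // largest possible answer per needle
  for (;;) {
    // Step 4: find the needle whose remaining range is widest.
    let widest = -1;
    let width = 0;
    for (let i = 0; i < k; i++) {
      if (hi[i] - lo[i] > width) {
        width = hi[i] - lo[i];
        widest = i;
      }
    }
    if (widest < 0) return lo; // every range has collapsed: lo[i] is the answer

    // Step 3: probe the middle of that range...
    const mid = (lo[widest] + hi[widest]) >> 1;
    const v = haystack[mid];

    // ...and binary-search the needles for the first needle greater than v.
    let a = 0;
    let b = k;
    while (a < b) {
      const m = (a + b) >> 1;
      if (needles[m] <= v) a = m + 1;
      else b = m;
    }
    // Needles <= v cannot land after mid; needles > v must land after mid.
    for (let i = 0; i < a; i++) hi[i] = Math.min(hi[i], mid);
    for (let i = a; i < k; i++) lo[i] = Math.max(lo[i], mid + 1);
  }
}
```

For example, `multiLowerBound([1, 3, 5, 7, 9, 11], [0, 3, 8, 12])` returns the insertion positions for all four needles in one pass of shared probes.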


I was curious if NumPy implemented anything like this. The Python name for what you're doing is numpy.searchsorted(), and once you get through the API layers it comes to this:

    /*
     * Updating only one of the indices based on the previous key
     * gives the search a big boost when keys are sorted, but slightly
     * slows down things for purely random ones.
     */
    if (@TYPE@_LT(last_key_val, key_val)) {
        max_idx = arr_len;
    }
    else {
        min_idx = 0;
        max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
    }

So they do not do a full-blown optimization like I described, but they do track when the current needle is greater than the last needle, so they can avoid searching the haystack below where the last needle was found. This is a simple and elegant improvement over the naive implementation, and as seen from the comments, it must be kept simple and fast because the function does not require the needles to be sorted in the first place.
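To make the idea in that snippet concrete, here is a small JavaScript sketch loosely modeled on it. It is not NumPy's code or API: the function name `searchSortedReusingBounds` and the variable names are my own, and it only covers the "left" (lower bound) case.

```javascript
// Sketch: reuse the previous key's result to shrink the next search window.
// `arr` must be sorted ascending; `keys` may be in any order.
function searchSortedReusingBounds(arr, keys) {
  const out = new Array(keys.length);
  if (keys.length === 0) return out;
  let minIdx = 0;
  let maxIdx = arr.length;
  let lastKey = keys[0];
  for (let j = 0; j < keys.length; j++) {
    const key = keys[j];
    if (lastKey < key) {
      // Key grew: the answer cannot be left of the previous answer (minIdx),
      // so only the upper end is reset.
      maxIdx = arr.length;
    } else {
      // Key did not grow: the answer cannot be right of the previous answer.
      minIdx = 0;
      maxIdx = maxIdx < arr.length ? maxIdx + 1 : arr.length;
    }
    lastKey = key;
    // Plain lower-bound binary search inside [minIdx, maxIdx).
    while (minIdx < maxIdx) {
      const mid = minIdx + ((maxIdx - minIdx) >> 1);
      if (arr[mid] < key) minIdx = mid + 1;
      else maxIdx = mid;
    }
    out[j] = minIdx; // minIdx === maxIdx here, which seeds the next iteration
  }
  return out;
}
```

With sorted keys every search starts where the previous one ended; with unsorted keys it degrades gracefully to roughly ordinary binary search.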


By the way: my proposed solution aimed for something like theoretical optimality in big-O terms, but if you have a large number of needles, the fastest thing to do is probably to sort the needles, then iterate over the entire haystack and all the needles in tandem: linear-search for the first needle, then resume from there to look for the second, and so on. You can even skip every second item in the haystack by recognizing that if a needle is greater than A and less than C, it must belong at position B (assuming you don't care about the left/right insertion order for needles not in the haystack). You can then do about len(haystack)/2 comparisons, and the entire thing will be very cache-friendly (after sorting the needles, of course).
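A minimal sketch of the tandem walk, assuming sorted needles and lower-bound results. The name `tandemSearch` is hypothetical, and this version does not include the skip-every-second-item trick; it simply advances one haystack cursor that never moves backwards.

```javascript
// Sketch: merge-style walk over a sorted haystack and sorted needles.
function tandemSearch(haystack, needles) {
  const out = new Array(needles.length);
  let i = 0; // haystack cursor, shared by all needles
  for (let j = 0; j < needles.length; j++) {
    // Resume from where the previous needle stopped.
    while (i < haystack.length && haystack[i] < needles[j]) i++;
    out[j] = i; // first position whose value is >= needles[j]
  }
  return out;
}
```

The total work is at most len(haystack) + len(needles) steps, all of them sequential memory accesses.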

One way to reuse knowledge from previous steps is, as others suggested: once you have located a key, you can restrict the search ranges for the smaller and larger keys.

Assuming N=2^n, K=2^k and lucky outcomes: after finding the middle key (n comparisons), you have two subarrays of size N/2. Perform 2 searches for the "quartile" keys (n-1 comparisons each), reducing to N/4 subarrays...

In total, n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1) comparisons. After a bit of math, this equals roughly Kn - Kk = K(n-k).

This is a best-case scenario, and the savings are not so significant compared to independent searches (Kn comparisons). Anyway, the worst case (all searches resulting in imbalanced partitions) is no worse than independent searches.
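The middle-key-first scheme could look roughly like this in JavaScript. It is a sketch under the assumptions that the keys are sorted and that lower-bound positions are wanted; `lowerBound` and `searchKeysByBisection` are names I made up for illustration.

```javascript
// First index in arr[lo, hi) whose value is >= x, or hi if there is none.
function lowerBound(arr, x, lo, hi) {
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid] < x) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Sketch: locate the middle key first, then recurse on each half of the keys
// with the haystack range that the middle key's position leaves open.
function searchKeysByBisection(arr, keys) {
  const out = new Array(keys.length);
  function recurse(kLo, kHi, lo, hi) {
    if (kLo >= kHi) return;
    const kMid = (kLo + kHi) >> 1;
    const pos = lowerBound(arr, keys[kMid], lo, hi);
    out[kMid] = pos;
    recurse(kLo, kMid, lo, pos);     // smaller keys end up at or before pos
    recurse(kMid + 1, kHi, pos, hi); // larger keys end up at or after pos
  }
  recurse(0, keys.length, 0, arr.length);
  return out;
}
```

The recursion depth is about lg(K), and each level searches ranges that together cover the haystack at most once.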

UPDATE: this is an instance of the Minimum Comparison Merging problem.

Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.

From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K/K!), or K lg(N) - K lg(K) = K(n-k).

This bound cannot be beaten by any comparison-based method, so any such algorithm will take time essentially proportional to the number of keys.
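To get a feel for the numbers, the bound ceiling(lg(C(N+K,K))) can be evaluated with a short JavaScript sketch. The function name `mergeLowerBound` is hypothetical, and because it sums floating-point logarithms the result should be treated as approximate when the exact value sits very close to an integer.

```javascript
// Sketch: ceil(lg C(N+K, K)) via C(N+K, K) = product over i = 1..K of (N + i) / i.
// Summing logs avoids overflowing on large binomial coefficients.
function mergeLowerBound(N, K) {
  let bits = 0;
  for (let i = 1; i <= K; i++) {
    bits += Math.log2((N + i) / i);
  }
  return Math.ceil(bits);
}

// Compare against the two simpler estimates for N = 2^20, K = 2^4.
const n = 20;
const k = 4;
const bound = mergeLowerBound(2 ** n, 2 ** k);
console.log("lower bound:", bound);
console.log("K(n-k):", 2 ** k * (n - k));
console.log("independent searches, Kn:", 2 ** k * n);
```

The exact bound lands between K(n-k) and Kn, since lg(K!) is smaller than K lg(K).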

  1. Sort the needles.
  2. Search for the first (smallest) remaining needle.
  3. Update the lower bound of the haystack with the search result.
  4. Search for the last (largest) remaining needle.
  5. Update the upper bound of the haystack with the search result.
  6. Go to step 2 and repeat with the remaining needles.

While not optimal, it is much easier to implement.
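The loop above might be written like this in JavaScript. This is just a sketch, assuming lower-bound results are wanted; `lowerBoundIn` and `outsideInSearch` are illustrative names.

```javascript
// First index in arr[lo, hi) whose value is >= x, or hi if there is none.
function lowerBoundIn(arr, x, lo, hi) {
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid] < x) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Sketch: alternate between the smallest and largest remaining needle,
// shrinking the haystack window from both ends.
function outsideInSearch(haystack, needles) {
  const sorted = needles.slice().sort((a, b) => a - b); // step 1
  const out = new Array(sorted.length);
  let lo = 0;
  let hi = haystack.length;
  let first = 0;
  let last = sorted.length - 1;
  while (first <= last) {
    lo = lowerBoundIn(haystack, sorted[first], lo, hi); // steps 2 and 3
    out[first++] = lo;
    if (first > last) break;
    hi = lowerBoundIn(haystack, sorted[last], lo, hi);  // steps 4 and 5
    out[last--] = hi;
  }
  return { needles: sorted, positions: out };
}
```

The returned positions line up with the sorted needles, not with the order in which they were passed in.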

If you have an array of ints and you want to minimize the number of comparisons, I suggest interpolation search from Knuth, 6.2.1. Where binary search requires Log(N) iterations (and comparisons), interpolation search requires only about Log(Log(N)) on average, provided the values are roughly uniformly distributed; in the worst case it can degrade to a linear number of probes.

For details and code sample see:

http://en.wikipedia.org/wiki/Interpolation_search

http://xlinux.nist.gov/dads//HTML/interpolationSearch.html
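For a quick idea of what interpolation search looks like, here is a minimal JavaScript sketch for a sorted array of numbers. It is my own illustration rather than code from either link, and the name `interpolationSearch` is just a label. It returns an index of the value or -1 if it is absent.

```javascript
// Sketch: guess the position of x by linear interpolation between the
// values at the ends of the current window, instead of always halving.
function interpolationSearch(arr, x) {
  let lo = 0;
  let hi = arr.length - 1;
  while (lo <= hi && x >= arr[lo] && x <= arr[hi]) {
    if (arr[hi] === arr[lo]) {
      // All values in the window are equal; avoid dividing by zero.
      return arr[lo] === x ? lo : -1;
    }
    // Estimated position, always within [lo, hi] because arr[lo] <= x <= arr[hi].
    const pos = lo + Math.floor(((x - arr[lo]) * (hi - lo)) / (arr[hi] - arr[lo]));
    if (arr[pos] === x) return pos;
    if (arr[pos] < x) lo = pos + 1;
    else hi = pos - 1;
  }
  return -1;
}
```

On evenly spaced data such as `[10, 20, 30, 40, 50]` the first guess is usually already the right slot, which is where the Log(Log(N)) behaviour comes from.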

I know the question was regarding C, but I just did an implementation of this in JavaScript that I thought I'd share. It is not intended to work if you have duplicate elements in the array; I think it will just return any of the possible indexes in that case. For an array with 1 million elements where you search for each element, it's about 2.5x faster. If you also search for elements that are not contained in the array, then it's even faster. On one data set I threw at it, it was several times faster. For small arrays it's about the same.

        singleSearch=function(array, num) {
            return this.singleSearch_(array, num, 0, array.length)
        }

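        // Lower-bound binary search: returns the first index in [left, right)
        // whose value is >= num, or right if there is none.
        singleSearch_=function(array, num, left, right){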
            while (left < right) {
                var middle =(left + right) >> 1;
                var midValue = array[middle];

                if (num > midValue) {
                    left = middle + 1;
                } else {
                    right = middle;
                }
            }
            return left;
        };


        multiSearch=function(array, nums) {
            var numsLength=nums.length;
            var results=new Int32Array(numsLength);
            this.multiSearch_(array, nums, 0, array.length, 0, numsLength, results);
            return results;
        };

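        // Divide and conquer: probe the middle of the current array range, split
        // the sorted nums at that value, then recurse on each half of nums with
        // the matching half of the array.
        multiSearch_=function(array, nums, left, right, numsLeft, numsRight, results) {
            var middle = (left + right) >> 1;
            var midValue = array[middle];
            // First index in nums[numsLeft..numsRight) whose value is >= midValue.
            var numsMiddle = this.singleSearch_(nums, midValue, numsLeft, numsRight);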
            if ((numsRight - numsLeft) > 1) {
                if (middle + 1 < right) {
                    var newLeft = middle;
                    var newRight = middle;
                    if ((numsRight - numsMiddle) > 0) {
                        this.multiSearch_(array, nums, newLeft, right, numsMiddle, numsRight, results);
                    }
                    if (numsMiddle - numsLeft > 0) {
                        this.multiSearch_(array, nums, left, newRight, numsLeft, numsMiddle, results);
                    }
                }
                else {
                    for (var i = numsLeft; i < numsRight; i++) {
                        var result = this.singleSearch_(array, nums[i], left, right);
                        results[i] = result;
                    }
                }
            }
            else {
                var result = this.singleSearch_(array, nums[numsLeft], left, right);
                results[numsLeft] = result;
            };
        }

// A recursive binary search based function. It returns the index of x in
// the given array arr[l..r] if present, otherwise -1.

int binarySearch(int arr[], int l, int r, int x)
{
   if (r >= l)
   {
        int mid = l + (r - l)/2;

        // If the element is present at one of the middle 3 positions
        if (arr[mid] == x)  return mid;
        if (mid > l && arr[mid-1] == x) return (mid - 1);
        if (mid < r && arr[mid+1] == x) return (mid + 1);

        // If element is smaller than mid, then it can only be present
        // in left subarray
        if (arr[mid] > x) return binarySearch(arr, l, mid-2, x);

        // Else the element can only be present in right subarray
        return binarySearch(arr, mid+2, r, x);
   }

   // We reach here when element is not present in array
   return -1;
}

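Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.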

 