
Binary search for multiple distinct numbers in a large array in minimum number of comparisons

I have a large array of size n (say n = 1000000) whose values are monotonically non-decreasing. I have a set of k key values (say k = { 1,23,39,55,..}); assume the key values are sorted. I have to find the index of each key value in the large array using the minimum number of comparisons. How do I use binary search to search for multiple unique values? Doing it separately for each key value takes a lot of comparisons. Can I somehow reuse knowledge gained in one search when I search for another element in the same big array?

  1. Sort the needles (the values you will search for).
  2. Create an array of the same length as the needles, with each element being a pair of indexes. Initialize each pair with {0, len(haystack)}. These pairs represent all the knowledge we have of the possible locations of the needles.
  3. Look at the middle value in the haystack. Now do a binary search for that value in your needles. For all lesser needles, set the upper bound (in the array from step 2) to the current haystack index. For all greater needles, set the lower bound.
  4. While you are doing step 3, keep track of which needle now has the largest range remaining. Bisect that range and use its midpoint as your new middle value to repeat step 3. If the largest remaining range has shrunk to a single position, you're done: all needles have been found (or, if not found, their prospective location in the haystack is now known).

There may be some slight complication here when you have duplicate values in the haystack, but I think once you have the rest sorted out this should not be too difficult.
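The four steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: the name `multi_search` is hypothetical, each needle's answer is taken to be its leftmost insertion point (what `bisect.bisect_left` would return), and needles equal to the probed haystack value are treated as "lesser", which is one possible way to handle the duplicate-value complication.

```python
from bisect import bisect_right

def multi_search(haystack, needles):
    """Leftmost insertion index in haystack for each needle, in sorted needle order."""
    needles = sorted(needles)            # step 1
    k = len(needles)
    if k == 0:
        return []
    # Step 2: the answer for needle j is known to lie in [lo[j], hi[j]].
    lo = [0] * k
    hi = [len(haystack)] * k
    while True:
        # Step 4: pick the needle whose remaining range is widest.
        widest = max(range(k), key=lambda j: hi[j] - lo[j])
        if hi[widest] == lo[widest]:
            return lo                    # every range is a single position
        mid = (lo[widest] + hi[widest]) // 2
        # Step 3: binary search the probed haystack value among the needles.
        split = bisect_right(needles, haystack[mid])
        for j in range(split):           # needles <= haystack[mid]
            hi[j] = min(hi[j], mid)
        for j in range(split, k):        # needles > haystack[mid]
            lo[j] = max(lo[j], mid + 1)
```

Each haystack probe costs one binary search over the needles, and the bookkeeping loops are linear in the number of needles, so this sketch aims at a low comparison count rather than low running time.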


I was curious whether NumPy implemented anything like this. The Python name for what you're doing is numpy.searchsorted(), and once you get through the API layers it comes down to this:

    /*
     * Updating only one of the indices based on the previous key
     * gives the search a big boost when keys are sorted, but slightly
     * slows down things for purely random ones.
     */
    if (@TYPE@_LT(last_key_val, key_val)) {
        max_idx = arr_len;
    }
    else {
        min_idx = 0;
        max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
    }

So they do not do a full-blown optimization like the one I described, but they do track whether the current needle is greater than the last one; when it is, they avoid searching the haystack below where the last needle was found. This is a simple and elegant improvement over the naive implementation and, as the comment suggests, it has to stay simple and fast because the function does not require the needles to be sorted in the first place.
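The same idea can be sketched in plain Python with the standard `bisect` module. This is a simplified illustration, not NumPy's code: the function name `search_many` is hypothetical, and only the lower bound is carried over between keys, whereas the C snippet above also adjusts `max_idx`.

```python
from bisect import bisect_left

def search_many(haystack, needles):
    """Leftmost insertion index for each needle, reusing the previous result."""
    results = []
    lo = 0        # nothing below this index needs to be searched
    last = None   # previous needle, if any
    for needle in needles:
        if last is None or needle < last:
            lo = 0  # needles went backwards: the old lower bound is invalid
        pos = bisect_left(haystack, needle, lo)
        results.append(pos)
        lo, last = pos, needle
    return results
```

When the needles happen to be sorted, every search starts where the previous one ended; when they are not, the only extra cost is one comparison between consecutive needles.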


By the way: my proposed solution aimed for something like theoretical optimality in big-O terms, but if you have a large number of needles, the fastest thing to do is probably to sort the needles then iterate over the entire haystack and all the needles in tandem: linear-search for the first needle, then resume from there to look for the second, etc. You can even skip every second item in the haystack by recognizing that if a needle is greater than A and less than C, it must belong at position B (assuming you don't care about the left/right insertion order for needles not in the haystack). You can then do about len(haystack)/2 comparisons and the entire thing will be very cache-friendly (after sorting the needles, of course).
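A minimal sketch of that tandem scan, assuming leftmost insertion points are wanted (the name `merge_search` is hypothetical, and the skip-every-second-item trick is left out to keep it short):

```python
def merge_search(haystack, needles):
    """Leftmost insertion index for each needle, in sorted needle order."""
    results = []
    pos = 0
    for needle in sorted(needles):
        # Resume the linear scan where the previous needle stopped.
        while pos < len(haystack) and haystack[pos] < needle:
            pos += 1
        results.append(pos)
    return results
```

The scan index never moves backwards, so the total work after sorting is at most len(haystack) + len(needles) steps.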

One way to reuse knowledge from previous steps is like others suggested: once you have located a key, you can restrict the search ranges for the smaller and larger keys.

Assuming N=2^n, K=2^k and lucky outcomes: after finding the middle key, (n comparisons), you have two subarrays of size N/2. Perform 2 searches for the "quartile" keys (n-1 comparisons each), reducing to N/4 subarrays...

In total, n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1) comparisons. After a bit of math, this equals roughly Kn - Kk = K(n - k).

This is a best case scenario and the savings are not so significant compared to independent searches (Kn comparisons). Anyway, the worst case (all searches resulting in imbalanced partitions) is not worse than independent searches.
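To see how the best case compares with independent searches, the sum can be evaluated directly. The helper names below are hypothetical; the closed form K(n - k + 2) - n - 2 is my own simplification of the series, whose leading terms give the K(n - k) estimate.

```python
def best_case_comparisons(n, k):
    """n + 2(n-1) + 4(n-2) + ... + 2^(k-1) (n-k+1), for N = 2^n and K = 2^k."""
    return sum(2**i * (n - i) for i in range(k))

def independent_comparisons(n, k):
    """K separate binary searches of n comparisons each."""
    return 2**k * n

def closed_form(n, k):
    """Exact value of the series: K(n - k + 2) - n - 2."""
    return 2**k * (n - k + 2) - n - 2
```

For n = 20 and k = 3 (about a million values, 8 keys) the best case is 130 comparisons against 160 for independent searches, which matches the remark that the savings are modest.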

UPDATE: this is an instance of the Minimum Comparison Merging problem.

Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.

From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K/K!), or roughly K lg(N) - K lg(K) = K(n - k).

This bound cannot be beaten by any comparison-based method, so any such algorithm will take time essentially proportional to the number of keys.
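The bound is easy to evaluate exactly for concrete sizes. The sketch below uses Python's `math.comb` (available from Python 3.8) and the fact that, for a positive integer c, `(c - 1).bit_length()` equals ceiling(lg(c)); the function name is hypothetical.

```python
from math import comb

def merge_lower_bound(n_values, k_keys):
    """ceiling(lg(C(N+K, K))): minimum comparisons to place K keys among N values."""
    ways = comb(n_values + k_keys, k_keys)
    return (ways - 1).bit_length()
```

With a single key this reduces to the familiar binary search cost: for N = 1000000 and K = 1 it gives 20.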

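  1. Sort the needles.
  2. Search for the first remaining needle.
  3. Update the lower bound of the haystack with the search result.
  4. Search for the last remaining needle.
  5. Update the upper bound of the haystack with the search result.
  6. Go to step 2 and repeat until no needles remain.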

While not optimal, it is much easier to implement.
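A minimal Python sketch of this outside-in scheme, assuming leftmost insertion points are wanted (the function name is hypothetical):

```python
from bisect import bisect_left

def search_from_both_ends(haystack, needles):
    """Leftmost insertion index for each needle, in sorted needle order."""
    needles = sorted(needles)
    results = [0] * len(needles)
    lo, hi = 0, len(haystack)            # current haystack window
    first, last = 0, len(needles) - 1    # remaining needles, inclusive
    while first <= last:
        # Smallest remaining needle: its position becomes the new lower bound.
        lo = bisect_left(haystack, needles[first], lo, hi)
        results[first] = lo
        first += 1
        if first <= last:
            # Largest remaining needle: its position becomes the new upper bound.
            hi = bisect_left(haystack, needles[last], lo, hi)
            results[last] = hi
            last -= 1
    return results
```

The window only shrinks, so each later search runs over a range no larger than the one before it.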

If you have an array of ints and want to minimize the number of comparisons, I suggest interpolation search (Knuth, Section 6.2.1). Where binary search needs about log(N) iterations (and comparisons), interpolation search needs only about log(log(N)) on average, provided the values are roughly uniformly distributed; on heavily skewed data it can degrade to a linear number of probes.

For details and code sample see:

http://en.wikipedia.org/wiki/Interpolation_search

http://xlinux.nist.gov/dads//HTML/interpolationSearch.html
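For orientation, here is a minimal Python sketch of interpolation search over a sorted list of ints. The function name and the choice to return -1 for a missing value are assumptions made for the example; it returns some index holding x, not necessarily the first one.

```python
def interpolation_search(arr, x):
    """Return an index i with arr[i] == x, or -1 if x is not in the sorted list."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= x <= arr[hi]:
        if arr[lo] == arr[hi]:
            return lo  # all values in the window are equal, so they equal x
        # Estimate the position by assuming values grow linearly across the window.
        pos = lo + (x - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == x:
            return pos
        if arr[pos] < x:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On evenly spaced data the first estimate is often already correct, which is where the log(log(N)) average comes from.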

I know the question was about C, but I just implemented this in JavaScript and thought I'd share. It is not intended to work if the array contains duplicate elements; I think it will just return any one of the possible indexes in that case. For an array with 1 million elements where you search for every element, it's about 2.5x faster. If you also search for elements that are not in the array, it's even faster. On one data set I threw at it, it was several times faster. For small arrays it's about the same.

    // Lower-bound binary search: returns the first index in [left, right)
    // whose value is >= num, or `right` if there is no such value.
    function singleSearch_(array, num, left, right) {
        while (left < right) {
            var middle = (left + right) >> 1;
            if (num > array[middle]) {
                left = middle + 1;
            } else {
                right = middle;
            }
        }
        return left;
    }

    function singleSearch(array, num) {
        return singleSearch_(array, num, 0, array.length);
    }

    // Looks up every value of the sorted array `nums` in the sorted `array`.
    function multiSearch(array, nums) {
        var results = new Int32Array(nums.length);
        multiSearch_(array, nums, 0, array.length, 0, nums.length, results);
        return results;
    }

    // Resolves nums[numsLeft, numsRight) against array[left, right).
    function multiSearch_(array, nums, left, right, numsLeft, numsRight, results) {
        var middle = (left + right) >> 1;
        if ((numsRight - numsLeft) > 1 && middle + 1 < right) {
            // Split the keys around the middle value of the array, then
            // recurse into each half with only the keys that can be there.
            var numsMiddle = singleSearch_(nums, array[middle], numsLeft, numsRight);
            if (numsRight > numsMiddle) {
                multiSearch_(array, nums, middle, right, numsMiddle, numsRight, results);
            }
            if (numsMiddle > numsLeft) {
                multiSearch_(array, nums, left, middle, numsLeft, numsMiddle, results);
            }
        } else {
            // One key left, or the array range is too small to split:
            // finish with ordinary binary searches inside the range.
            for (var i = numsLeft; i < numsRight; i++) {
                results[i] = singleSearch_(array, nums[i], left, right);
            }
        }
    }

    // A recursive binary-search-based function. It returns the index of x
    // in the given array arr[l..r] if present, otherwise -1.
    int binarySearch(int arr[], int l, int r, int x)
    {
        if (r >= l)
        {
            int mid = l + (r - l)/2;

            // If the element is present at one of the middle 3 positions
            if (arr[mid] == x)  return mid;
            if (mid > l && arr[mid-1] == x) return (mid - 1);
            if (mid < r && arr[mid+1] == x) return (mid + 1);

            // If element is smaller than mid, then it can only be present
            // in left subarray
            if (arr[mid] > x) return binarySearch(arr, l, mid-2, x);

            // Else the element can only be present in right subarray
            return binarySearch(arr, mid+2, r, x);
        }

        // We reach here when element is not present in array
        return -1;
    }
