Finding the intersection of two sorted arrays that in some cases requires fewer than O(m+n) comparisons

Question

Here is one way of doing this in O(m+n), where m and n are the lengths of the two arrays:

import random

def comm_seq(arr_1, arr_2):
    if len(arr_1) == 0 or len(arr_2) == 0:
        return []

    m = len(arr_1) - 1
    n = len(arr_2) - 1

    if arr_1[m] == arr_2[n]:
        return comm_seq(arr_1[:-1], arr_2[:-1]) + [arr_1[m]]

    elif arr_1[m] < arr_2[n]:
        return comm_seq(arr_1, arr_2[:-1])

    elif arr_1[m] > arr_2[n]:
        return comm_seq(arr_1[:-1], arr_2)


if __name__ == "__main__":
    arr_1 = [random.randrange(0, 5) for _ in range(10)]
    arr_2 = [random.randrange(0, 5) for _ in range(10)]
    arr_1.sort()
    arr_2.sort()
    print(comm_seq(arr_1, arr_2))

Is there a technique that in some cases uses fewer than O(m+n) comparisons? For example: arr_1=[1,2,2,2,2,2,2,2,2,2,2,100] and arr_2=[1,3,100]

(Not looking for the hash table implementation)

Answer 1

A binary search algorithm requires O(log m) time to find a number in an array of length m. Therefore, if we search for each number of an array of length n in an array of length m, the overall time complexity is O(n log m). If m is much greater than n, O(n log m) is actually less than O(m+n). Therefore, we can implement a new and better solution based on binary search in such a situation. (source)

However, this does not necessarily mean that binary search is better than the O(m+n) approach. In fact, the binary search approach is only better when n << m (n is very small compared to m).
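To put rough numbers on it: with m = 1,000,000 and n = 10, n log2 m is about 10 * 20 = 200 comparisons, against roughly 1,000,010 for the linear walk. Below is a minimal sketch of the binary search idea, using the standard-library bisect module. The function name intersect_binary is made up for illustration, and passing the previous position as the lower bound (so that duplicates are consumed one at a time, like the question's comm_seq) is my own assumption rather than something the quoted source spells out.

```python
from bisect import bisect_left

def intersect_binary(small, large):
    """Intersect two sorted lists by binary-searching each element of
    `small` in `large`. Hypothetical helper, not from the quoted source."""
    result = []
    lo = 0  # never search to the left of the previous position
    for x in small:
        i = bisect_left(large, x, lo)
        if i == len(large):
            break  # every remaining element of `large` is smaller than x
        if large[i] == x:
            result.append(x)
            lo = i + 1  # consume the matched element
        else:
            lo = i
    return result

print(intersect_binary([1, 3, 100], [1] + [2] * 10 + [100]))  # [1, 100]
```

Each lookup costs O(log m), so this only pays off when the first argument is much shorter than the second.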

Answer 2

As far as I know, there are a few different ways to solve this problem, but none of them are better than O(m + n). I don't know how you can have an algorithm faster than that (barring weird quantum computing answers), because you have to compare all the elements in both arrays or you might miss a duplicate.

Brute Force: Use two nested for loops. Take every element from the first array and linear search it in the second array. O(M*N) time, O(1) space.

Map Lookup: Use a lookup structure like a hashtable or a binary search tree. Put all of the first array into the map structure, then loop through all of the second array and look up each element in the map to see if it exists. This works whether the arrays are sorted or not. O(M*log(M) + N*log(M)) time for a binary search tree or O(M + N) time for a hashtable; both use O(M) space.

Binary Search: Like brute force, but take every element from the first array and binary search it in the second array. O(M*log(N)) time, O(1) space.

Parallel Walk: Like the merge part of merge sort. Have two pointers start at the front of each of the arrays. Compare the two elements; if they're equal, store the duplicate, otherwise advance the pointer to the smaller value by one spot, and repeat until you hit the end of one of the arrays. O(M + N) time, O(1) space.
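The parallel walk in the last item could be sketched like this. The name intersect_walk is hypothetical, and advancing both pointers after a match (so each element is used at most once) is my assumption, chosen to match the behaviour of the question's comm_seq.

```python
def intersect_walk(a, b):
    """Two-pointer intersection of two sorted lists (illustrative sketch)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # common element: keep it, move both pointers
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is too small to match anything left in b
        else:
            j += 1  # b[j] is too small to match anything left in a
    return result

print(intersect_walk([1, 2, 2, 4], [2, 2, 3, 4]))  # [2, 2, 4]
```

Unlike the recursive version in the question, this does not slice the lists, so no copies are made and recursion depth is not a concern.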

Regardless, you must examine every element in both arrays or you won't know if you've found all the duplicates. You could argue for fringe cases where one array is a lot bigger or a lot smaller, but that won't hold for an algorithm that has to handle all ranges of input.

Answer 3

You can use a hash table to store the larger array, and then scan the smaller array to compute the intersection of the two arrays.

import random

def comm_seq(arr_1, arr_2):
    # make arr_1 the larger array
    if len(arr_1) < len(arr_2):
        arr_1, arr_2 = arr_2, arr_1
    # save the large array in a hash table, counting occurrences
    cnt = {}
    for item in arr_1:
        cnt[item] = cnt.get(item, 0) + 1
    # scan the small array and collect the answer
    ret = []
    for item in arr_2:
        if cnt.get(item, 0):
            ret.append(item)
            cnt[item] -= 1
    return ret

if __name__ == "__main__":
    arr_1 = [random.randrange(0, 5) for _ in range(10)]
    arr_2 = [random.randrange(0, 5) for _ in range(10)]
    arr_1.sort()
    arr_2.sort()
    print(comm_seq(arr_1, arr_2))

If we treat Python dictionary operations as O(1), building the table takes O(max(n, m)) and scanning the smaller array takes O(min(n, m)), so the total complexity is O(n + m).

Answer 4

An algorithm with O(N*log(M/N)) comparisons is possible if you use a combination of one-sided and normal binary search. In the worst case (when both arrays are of equal size) this is equal to O(N) = O(M + N) comparisons. Here M is the size of the larger array and N is the number of distinct elements in the smaller array.

Take the smaller of the two arrays and search for each of its elements in the larger array. Start with a one-sided binary search: try positions M/N, 2*M/N, 4*M/N, ... until an element larger than the one you are looking for is found. Then use normal binary search to find the element between positions 0 and 2^k*M/N.

If a matching element is found, use the same combination of one-sided and normal binary search to find where the run of duplicate matching elements ends, and copy the appropriate number of matching elements to the output. You can use the same combination of binary searches to count the number of duplicate elements in the smaller array, and take the minimum of these duplicate counts to determine how many elements should be in the result.

To continue with the next element from the smaller array, start in the larger array at the position where the previous step ended.
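The steps above could be sketched roughly as follows. This is only one reading of the description, not the answerer's own code: the names gallop and intersect_gallop are made up, the initial step uses the length of the smaller array in place of its number of distinct elements (an assumption that avoids a counting pass), and the final binary search is restricted to the last bracket found rather than restarting from position 0.

```python
from bisect import bisect_left, bisect_right

def gallop(arr, target, start, step, right=False):
    """One-sided search from `start`, then binary search in the bracket.
    Returns the first index >= start whose value is >= target
    (or > target when right=True). Hypothetical helper."""
    n = len(arr)
    lo, hi = start, start + step
    # Probe start+step, start+2*step, start+4*step, ... until we overshoot.
    while hi < n and (arr[hi] <= target if right else arr[hi] < target):
        lo = hi + 1
        step *= 2
        hi = start + step
    finish = bisect_right if right else bisect_left
    return finish(arr, target, lo, min(hi, n))

def intersect_gallop(a, b):
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if not small:
        return []
    step = max(1, len(large) // len(small))  # assumed stand-in for M/N
    result = []
    i = j = 0
    while i < len(small) and j < len(large):
        x = small[i]
        i_end = gallop(small, x, i, 1, right=True)  # end of x's run in small
        j = gallop(large, x, j, step)               # first position >= x in large
        j_end = gallop(large, x, j, 1, right=True)  # end of x's run in large
        # Emit x as many times as it occurs in both arrays.
        result.extend([x] * min(i_end - i, j_end - j))
        i, j = i_end, j_end  # resume where the previous step ended
    return result

print(intersect_gallop([1] + [2] * 10 + [100], [1, 3, 100]))  # [1, 100]
```

On the example from the question, the search for 3 jumps over the run of 2s in a handful of probes instead of stepping through them one by one.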

Answer 1
5 2012-11-22 04:08:39

Answer 2
5 2012-11-22 04:23:52

Answer 3
1 2012-11-22 04:19:05

Answer 4
1 2012-11-22 08:44:42