Comparing lists of lists using binary search in Python

Question

I have two lists of lists, x (1 million elements) and y (0.1 million elements), and want to get z = x - y. Each list consists of sublists of 4 elements each, and the lists are sorted by the first element of each sublist. The first elements are strictly increasing and no duplicates are present. I did this using a list comprehension and it takes roughly 6.5 hours to run. I want to know the most time-efficient way to do this, keeping in mind that my end result should also be a list of lists.

Secondly, since all my first elements are sorted, I thought a binary search would be a better idea. The idea: for example, consider two lists of size x = 30 and y = 10. I loop over the elements of y and compare the first element of each sublist to those of the elements in x using binary search; when I find a match, that sublist is deleted from x. So the expected output list should contain 20 elements. But the code I have written gives me 23 (it does not delete the last three matches) and I don't know what's wrong with it. Here's the code:

def intersection(x,y):
    temp=x[:]
    for i in range(len(y)):
        l=0
        h=len(x)-1
        while l<h:
            mid=l+((h-l)/2)
            if y[i][0]==temp[mid][0]:
                a=y[i]
                x.remove(a)
                break
            elif y[i][0]>temp[mid][0]:
                if l==mid:
                    break
                l=mid
            elif y[i][0]<temp[mid][0]:
                h=mid
    return(x)






X-List input of 30 elements
[[1.0, 25.0, 0.0, 0.0]
[2.0, 0.0, 25.0, 0.0]
[3.0, 0.0, 50.0, 0.0]
[4.0, 50.0, 50.0, 0.0]
[5.0, 50.0, 0.0, 0.0]
[6.0, 0.0, 25.0, 10.0]
[7.0, 25.0, 0.0, 10.0]
[8.0, 50.0, 0.0, 10.0]
[9.0, 50.0, 50.0, 10.0]
[10.0, 0.0, 50.0, 10.0]
[11.0, 0.0, 0.0, 0.0]
[12.0, 0.0, 0.0, 10.0]
[13.0, 17.6776695, 17.6776695, 0.0]
[14.0, 0.0, 34.3113632, 0.0]
[15.0, 25.9780293, 50.0, 0.0]
[16.0, 50.0, 25.9780293, 0.0]
[17.0, 34.3113632, 0.0, 0.0]
[18.0, 17.6776695, 17.6776695, 10.0]
[19.0, 34.3113632, 0.0, 10.0]
[20.0, 50.0, 25.9780293, 10.0]
[21.0, 25.9780293, 50.0, 10.0]
[22.0, 0.0, 34.3113632, 10.0]
[23.0, 11.6599302, 0.0, 0.0]
[24.0, 0.0, 11.6599302, 0.0]
[25.0, 0.0, 11.6599302, 10.0]
[26.0, 11.6599302, 0.0, 10.0]
[27.0, 27.9121876, 27.9121876, 0.0]
[28.0, 27.9121876, 27.9121876, 10.0]
[29.0, 9.77920055, 9.77920055, 0.0]
[30.0, 9.77920055, 9.77920055, 10.0]]
Y -List of 10 elements
[1.0, 25.0, 0.0, 0.0]
[2.0, 0.0, 25.0, 0.0]
[11.0, 0.0, 0.0, 0.0]
[13.0, 17.6776695, 17.6776695, 0.0]
[14.0, 0.0, 34.3113632, 0.0]
[17.0, 34.3113632, 0.0, 0.0]
[23.0, 11.6599302, 0.0, 0.0]
[24.0, 0.0, 11.6599302, 0.0]
[27.0, 27.9121876, 27.9121876, 0.0]
[29.0, 9.77920055, 9.77920055, 0.0]
Z list (X - Y): the result should be 20 elements, but it gives a length of 23. It does not remove the remaining 3 elements from the list.




[[3.0, 0.0, 50.0, 0.0],
 [4.0, 50.0, 50.0, 0.0],
 [5.0, 50.0, 0.0, 0.0],
 [6.0, 0.0, 25.0, 10.0],
 [7.0, 25.0, 0.0, 10.0],
 [8.0, 50.0, 0.0, 10.0],
 [9.0, 50.0, 50.0, 10.0],
 [10.0, 0.0, 50.0, 10.0],
 [12.0, 0.0, 0.0, 10.0],
 [15.0, 25.9780293, 50.0, 0.0],
 [16.0, 50.0, 25.9780293, 0.0],
 [18.0, 17.6776695, 17.6776695, 10.0],
 [19.0, 34.3113632, 0.0, 10.0],
 [20.0, 50.0, 25.9780293, 10.0],
 [21.0, 25.9780293, 50.0, 10.0],
 [22.0, 0.0, 34.3113632, 10.0],
 [24.0, 0.0, 11.6599302, 0.0],
 [25.0, 0.0, 11.6599302, 10.0],
 [26.0, 11.6599302, 0.0, 10.0],
 [27.0, 27.9121876, 27.9121876, 0.0],
 [28.0, 27.9121876, 27.9121876, 10.0],
 [29.0, 9.77920055, 9.77920055, 0.0],
 [30.0, 9.77920055, 9.77920055, 10.0]]

Answer 1

If I understand you correctly, use bisect.bisect_left to find the matches and delete:

from bisect import bisect_left

for ele in y:
    ind = bisect_left(x, ele)
    if ind < len(x) and x[ind][0] == ele[0]:
        del x[ind]
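
A minimal, self-contained sketch of the same idea on a tiny made-up sample (Python 3; the function name `remove_matches` is hypothetical). Note that `bisect_left` compares whole sublists here, so this assumes a matching sublist in y is identical to the one in x. The bound check is `ind < len(x)` so that a match in the last position is removed as well.

```python
from bisect import bisect_left

def remove_matches(x, y):
    """Delete from x, in place, each sublist that bisect_left locates for an element of y."""
    for ele in y:
        ind = bisect_left(x, ele)  # leftmost position where ele could be inserted
        # ind == len(x) means ele is larger than everything in x
        if ind < len(x) and x[ind][0] == ele[0]:
            del x[ind]
    return x

x = [[1.0, 25.0, 0.0, 0.0], [2.0, 0.0, 25.0, 0.0],
     [3.0, 0.0, 50.0, 0.0], [4.0, 50.0, 50.0, 0.0]]
y = [[2.0, 0.0, 25.0, 0.0], [4.0, 50.0, 50.0, 0.0]]

result = remove_matches(x, y)
print(result)
```

Keep in mind that `del x[ind]` shifts every later element, so each deletion is linear in the length of x even though the search itself is logarithmic.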

If you look at the source you can see the code used for bisect_left:

def bisect_left(a, x, lo=0, hi=None):
    """Return the index where to insert item x in list a, assuming a is sorted.

    The return value i is such that all e in a[:i] have e < x, and all e in
    a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
    insert just before the leftmost x already there.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    return lo

You can adapt that into your own code:

def intersection(x, y):
    for ele in y:
        lo = 0
        hi = len(x)
        while lo < hi:
            mid = (lo+hi)//2
            if x[mid] < ele:
                lo = mid+1
            else:
                hi = mid
        if lo < len(x) and x[lo][0] == ele[0]:
            del x[lo]
    return x

print(len(intersection(x,y)))
20
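
As an aside, tracing the loop from the question on the sample data suggests why three matches survive: the search runs on the copy `temp`, but the upper bound `h` is taken from `len(x)`, which shrinks with every removal, so the later entries of `temp` (24, 27 and 29 here) fall outside the searched range. Below is a hedged sketch of a variant that avoids this by taking the bounds from the list being searched and building a new list instead of deleting. It assumes Python 3, and the function name `difference` and the simplified sample rows are made up for illustration.

```python
def difference(x, y):
    """Return the sublists of x whose first element does not occur in y.

    Assumes x is sorted by first element with no duplicate first elements.
    """
    keys = [sub[0] for sub in x]       # sorted first elements of x
    drop = set()                       # indices of x to leave out
    for sub in y:
        lo, hi = 0, len(keys)          # bounds come from the list being searched
        while lo < hi:
            mid = (lo + hi) // 2       # integer division in Python 3
            if keys[mid] < sub[0]:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(keys) and keys[lo] == sub[0]:
            drop.add(lo)
    return [sub for i, sub in enumerate(x) if i not in drop]

# Same first elements as the sample in the question; other columns simplified.
x = [[float(i), 0.0, 0.0, 0.0] for i in range(1, 31)]
y = [[float(i), 0.0, 0.0, 0.0] for i in (1, 2, 11, 13, 14, 17, 23, 24, 27, 29)]

z = difference(x, y)
print(len(z))
```

Because nothing is deleted during the search, x stays intact and the indices found by the binary search remain valid for the final filtering pass.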

If you have dupes then you will need to use remove. The if statement after the search loop checks the first elements for an exact match. If you were using remove, though, I don't see how that could work: just because the first elements match does not mean y[i] is in x, so x.remove would fail. So if you are only matching on first elements, you can change your logic: put the first element of each sublist of y in a set, then iterate over x and use a generator expression to update x.

st = {sub[0] for sub in y}

x[:] = (sub for sub in x if sub[0] not in st)
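
A small runnable sketch of this set-based filter, using made-up sample rows (Python 3). It illustrates two things: only the first element is compared, so a sublist of y with different trailing values still removes its counterpart in x, and the slice assignment updates the existing list object rather than creating a new one. The name `alias` is only there to demonstrate that.

```python
x = [[1.0, 25.0, 0.0, 0.0], [2.0, 0.0, 25.0, 0.0], [3.0, 0.0, 50.0, 0.0]]
alias = x                              # second reference to the same list object
y = [[2.0, 99.0, 99.0, 99.0]]          # same first element as x[1], other values differ

st = {sub[0] for sub in y}             # set of first elements to remove

# Slice assignment replaces the contents of x in place.
x[:] = (sub for sub in x if sub[0] not in st)

print(x)
print(alias is x)
```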

Answer 2

Bisection can work, but another easy solution is to use a set:

y_set = set(tuple(v) for v in y)

Note that the lists have to be turned into something immutable.

Now simply generate the result:

z = [v for v in x if tuple(v) not in y_set]

This might look very similar to your initial solution, but the lookups here are much faster.
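
A short sketch of both points on made-up sample rows (Python 3): lists cannot be put into a set directly because they are unhashable, and the tuple-based set gives the same result as a plain list membership test. The difference is that each set lookup is O(1) on average, while `v not in y` on a list scans y. The names `hashable` and `z_slow` are illustrative only.

```python
x = [[1.0, 25.0, 0.0, 0.0], [2.0, 0.0, 25.0, 0.0], [3.0, 0.0, 50.0, 0.0]]
y = [[2.0, 0.0, 25.0, 0.0]]

# Lists are mutable and therefore unhashable, so set(y) raises TypeError.
try:
    set(y)
    hashable = True
except TypeError:
    hashable = False

y_set = set(tuple(v) for v in y)                 # tuples are hashable

z = [v for v in x if tuple(v) not in y_set]      # average O(1) per lookup
z_slow = [v for v in x if v not in y]            # O(len(y)) per lookup

print(hashable, z == z_slow)
```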

@StefanPochmann has a good point that you might want to base your lookup on something more specific than the whole vector, such as just the first element. The question wasn't very clear about that (it only states that those are sorted).

Answer 3

If you can use the first elements for filtering:

ykeys = set(zip(*y)[0])
z = [s for s in x if s[0] not in ykeys]

Python 3 versions:

ykeys = set(list(zip(*y))[0])
ykeys = {s[0] for s in y}

If judging by the first element alone is not enough:

yset = set(map(tuple, y))
z = [s for s in x if tuple(s) not in yset]

On my weak laptop, with a test of your size, the first solution takes about 0.4 seconds and the second solution takes about 1 second. Not that surprising, since set lookups average O(1).

Here's a third version, and this one might be the most interesting because it doesn't just let Python do the job and because it's closer to what you intended, but even better:

yi, last = 0, len(y) - 1
z = []
for s in x:
    while s > y[yi] and yi < last:
        yi += 1
    if s != y[yi]:
        z.append(s)

This walks over x and, "in parallel", walks over y, similar to the merge step of merge sort. yi points into y, and we increase it as needed. Thus we have overall linear time, as we only walk over x from start to end and over y from start to end. My laptop takes about 0.6 seconds for this, which is faster than my second solution! (It's not fair to compare it to my first solution, since that one only looks at the first elements.)
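
The parallel walk can be wrapped in a function for testing. This is a sketch under a few assumptions: Python 3, both lists sorted, whole sublists compared (as in the loop above), and an added guard for an empty y, which the original loop would not handle. The name `sorted_difference` and the two-column sample rows are made up.

```python
def sorted_difference(x, y):
    """Return the sublists of sorted x that do not appear in sorted y."""
    if not y:                          # nothing to remove; also avoids y[0] on an empty list
        return list(x)
    yi, last = 0, len(y) - 1
    z = []
    for s in x:
        # Advance the pointer into y until y[yi] >= s or y is exhausted.
        while s > y[yi] and yi < last:
            yi += 1
        if s != y[yi]:
            z.append(s)
    return z

x = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
y = [[2.0, 0.0], [5.0, 0.0]]

z = sorted_difference(x, y)
print(z)
```

Since yi only ever moves forward, the inner while loop runs at most len(y) times in total, which is where the linear bound comes from.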

Answer 1
0 2015-05-17 22:36:35

Answer 2
0 Accepted 2015-05-17 23:06:26

Answer 3
0 2015-05-17 23:23:07
