Comparing two large lists to find element pairs that meet a condition in Python

I have two large lists of numbers (maybe a million elements each). I would like to compare them element-wise to identify element pairs whose difference is less than 0.5. I know two nested for loops are not an option. Is there a quick way to do this using sets or zip?

For example, if my lists are list1 = [1,2,3,4] and list2 = [3,4,5,6] and the condition is a difference of at most 1, then the solution would have the pairs arranged in a list as [element from list1, element from list2, difference]. The solution would be [[2,3,1],[3,3,0],[3,4,1],[4,3,1],[4,4,0],[4,5,1]]

Thanks

This should work. (Criticism appreciated)

Basically, my idea is to sort the two lists (O(n log n)) and then go through them while keeping track of the distance to the next element. That way I do not compute all the pairs, only a subset, which I estimate costs O(2*m*n), m being the maximum distance allowed.

x = sorted([0, 2, 3, 4])
y = sorted([1, 3, 4, 5, 6])
delta = 1
output = []
index = 0  # position in y where the scan for the current x value starts
number_of_operation = 0
for i, value_1 in enumerate(x):
    # If the next x value is more than 2 * delta away, its window
    # [x - delta, x + delta] cannot overlap the current one, so the
    # y values scanned now never need to be revisited.
    skip = i + 1 < len(x) and x[i + 1] - value_1 > 2 * delta
    j = index
    # Scan y until its values exceed value_1 + delta or y is exhausted.
    while j < len(y) and y[j] - value_1 <= delta:
        number_of_operation += 1
        if abs(value_1 - y[j]) <= delta:
            # Store the signed difference value_1 - value_2.
            output.append([value_1, y[j], value_1 - y[j]])
        j += 1
    if skip:
        index = j  # resume after the values already scanned
print(output)
print(number_of_operation)
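
As a variation on the same sort-then-scan idea, here is a minimal sketch that uses binary search from the standard library's bisect module instead of a hand-written scan. It is not the code from the answer above: the function name close_pairs_bisect and its parameters are made up for illustration, and the threshold is assumed to be inclusive (difference <= delta), as in the question's example.

```python
from bisect import bisect_left, bisect_right

def close_pairs_bisect(list1, list2, delta):
    """Return [a, b, |a - b|] for every a in list1 and b in list2 with |a - b| <= delta."""
    ys = sorted(list2)  # sort once so binary search can be used
    pairs = []
    for a in sorted(list1):
        # Locate the slice of ys that lies inside [a - delta, a + delta].
        lo = bisect_left(ys, a - delta)
        hi = bisect_right(ys, a + delta)
        for b in ys[lo:hi]:
            pairs.append([a, b, abs(a - b)])
    return pairs

print(close_pairs_bisect([1, 2, 3, 4], [3, 4, 5, 6], 1))
# [[2, 3, 1], [3, 3, 0], [3, 4, 1], [4, 3, 1], [4, 4, 0], [4, 5, 1]]
```

Each lookup costs O(log n), so the total work is the two sorts plus one binary search per element of list1 plus the number of pairs reported. If most elements are within delta of each other, the output itself is large and no approach can avoid that cost.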

Use numpy's broadcasting

import numpy as np
x = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = np.array([3, 4, 5, 6]).reshape(1, -1)
diff = x - y

However, this approach does not avoid the N^2 comparisons; it only takes advantage of numpy's speed optimizations.
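
To go from the difference matrix to the requested list of pairs, one option is a boolean mask plus np.nonzero. The sketch below is an assumption-laden illustration rather than part of the answer: the function name close_pairs_broadcast and the chunk parameter are hypothetical. Chunking is there because a full difference matrix for two million-element lists would have 10^12 entries, so only a slice of it is built at a time. The threshold is assumed to be inclusive.

```python
import numpy as np

def close_pairs_broadcast(list1, list2, delta, chunk=100):
    """Collect rows (a, b, |a - b|) with |a - b| <= delta via broadcasting."""
    x = np.asarray(list1)
    y = np.asarray(list2).reshape(1, -1)
    rows = []
    # Process x in slices so each difference matrix is only chunk x len(y).
    for start in range(0, len(x), chunk):
        xs = x[start:start + chunk].reshape(-1, 1)
        diff = np.abs(xs - y)             # shape (len(xs), len(y))
        i, j = np.nonzero(diff <= delta)  # row/column indices of the matches
        rows.append(np.column_stack((xs[i, 0], y[0, j], diff[i, j])))
    return np.vstack(rows) if rows else np.empty((0, 3))

print(close_pairs_broadcast([1, 2, 3, 4], [3, 4, 5, 6], 1).tolist())
# [[2, 3, 1], [3, 3, 0], [3, 4, 1], [4, 3, 1], [4, 4, 0], [4, 5, 1]]
```

The number of comparisons is still len(list1) * len(list2); chunking only bounds the memory used at any one time.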

You might be able to avoid the O(N²) behavior if you sort your lists first (or better yet, if your lists are already sorted). Then you can step through them element-wise. This would give you O(n log n) for the sorts plus roughly O(n) to step through the elements when matching pairs are sparse. For example:

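list1 = range(0, 1000000)
list2 = range(999999, 1999999)

def getClose(list1, list2, limit=1):
    # Both inputs must be sorted in ascending order.
    start = 0  # first index in list2 that can still match
    for a in list1:
        # Drop list2 values that are too small for a and for every later a.
        while start < len(list2) and list2[start] < a - limit:
            start += 1
        # Yield every list2 value inside [a - limit, a + limit].
        c2 = start
        while c2 < len(list2) and list2[c2] <= a + limit:
            yield (a, list2[c2], abs(a - list2[c2]))
            c2 += 1

for n in getClose(list1, list2):
    print(*n)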

Produces...

999998 999999 1
999999 999999 0
999999 1000000 1

...relatively quickly and much quicker than finding the product first.
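
For reference, "finding the product first" corresponds to something like the following sketch, which tests every pair from the Cartesian product. The name close_pairs_bruteforce is made up for illustration and the threshold is assumed to be inclusive. It does len(list1) * len(list2) comparisons, so it is only practical for checking a faster version on small inputs.

```python
from itertools import product

def close_pairs_bruteforce(list1, list2, limit):
    """Reference answer: test every pair from the Cartesian product."""
    return [
        (a, b, abs(a - b))
        for a, b in product(list1, list2)
        if abs(a - b) <= limit
    ]

print(close_pairs_bruteforce([1, 2, 3, 4], [3, 4, 5, 6], 1))
# [(2, 3, 1), (3, 3, 0), (3, 4, 1), (4, 3, 1), (4, 4, 0), (4, 5, 1)]
```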
