Comparing two large lists to find element pairs that meet a condition in Python
I have two large lists of numbers (maybe a million elements each). I would like to compare both of them element-wise to identify element pairs that have a difference of less than 0.5. I know two nested for loops is not an option. Is there any quick way to do this using sets or zip?
For example, if my lists are
list1 = [1,2,3,4]
and
list2 = [3,4,5,6]
and the condition is a difference of at most 1, then the solution would have the pairs arranged in a list as [element from list1, element from list2, difference]. The solution would be
[[2,3,1],[3,3,0],[3,4,1],[4,3,1],[4,4,0],[4,5,1]]
Thanks
Basically, my idea is to sort the two lists, which is O(n log n), and then walk through them while keeping track of the distance to the next element. That way I don't compute all the pairs, only a subset of them, which gives roughly O(2*m*n), with m being the maximum distance allowed.
x = sorted([0, 2, 3, 4])
y = sorted([1, 3, 4, 5, 6])
index = 0  # position in y to return to when the scan has to restart
delta = 1  # maximum distance allowed
output = []
j = 0
value_2 = y[0]
no_more = False
number_of_operation = 0
for i, value_1 in enumerate(x):
    print(f'Testing for this {value_1}')
    skip = False
    try:
        # Distance from the current element of x to the next one
        next_value_at = x[i + 1] - value_1
        if next_value_at > delta:
            skip = True
            print('We can directly skip to next')
    except IndexError:
        print('At the end of list')
    # Advance through y while its value is not more than delta above value_1
    while value_2 - value_1 <= delta:
        number_of_operation += 1
        print(value_1, value_2)
        try:
            if abs(value_1 - value_2) <= delta:
                output += [[value_1, value_2, value_1 - value_2]]
            j += 1
            value_2 = y[j]
            print(value_1, value_2)
            continue
        except IndexError:
            # Ran past the end of y
            no_more = True
            print('end of list')
            break
    if not skip:
        # The next element of x is close, so rescan y from the saved position
        print("Going back")
        j = index
        value_2 = y[index]
    else:
        index = j
    if no_more:
        print('end')
        break
print(number_of_operation)
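The same sort-then-scan idea can also be sketched as a sliding window, which may be easier to check by hand. This is only a sketch under assumptions: the function name `close_pairs` is made up for illustration, it uses `<=` to match the question's example (switch to `<` for a strict threshold such as 0.5), and it stores the absolute difference. Tracing the loop above by hand, it appears to skip the pair (2, 1) for the sample data, so the window below keeps a lower bound that only moves forward once a value of the second list is too small for every remaining element of the first.

```python
def close_pairs(list1, list2, delta):
    """Yield [a, b, |a - b|] for every a in list1, b in list2 with |a - b| <= delta."""
    xs = sorted(list1)
    ys = sorted(list2)
    start = 0  # first index in ys that can still be within delta of the current a
    for a in xs:
        # Values below a - delta are too small for this a and for every later a.
        while start < len(ys) and ys[start] < a - delta:
            start += 1
        j = start
        # Scan forward only while b is at most delta above a.
        while j < len(ys) and ys[j] <= a + delta:
            yield [a, ys[j], abs(a - ys[j])]
            j += 1


pairs = list(close_pairs([0, 2, 3, 4], [1, 3, 4, 5, 6], 1))
print(pairs)
```

After sorting, the work is proportional to the length of the lists plus the number of pairs reported, so it stays fast as long as only a few elements fall within `delta` of each other.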
Use numpy's broadcasting:
import numpy as np
x = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = np.array([3, 4, 5, 6]).reshape(1, -1)
diff = x - y
However, this approach still performs N^2 comparisons; it only takes advantage of numpy's speed optimizations.
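To get from the difference matrix to the list of pairs the question asks for, one option is to mask the matrix and read off the matching indices. The following is a sketch, not part of the original answer: the function name `close_pairs_broadcast` is hypothetical, and it uses `<=` so that it reproduces the question's example. Note that for a million elements per list the full matrix would have 10^12 entries, so this is realistic only for much smaller inputs or when processed in chunks.

```python
import numpy as np


def close_pairs_broadcast(list1, list2, delta):
    a = np.asarray(list1)
    b = np.asarray(list2)
    # A (len(a), 1) column against a (1, len(b)) row broadcasts
    # to the full (len(a), len(b)) matrix of absolute differences.
    diff = np.abs(a.reshape(-1, 1) - b.reshape(1, -1))
    # Row and column indices of every entry within the threshold.
    rows, cols = np.nonzero(diff <= delta)
    return np.column_stack((a[rows], b[cols], diff[rows, cols])).tolist()


print(close_pairs_broadcast([1, 2, 3, 4], [3, 4, 5, 6], 1))
```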
You might be able to avoid the O(N²) behavior if you sort your lists first (or better yet, if your lists are already sorted). Then you can step through them element by element. This would give you O(n log n) for the sorts plus O(n) to step through the elements. For example:
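list1 = range(0, 1000000)
list2 = range(999999, 1999999)

def getClose(list1, list2):
    # Both inputs must already be sorted in ascending order.
    c1, c2 = 0, 0
    while c1 < len(list1) and c2 < len(list2):
        if abs(list1[c1] - list2[c2]) <= 1:
            yield (list1[c1], list2[c2], abs(list1[c1] - list2[c2]))
        # Advance whichever side currently holds the smaller value.
        if list1[c1] < list2[c2]:
            c1 += 1
        else:
            c2 += 1

for n in getClose(list1, list2):
    print(*n)  # print the three values separated by spaces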
Produces...
999998 999999 1
999999 999999 0
999999 1000000 1
...relatively quickly, and much quicker than computing the full product of the two lists first.
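One caveat, as far as I can tell from tracing it: the single-step walk only reports the pair at the current two positions, so for the question's example it would not emit (4, 3, 1). If every pair within the threshold is needed, a variant using the standard-library `bisect` module can look up, for each element of one list, the whole range of close values in the other. This is a sketch under assumptions: `close_pairs_bisect` is a made-up name, and the threshold is inclusive to match the question's example.

```python
from bisect import bisect_left, bisect_right


def close_pairs_bisect(list1, list2, delta):
    ys = sorted(list2)  # binary search needs sorted data
    result = []
    for a in list1:
        # ys[lo:hi] holds exactly the values b with a - delta <= b <= a + delta.
        lo = bisect_left(ys, a - delta)
        hi = bisect_right(ys, a + delta)
        for b in ys[lo:hi]:
            result.append((a, b, abs(a - b)))
    return result


print(close_pairs_bisect([1, 2, 3, 4], [3, 4, 5, 6], 1))
```

This costs O(n log n) for the sort, O(log n) per lookup, and time proportional to the number of pairs reported, so it is no longer strictly linear when many elements are close together.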