简体   繁体   English

如何在大整数列表中快速找到2个列表元素的第一个多个?

[英]How to quickly find first multiple of 2 of list element in list of large integers?

I have a list of approximately 100 000 sorted even integers in the range 10^12 to 10^14. 我有一个大约10万个排序甚至整数的列表,范围是10 ^ 12到10 ^ 14。 My goal is to find the first integer x in the list such that x*2 is also a member of the list. 我的目标是在列表中找到第一个整数x ,使得x*2也是列表的成员。 As my list is rather long speed is very important. 由于我的列表相当长,速度非常重要。

My first thought was to just iterate over the list and check whether each element when multiplied by 2 was also in the list, but upon implementing this it is clear this is too slow for my purposes. 我的第一个想法是迭代列表并检查每个元素乘以2是否也在列表中,但是在实现这一点时很明显这对我的目的来说太慢了。

My next thought was to decompose each element in the list into its prime decomposition with SymPy's factorint , and then search my decomposed list for the same decomposition except with an extra 2. This didn't turn out to be any quicker obviously, but I feel like there must be a way using prime decomposition, if not something else. 我的下一个想法是使用SymPy的factorint将列表中的每个元素分解为其主要分解,然后搜索我的分解列表以进行相同的分解,除了额外的2。这显然没有更快,但我觉得就像必须有一种使用素数分解的方法,如果不是别的话。

You can iterate on your list with two iterators: one pointing to the current element and one pointing to the first one greater or equal to it's double. 您可以使用两个迭代器迭代列表:一个指向当前元素,一个指向第一个大于或等于它的double。 This will be O(N). 这将是O(N)。

Here is a draft of the idea: 这是一个想法的草稿:

l = [1, 3, 5, 7, 10, 12, 15]

# ...
j = 0
for i in range(0, len(l)):
    while l[j] < 2*l[i]:
        j += 1
        if j == len(l):
            return -1
    if l[j] == 2*l[i]:
        return i

Edit: Following comments in another answer, a few performances tests show that this version will be much faster (3 times in my tests) by eliminating multiplications, calls to len and reducing number of item retrievals in the list: 编辑:在另一个答案的评论之后,一些表演测试表明,通过消除乘法,调用len并减少列表中的项目检索数量,这个版本将更快(在我的测试中为3次):

j = 0
s = len(l)
for i in range(0, s):
    l_i = l[i]
    l_i2 = l_i<<1
    while l[j] < l_i2:
        j += 1
        if j == s:
            return -1
    if l[j] == l_i2:
        return i

The Colin Pitrat's solution is very good, but i guess you can top it when using sortedcontainers.SortedSet Colin Pitrat的解决方案非常好,但我猜你在使用sortedcontainers.SortedSet时可以超越

I have generated a file with 1.000.000 random numbers (using numpy.random.randint) and checked that there is at least one number (in the middle of the file), which satisfies your condition. 我已生成一个包含1.000.000随机数的文件(使用numpy.random.randint),并检查至少有一个数字(在文件中间),它满足您的条件。

Let's compare both approaches: 让我们比较两种方法:

import filecmp
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def save_result(i, fn):
    with open(fn, 'a') as f:
        print(i, file=f)

def test_Colin_Pitrat():
    j = 0
    for i in range(0, len(l)):
        while l[j] < 2*l[i]:
            j += 1
            if j == len(l):
                return -1
        if l[j] == 2*l[i]:
            save_result(l[i], 'Colin_Pitrat.out')
            return l[i]


def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            save_result(i, 'SortedSet.out')
            return i
    return -1


if __name__=='__main__':
    timeit_number = 10000
    l = load2list()
    sorted_set = SortedSet(l)
    print('len(l):\t\t%d' % (len(l)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))
    print('Timeit results with %d executions:' %timeit_number)
    print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number))
    print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number))
    print("filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'):\t%s" % (filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out')))

Output: 输出:

len(l):         1000001
len(sorted_set):        999504
Timeit results with 10000 executions:
Colin_Pitrat:    35.94529931032006
SortedSet:       2.548847197918647
filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'):       True

PS as you can see SortedSet is very fast. PS,你可以看到SortedSet非常快。

UPDATE: (now i'm testing at home, where my PC is much slower , so i will reduce the number of executions to 1.000) 更新:(现在我在家测试,我的PC 速度要慢得多 ,所以我会将执行次数减少到1.000)

As Colin Pitrat proposed i'm generating now data (approx. 100.000 numbers) for the worst scenario - when no match can be found. 正如Colin Pitrat建议我现在为最坏的情况生成数据(大约100.000个数字) - 当时找不到匹配项。 Now i will compare three functions: test_Colin_Pitrat, test_Colin_Pitrat2 (tuned version), test_SortedSet... 现在我将比较三个函数:test_Colin_Pitrat,test_Colin_Pitrat2(调优版),test_SortedSet ...

Data generator script: 数据生成器脚本:

import numpy as np

l = np.random.randint(10**7, 10**9, 200000)
l = l[ l % 2 > 0 ]
np.savetxt('data', np.sort(l), fmt='%d')

Code: 码:

import filecmp
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def save_result(i, fn):
    with open(fn, 'a') as f:
        print(i, file=f)

def test_Colin_Pitrat():
    j = 0
    for i in range(0, len(l)):
        while l[j] < 2*l[i]:
            j += 1
            if j == len(l):
                return -1
        if l[j] == 2*l[i]:
            return l[i]

def test_Colin_Pitrat2():
    j = 0
    s = len(l)
    for i in range(0, s):
        l_i = l[i]
        l_i2 = l_i<<1
        while l[j] < l_i2:
            j += 1
            if j == s:
                return -1
        if l[j] == l_i2:
            return l[j]


def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            return i
    return -1


if __name__=='__main__':
    timeit_number = 1000
    l = load2list()
    sorted_set = SortedSet(l)
    print('len(l):\t\t%d' % (len(l)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))
    print('Timeit results for %d executions:' %timeit_number)
    print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number))
    print('Colin_Pitrat2:\t', Timer(test_Colin_Pitrat2).timeit(timeit_number))
    print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number))

Output: 输出:

len(l):         99909
len(sorted_set):        99899
Timeit results for 1000 executions:
Colin_Pitrat:    153.04753258882357
Colin_Pitrat2:   103.68264272815443
SortedSet:       99.59669211136577

Conclusion: Colin_Pitrat2 is 33% faster compared to Colin_Pitrat and almost as fast as SortedSet. 结论:与Colin_Pitrat相比,Colin_Pitrat2快33%,几乎和SortedSet一样快。

I will add yet another answer, because i've already overloaded my previous one... 我将再添加一个答案,因为我已经超载了我以前的一个......

Now I came up with a new idea - intersection of SortedSets, which works extremely fast comparing to modified Colin's and to my previous solution. 现在我想出了一个新想法 - SortedSets的交集,与修改后的Colin和我以前的解决方案相比,它的工作速度非常快。 The idea is to generate two SortedSets: 想法是生成两个SortedSet:

l2 - is an intersection of two lists/sets: the original one l and the one containing all elements from l multiplied by 2: SortedSet(x*2 for x in l) . l2 - 是两个列表/集合的交集:原始的l和包含l所有元素的一个乘以2: SortedSet(x*2 for x in l)

l1 - is a SortedSet containing all elements belonging to l2 , divided by 2: SortedSet(x//2 for x in l2) l1 - 是一个SortedSet,包含属于l2所有元素,除以2: SortedSet(x//2 for x in l2)

Code: 码:

from datetime import datetime as dt
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def test_Colin_Pitrat2():
    j = 0
    s = len(l)
    for i in range(0, s):
        l_i = l[i]
        l_i2 = l_i<<1
        while l[j] < l_i2:
            j += 1
            if j == s:
                return -1
        if l[j] == l_i2:
            return l[i]

def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            return i
    return -1

def test_SetIntersection():
    for i in l1:
        if i*2 in l2:
            return i
    return -1

if __name__=='__main__':
    start_ts = dt.now()
    timeit_number = 10000
    l =  load2list()
    print('load2list() took:\t%d microseconds' %((dt.now() - start_ts).microseconds))

    start_ts = dt.now()
    sorted_set = SortedSet(l)
    l2 = SortedSet(l).intersection(SortedSet(x*2 for x in l))
    l1 = SortedSet(x//2 for x in l2)
    print('preparing sets took:\t%d microseconds' %((dt.now() - start_ts).microseconds))

    print('len(l):\t\t%d' % (len(l)))
    print('len(l1):\t%d' % (len(l1)))
    print('len(l2):\t%d' % (len(l2)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))

    print('Timeit results for %d executions:' %timeit_number)
    print('Colin_Pitrat2:\t\t', Timer(test_Colin_Pitrat2).timeit(timeit_number))
    print('SortedSet:\t\t', Timer(test_SortedSet).timeit(timeit_number))
    print('SetIntersection:\t', Timer(test_SetIntersection).timeit(timeit_number))

Output: 输出:

load2list() took:       230023 microseconds
preparing sets took:    58106 microseconds
len(l):         498786
len(l1):        256
len(l2):        256
len(sorted_set):        498562
Timeit results for 10000 executions:
Colin_Pitrat2:           23.557948959065648
SortedSet:               6.658937808213555
SetIntersection:         0.012540539222982261

PS i really like this question, because it gave me a chance to learn something new. PS我真的很喜欢这个问题,因为它给了我一个学习新东西的机会。 And I also like the Colin's algorithm - it's smart. 我也喜欢Colin的算法 - 它很聪明。 I've already upvoted it. 我已经赞成了它。

You can divide your list in two so that the lower list contains all numbers from the first one on the original list to the last number that is equal to or smaller than half the maximum value on your list. 您可以将列表分成两部分,以便下面的列表包含从原始列表中第一个到最后一个数字的所有数字,该数字等于或小于列表中最大值的一半。 That will speed up your search a little bit as you will iterate over only half of the list and search if x*2 is on the other half. 这将加快你的搜索速度,因为你将只迭代列表的一半,并搜索x * 2是否在另一半。 See example bellow 见下面的例子

l = [1, 3, 5, 7, 10, 12, 15]

max_l = l[-1]

for i in range(len(l)):
  if l[i] > max_l/2:
    break
  pos_mid_l = i

print pos_mid_l # returns position 3

lower_l = l[:pos_mid_l+1]
upper_l = l[pos_mid_l+1:]

print lower_l   # returns lower half of your list
print upper_l   # returns upper half of your list

for i in  lower_l:
  if i < upper_l[0]/2:
    if i*2 in lower_l:
      ans = i
      break
  else:
    if i*2 in upper_l:
      ans = i
      break

print ans # Your answer :)

It seems to me the best algorithm would use a presorted list and a set. 在我看来,最好的算法将使用预先排序列表和一组。

#!/usr/local/pypy3-2.4.0/bin/pypy3

import time
import random


def get_random_numbers():
    result = []
    for counter in range(99998):
        random_number = (random.randrange(10**12, 10**14) // 2) * 2
        result.append(random_number)
    # Make sure there's at least one solution to the problem
    moderately_large = 10**14 // 2
    result.append(moderately_large)
    result.append(moderately_large * 2)
    # This sort is of course O(n*logn), but I don't think you're
    # counting that, are you?
    result.sort()
    return result


def get_best(list_):
    # This is O(n)
    set_ = set(list_)
    for value in list_:
        if value * 2 in set_:
            return value
    return None


def main():
    list_ = get_random_numbers()
    time0 = time.time()
    result = get_best(list_)
    time1 = time.time()
    print(result)
    print('{} seconds'.format(time1 - time0))


main()

HTH HTH

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM