
Fast method to find indexes of duplicates in a list of >2,000,000 items

I have a list in which each item is a combination of two event IDs (this is just a fragment of a much larger list of pairs):

['10000381 10007121','10000381 10008989','10005169 10008989','10008989 10023817','10005169 10043265','10008989 10043265','10023817 10043265','10047097 10047137','10047097 10047265','10047137 10047265','10000381 10056453','10047265 10056453','10000381 10060557','10007121 10060557','10056453 10060557','10000381 10066013','10007121 10066013','10008989 10066013','10026233 10066013','10056453 10066013','10056453 10070153','10060557 10070153','10066013 10070153','10000381 10083798','10047265 10083798','10056453 10083798','10066013 10083798','10000381 10099969','10056453 10099969','10066013 10099969','10070153 10099969','10083798 10099969','10056453 10167029','10066013 10167029','10083798 10167029','10099969 10167029','10182073 10182085','10182073 10182177','10182085 10182177','10000381 10187233','10056453 10187233','10060557 10187233','10066013 10187233','10083798 10187233','10099969 10187233','10167029 10187233','10007121 10200685','10099969 10200685','10066013 10218005','10223905 10224013']

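I need to find every instance of each pair of ids and collect its indexes into a new list. Right now I have a few lines of code that do this for me. However, my list is more than 2,000,000 lines long, and it will keep growing as I process more data.

At the moment, the estimated time to completion is about 2 days.

I really just need a faster method.

I am working in a Jupyter notebook (on a Mac laptop).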

def compiler(idlist):
    groups = []
    for i in idlist:
        groups.append([index for index, x in enumerate(idlist) if x == i])
    return(groups)

I also tried:

def compiler(idlist):
    groups = []
    for k,i in enumerate(idlist):
        position = []
        for c,j in enumerate(idlist):
            if i == j:
                position.append(c)
        groups.append(position)
    return(groups)

What I want is something like this:

'10000381 10007121':[0]
'10000381 10008989':[1]
'10005169 10008989':[2,384775,864173,1297105,1321798,1555094,1611064,2078015]
'10008989 10023817':[3,1321800]
'10005169 10043265':[4,29113,864195,1297106,1611081]
[5,864196,2078017]
'10008989 10043265':[6,29114,384777,864198,1611085,1840733,2078019]
'10023817 10043265':[7,86626,384780,504434,792690,864215,1297108,1321801,1489784,1524527,1555096,1595763,1611098,1840734,1841280,1929457,1943701,1983362,2093820,2139917,2168437] etc.

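Each number in the brackets is an index of that pair in idlist.

Essentially, I want it to look at a pair of id values (e.g. '10000381 10007121'), run through the list, find every instance of that pair, and record every index at which the pair occurs. I need to do this for every item in the list, in a shorter amount of time.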

Instead of a list, use a dict, which makes the existence lookup O(1):

def compiler(idlist):
    # Map each pair to the list of indexes at which it occurs.
    groups = {}
    for idx, val in enumerate(idlist):
        if val in groups:
            groups[val].append(idx)
        else:
            groups[val] = [idx]
    return groups
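
As a quick illustration of the dict-based grouping, here is a minimal sketch on a small made-up list. The function name `index_positions` and the sample values are assumptions for illustration only, not part of the original post. It makes a single pass over the list, so the work grows linearly with the list length instead of quadratically.

```python
def index_positions(idlist):
    """Map each distinct item to the list of indexes where it occurs."""
    groups = {}
    for idx, val in enumerate(idlist):
        # setdefault creates the empty list the first time a value is seen.
        groups.setdefault(val, []).append(idx)
    return groups


# Hypothetical sample data standing in for the real id pairs.
sample = ['a b', 'c d', 'a b', 'e f', 'c d', 'a b']
result = index_positions(sample)
print(result)
# {'a b': [0, 2, 5], 'c d': [1, 4], 'e f': [3]}
```

The printed key order relies on dicts preserving insertion order, which is guaranteed from Python 3.7 onward.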

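You can use collections.OrderedDict to bring the time complexity down to O(n). Because it remembers insertion order, the groups come out in the order in which the various ids first appear: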

from collections import OrderedDict

groups = OrderedDict()
for i, v in enumerate(idlist):
    try:
        groups[v].append(i)
    except KeyError:
        groups[v] = [i]

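Then list(groups.values()) contains your final result.

If you have a lot of data, I suggest using PyPy3 instead of the CPython interpreter; you will get roughly 5x to 7x faster code execution.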
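
Note that the question's `compiler` returns one index list per input item (so duplicates repeat their list), whereas `list(groups.values())` yields one list per distinct pair. If the per-item shape is needed, a sketch along these lines could rebuild it from the grouped dict; the names `compiler_fast` and `compiler_slow` and the sample data are hypothetical.

```python
from collections import OrderedDict


def compiler_fast(idlist):
    """Return one index list per input item, built from a single grouping pass."""
    groups = OrderedDict()
    for i, v in enumerate(idlist):
        groups.setdefault(v, []).append(i)
    # Duplicate items share the same list object here; copy if you need to mutate.
    return [groups[v] for v in idlist]


def compiler_slow(idlist):
    """The quadratic approach from the question, kept only for comparison."""
    return [[index for index, x in enumerate(idlist) if x == i] for i in idlist]


# Hypothetical sample data standing in for the real id pairs.
sample = ['a b', 'c d', 'a b', 'e f', 'c d', 'a b']
per_item = compiler_fast(sample)
print(per_item)
# [[0, 2, 5], [1, 4], [0, 2, 5], [3], [1, 4], [0, 2, 5]]
```

Both functions give the same result on the sample; only the amount of work differs.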

Below is a time-based benchmark of these implementations under CPython and PyPy3, using 10,000 iterations:

Code:

from time import time
from collections import OrderedDict, defaultdict


def timeit(func, iteration=10000):
    def wraps(*args, **kwargs):
        start = time()
        for _ in range(iteration):
            result = func(*args, **kwargs)
        end = time()
        print("func: {name} [{iteration} iterations] took: {elapsed:2.4f} sec".format(
            name=func.__name__,
            iteration=iteration,
            args=args,
            kwargs=kwargs,
            elapsed=(end - start)
        ))
        return result
    return wraps


@timeit
def op_implementation(data):
    groups = []
    for k in data:
        groups.append([index for index, x in enumerate(data) if x == k])
    return groups


@timeit
def ordreddict_implementation(data):
    groups = OrderedDict()
    for k, v in enumerate(data):
        groups.setdefault(v, []).append(k)
    return groups


@timeit
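def defaultdict_implementation(data):
    groups = defaultdict(list)
    # Note: this splits every pair and indexes the individual ids
    # (positions in the flattened list), not the pairs themselves.
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        groups[v].append(k)
    return groups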


@timeit
def defaultdict_implementation_2(data):
    groups = defaultdict(list)
    for k, v in enumerate(map(lambda x: tuple(x.split()), data)):
        groups[v].append(k)
    return groups


@timeit
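def dict_implementation(data):
    groups = {}
    # Note: as in defaultdict_implementation, this indexes the individual
    # ids of the flattened list rather than whole pairs.
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        if v in groups:
            groups[v].append(k)
        else:
            groups[v] = [k]
    return groups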



if __name__ == '__main__':
    data = [
        '10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', 
        '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', 
        '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', 
        '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', 
        '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', 
        '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', 
        '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', 
        '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', 
        '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', 
        '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', 
        '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', 
        '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', 
        '10066013 10218005', '10223905 10224013'
    ]
    op_implementation(data)
    ordreddict_implementation(data)
    defaultdict_implementation(data)
    defaultdict_implementation_2(data)
    dict_implementation(data)

CPython:

func: op_implementation [10000 iterations] took: 1.3096 sec
func: ordreddict_implementation [10000 iterations] took: 0.1866 sec
func: defaultdict_implementation [10000 iterations] took: 0.3311 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.3817 sec
func: dict_implementation [10000 iterations] took: 0.3231 sec

Pypy3:

func: op_implementation [10000 iterations] took: 0.2370 sec
func: ordreddict_implementation [10000 iterations] took: 0.0243 sec
func: defaultdict_implementation [10000 iterations] took: 0.1216 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.1299 sec
func: dict_implementation [10000 iterations] took: 0.1175 sec

PyPy3 with 200,000 iterations:

func: op_implementation [200000 iterations] took: 4.6364 sec
func: ordreddict_implementation [200000 iterations] took: 0.3201 sec
func: defaultdict_implementation [200000 iterations] took: 2.2032 sec
func: defaultdict_implementation_2 [200000 iterations] took: 2.4052 sec
func: dict_implementation [200000 iterations] took: 2.2429 sec
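
The timings above repeat each function on the 50-item sample, which may not reflect behaviour on a list with millions of entries. As a rough sketch, one could time the grouping on a larger synthetic list with the standard-library `timeit` module. The helper names `make_pairs` and `group_indexes`, the list size, and the id range are all assumptions chosen for illustration; no timing figures are claimed here.

```python
import random
import timeit
from collections import defaultdict


def group_indexes(data):
    """Group the indexes of identical items in a single pass."""
    groups = defaultdict(list)
    for idx, pair in enumerate(data):
        groups[pair].append(idx)
    return groups


def make_pairs(n, id_count=1000, seed=0):
    """Build n synthetic 'id id' strings from a pool of id_count made-up ids."""
    rng = random.Random(seed)
    ids = [str(10000000 + i) for i in range(id_count)]
    return ['{} {}'.format(rng.choice(ids), rng.choice(ids)) for _ in range(n)]


data = make_pairs(200000)
# Time three full passes over the synthetic list.
elapsed = timeit.timeit(lambda: group_indexes(data), number=3)
print('3 runs over {} items took {:.4f} sec'.format(len(data), elapsed))
```

Raising `n` toward the real list size gives a more realistic estimate than repeating the small sample many times.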
