简体   繁体   English

获得列表联合的最快方法 - Python

[英]Fastest way to get union of lists - Python

There's a C++ comparison to get union of lists from lists of lists: The fastest way to find union of sets 有一个C ++比较从列表列表中获取列表的并集找到集合的最快方法

And there's several other python related questions but none suggest the fastest way to unionize the lists: 还有其他几个与python相关的问题,但没有一个提出了加入列表的最快方法:

From the answers, I've gathered that there are at least 2 ways to do it: 从答案中,我发现至少有两种方法可以做到:

>>> from itertools import chain
>>> x = [[1,2,3], [3,4,5], [1,7,8]]
>>> list(set().union(*x))
[1, 2, 3, 4, 5, 7, 8]
>>> list(set(chain(*x)))
[1, 2, 3, 4, 5, 7, 8]

Note that I'm casting the set to list afterwards because I need the order of the list to be fixed for further processing. 请注意,我之后将该集转换为列表,因为我需要修复列表的顺序以进行进一步处理。

After some comparison, it seems like list(set(chain(*x))) is more stable and takes less time: 经过一些比较,看起来像list(set(chain(*x)))更稳定,花费的时间更少:

from itertools import chain
import time
import random

# Dry run.
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)]
list(set().union(*x))
list(set(chain(*x)))

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]
    start = time.time()
    y = list(set().union(*x))
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time
    start = time.time()
    z = list(set(chain(*x)))
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time
    assert sorted(y) == sorted(z)
    #print 

print y_time / 1000.
print z_time / 1000. 

[out]: [OUT]:

1.39586925507e-05
1.09834671021e-05

Taking out the variable of casting sets to list: 取出铸造集的变量列表:

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]

    start = time.time()
    y = set().union(*x)
    y_time += time.time() - start 

    start = time.time()
    z = set(chain(*x))
    z_time += time.time() - start 

    assert sorted(y) == sorted(z)

print y_time / 1000.
print z_time / 1000. 

[out]: [OUT]:

1.22241973877e-05
1.02684497833e-05

Here's the full output when I try to print the intermediate timings (without list casting): http://pastebin.com/raw/y3i6dXZ8 这是我尝试打印中间时间时的完整输出(没有列表转换): http//pastebin.com/raw/y3i6dXZ8

Why is it that list(set(chain(*x))) takes less time than list(set().union(*x)) ? 为什么list(set(chain(*x)))list(set().union(*x))花费的时间少?

Is there another way of achieving the same union of lists? 是否有其他方法可以实现相同的列表联合? Using numpy or pandas or sframe or something? 使用numpypandassframe或什么? Is the alternative faster? 替代方案更快吗?

What's fastest depends on the nature of x -- whether it is a long list or a short list, with many sublists or few sublists, whether the sublists are long or short, and whether there are many duplicates or few duplicates. 什么是最快的取决于x的性质 - 无论是长列表还是短列表,有多个子列表或少数子列表,子列表是长还是短,以及是否有许多重复或少量重复。

Here are some timeit results comparing some alternatives. 以下是比较一些替代方案的一些时间结果。 There are so many possibilities that this is by no means a complete analysis, but perhaps this will give you a framework for studying your use case. 有很多可能性,这绝不是一个完整的分析,但也许这将为您提供一个研究您的用例的框架。

func                 | x                    | time
unique_concatenate   | many_uniques         | 0.863
empty_set_union      | many_uniques         | 1.191
short_set_union_rest | many_uniques         | 1.192
long_set_union_rest  | many_uniques         | 1.194
set_chain            | many_uniques         | 1.224

func                 | x                    | time
long_set_union_rest  | many_duplicates      | 0.958
short_set_union_rest | many_duplicates      | 0.969
empty_set_union      | many_duplicates      | 0.971
set_chain            | many_duplicates      | 1.128
unique_concatenate   | many_duplicates      | 2.411

func                 | x                    | time
empty_set_union      | many_small_lists     | 1.023
long_set_union_rest  | many_small_lists     | 1.028
set_chain            | many_small_lists     | 1.032
short_set_union_rest | many_small_lists     | 1.036
unique_concatenate   | many_small_lists     | 1.351

func                 | x                    | time
long_set_union_rest  | few_large_lists      | 0.791
empty_set_union      | few_large_lists      | 0.813
unique_concatenate   | few_large_lists      | 0.814
set_chain            | few_large_lists      | 0.829
short_set_union_rest | few_large_lists      | 0.849

Be sure to run the timeit benchmarks on your own machine since results may vary. 请务必在您自己的计算机上运行timeit基准测试,因为结果可能会有所不同。


from __future__ import print_function
import random
import timeit
from itertools import chain
import numpy as np

def unique_concatenate(x):
    return np.unique(np.concatenate(x))

def short_set_union_rest(x):
    # This assumes x[0] is the shortest list in x
    return list(set(x[0]).union(*x[1:]))

def long_set_union_rest(x):
    # This assumes x[-1] is the longest list in x
    return list(set(x[-1]).union(*x[1:]))

def empty_set_union(x):
    return list(set().union(*x))

def set_chain(x):
    return list(set(chain(*x)))


big_range = list(range(10**7))
small_range = list(range(10**5))
many_uniques = [[random.choice(big_range) for i in range(j)] 
                for j in range(10, 10000, 10)]
many_duplicates = [[random.choice(small_range) for i in range(j)] 
              for j in range(10, 10000, 10)]
many_small_lists = [[random.choice(big_range) for i in range(10)] 
                    for j in range(10, 10000, 10)]
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
                    for j in range(10, 100, 10)]

if __name__=='__main__':
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
                 ('many_small_lists', 800), ('few_large_lists', 800)]:
        timing = dict()
        for func in [
                'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest',
                'empty_set_union', 'set_chain']:
            timing[func, x] = timeit.timeit(
                '{}({})'.format(func, x), number=n,
                setup='from __main__ import {}, {}'.format(func, x))
        print('{:20} | {:20} | {}'.format('func', 'x', 'time'))
        for key, t in sorted(timing.items(), key=lambda item: item[1]):
            func, x = key
            print('{:20} | {:20} | {:.3f}'.format(func, x, t))
        print(end='\n')

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM