
Make a dictionary out of the first elements in a list of lists

This is a question about the performance of using set() on a list comprehension inside a dictionary comprehension, versus a plain dictionary comprehension that repeatedly assigns keys into the new dictionary.

I happen to have a dataset which is a list of lists, and I need the unique set of elements indexed at 0 in each of the inner lists, so that I can build a new dictionary from them, something like dict.fromkeys(), where I need to supply the list of unique keys.
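As an aside, dict.fromkeys with a mutable default like [] would share a single list object across every key, which is why a comprehension is needed here at all; a minimal sketch of the pitfall:

data = [[1, 3, 4], [3, 5, 2], [1, 5, 2]]

# fromkeys reuses ONE list object as the value for every key
shared = dict.fromkeys(set(i[0] for i in data), [])
shared[1].append('x')
print(shared)        # {1: ['x'], 3: ['x']}, both keys see the append

# a dict comprehension builds an independent list per key
independent = {x: [] for x in set(i[0] for i in data)}
independent[1].append('x')
print(independent)   # {1: ['x'], 3: []}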

I'm using

[1] { x : [] for x in set([i[0] for i in data])}

and also using

[2] { i[0] : [] for i in data}

Sample data for reference could look like: [[1,3,4], [3,5,2], [1,5,2]]

The result from running either [1] or [2] above would then be: { 1: [], 3: [] }

I tried %timeit on both statements and both give nearly the same results, which makes it hard to tell which one is best, performance-wise, for a big list of lists.

How do I identify a potential bottleneck here?
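One way to probe this is to time both variants over growing inputs with the timeit module; a minimal sketch (the sizes and repeat counts here are arbitrary):

import timeit

data = [[1, 3, 4], [3, 5, 2], [1, 5, 2]]

for factor in (10000, 100000, 1000000):
    big = data * factor
    t1 = timeit.timeit(lambda: {x: [] for x in set([i[0] for i in big])}, number=5)
    t2 = timeit.timeit(lambda: {i[0]: [] for i in big}, number=5)
    print("n=%9d  set()+listcomp: %.3fs  plain dictcomp: %.3fs"
          % (len(big), t1, t2))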

EDIT:

If this helps in explaining the results:

In [172]: data_new = data * 10000

In [173]: %timeit { i[0] : [] for i in data_new}
10 loops, best of 3: 160 ms per loop

In [174]: %timeit { x : [] for x in set([i[0] for i in data_new])}
10 loops, best of 3: 131 ms per loop

In [175]: data_new = data * 100000

In [176]: %timeit { x : [] for x in set([i[0] for i in data_new])}
1 loops, best of 3: 1.37 s per loop

In [177]: %timeit { i[0] : [] for i in data_new}
1 loops, best of 3: 1.58 s per loop

In [178]: data_new = data * 1000000

In [179]: %timeit { i[0] : [] for i in data_new}
1 loops, best of 3: 15.8 s per loop

In [180]: %timeit { x : [] for x in set([i[0] for i in data_new])}
1 loops, best of 3: 13.6 s per loop

Build a larger dataset, then timeit:

Code:

import random
data = [ [random.randint(1, 9) for _ in range(3)] for _ in range(1000000)]

Timings:

%timeit { x : [] for x in set([i[0] for i in data])}
# 10 loops, best of 3: 94.6 ms per loop
%timeit { i[0] : [] for i in data}
# 10 loops, best of 3: 106 ms per loop
%timeit { x: [] for x in set(i[0] for i in data)}
# 10 loops, best of 3: 114 ms per loop
%timeit { x: [] for x in {i[0] for i in data}}
# 10 loops, best of 3: 77.7 ms per loop

Rationale:

Limiting the available key space first means the dictionary only has to assign (given the randint above) 9 unique keys to 9 new lists. With the plain dict comprehension, the dictionary has to repeatedly create a new empty list and re-assign it as the value of an already-existing key. The difference is the overhead of deallocating the discarded empty lists (which get garbage collected) plus the time spent creating each new empty list.

Given a uniform distribution from randint, there are roughly 111,111 allocations and deallocations of empty lists for each of the 9 unique values over a set of 1,000,000 elements, which is far more than just 9.
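That churn is easy to quantify; a quick sketch with collections.Counter over the same randint data (exact counts vary run to run):

import random
from collections import Counter

data = [[random.randint(1, 9) for _ in range(3)] for _ in range(1000000)]

counts = Counter(i[0] for i in data)
print(counts)                                # each of the 9 keys occurs ~111,111 times
print(sum(c - 1 for c in counts.values()))   # ~999,991 lists built only to be discarded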

It depends on how many duplicates you expect. In the shorter code, an empty list is created for every item in the input list, and this is surprisingly expensive. Use a static value, and the shorter one becomes the faster one.

In the following, L = [[1,3,4], [3,5,2], [1,5,2]] * 100000

In [1]: %timeit { x : [] for x in {i[0] for i in L}}
10 loops, best of 3: 58.9 ms per loop

In [2]: %timeit { i[0] : [] for i in L}
10 loops, best of 3: 68.1 ms per loop

Now test with the constant None value here:

In [3]: %timeit { x : None for x in set([i[0] for i in L])}
10 loops, best of 3: 59 ms per loop

In [4]: %timeit { i[0] : None for i in L}
10 loops, best of 3: 54.3 ms per loop

Thus the needless list creation makes the shorter one perform slowly, whereas with constant values it is clearly the faster of the two.


I didn't have IPython for Python 2 and I am a bit lazy about timing this, but note that Python 2.7 supports set comprehensions, which at least on Python 3.4 are much faster than creating sets from lists:

In [7]: %timeit { x : [] for x in {i[0] for i in L}}
10 loops, best of 3: 48.9 ms per loop
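A final sketch, on the assumption that the empty lists are just a staging step for grouping rows by their first element: collections.defaultdict(list) avoids pre-computing the unique keys altogether.

from collections import defaultdict

data = [[1, 3, 4], [3, 5, 2], [1, 5, 2]]

groups = defaultdict(list)
for row in data:
    # a fresh list is created lazily, once per unique key, on first access
    groups[row[0]].append(row[1:])

print(dict(groups))   # {1: [[3, 4], [5, 2]], 3: [[5, 2]]}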
