Improve conversion speed of list of tuples to dict
I have a list l consisting of tuples of length 5. The first four entries are strings, the last one is an integer. A dummy function to create such a list may look as follows:
import numpy as np
import uuid

def get_dummy_data(n=10000):
    l = []
    for i in range(n):
        name = np.random.choice(["Cat", "Dog", "Duck"], 1)[0]
        c_id = uuid.uuid4().hex
        t_id = uuid.uuid4().hex
        l.append((c_id, t_id, name, "canFly", 1))
        # With 80% probability, add a second tuple sharing the first three entries
        if np.random.random() < 0.8:
            l.append((c_id, t_id, name, "isHungry", 0))
    return l
Now this list l contains tuples which have identical first three elements but differ in the last two. This is exemplified here by appending the same tuple again with 80% chance but changing the last two elements.
The goal is to convert this list of length-5 tuples into a dictionary in which the key is the first entry of the tuple (c_id) and the value is structured like this: (t_id, (name, {"isHungry": 0})) or like this: (t_id, (name, {"canFly": 1, "isHungry": 0})).
This can be achieved by the following loop:
res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]})
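To make the target structure concrete, here is the same loop applied to a tiny hand-written input. The three rows and the shortened ids ("c1", "t1", ...) are made-up placeholders, and build is just a hypothetical wrapper name for the loop above.

```python
# Hypothetical three-row input; real ids would be uuid4 hex strings.
rows = [
    ("c1", "t1", "Duck", "canFly", 1),
    ("c2", "t2", "Dog", "isHungry", 0),
    ("c1", "t1", "Duck", "isHungry", 0),  # same c_id as the first row
]

def build(rows):
    """Group rows by c_id into {c_id: (t_id, (name, {attribute: value}))}."""
    res = {}
    for y in rows:
        if y[0] not in res:
            # First time this c_id is seen: create the nested value
            res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
        else:
            # c_id already present: add the attribute to the inner dict
            res[y[0]][1][1].update({y[3]: y[4]})
    return res

result = build(rows)
print(result)
```

The two "c1" rows end up merged into one entry whose inner dict holds both attributes, while "c2" keeps a single attribute.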
The question is now: can I make this faster? There might be more than two tuples in the list l with the same c_id (in contrast to the get_dummy_data function), and we cannot assume any order in l. I always have a bad feeling when writing an explicit for loop to fill a dict, so I bet there is a good way to make this faster.
You can do basic micro-optimizations that also make your code more readable. A big one is using some_dict[x] = y instead of some_dict.update({x: y}). Here are some timing differences:
In [12]: %%timeit
    ...: res = {}
    ...: for y in data:
    ...:     if y[0] not in res:
    ...:         res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    ...:     else:
    ...:         res[y[0]][1][1].update({y[3]: y[4]})
    ...:
100 loops, best of 3: 15.3 ms per loop
In [13]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1][d] = e
    ...:
100 loops, best of 3: 11 ms per loop
Note that each y[...] is a method call, which slows things down. But the biggest component of the time savings was avoiding .update({...}): that approach requires the creation of a whole dict object for no good reason. Here is the unpacking version with .update kept, for comparison:
In [18]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1].update({d:e})
    ...:
100 loops, best of 3: 13.8 ms per loop
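Outside IPython, the same comparison can be sketched with the standard timeit module. This is only a rough harness, not the setup used for the numbers above: make_data is a hypothetical stdlib-only stand-in for get_dummy_data, the three function names are made up for illustration, and absolute timings will differ by machine and Python version.

```python
import random
import timeit
import uuid

def make_data(n=10000):
    """Hypothetical stand-in for get_dummy_data, using only the standard library."""
    data = []
    for _ in range(n):
        name = random.choice(["Cat", "Dog", "Duck"])
        c_id, t_id = uuid.uuid4().hex, uuid.uuid4().hex
        data.append((c_id, t_id, name, "canFly", 1))
        if random.random() < 0.8:
            data.append((c_id, t_id, name, "isHungry", 0))
    return data

def with_indexing(data):
    # Original loop: repeated y[...] indexing plus .update({...})
    res = {}
    for y in data:
        if y[0] not in res:
            res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
        else:
            res[y[0]][1][1].update({y[3]: y[4]})
    return res

def with_unpacking(data):
    # Tuple unpacking plus direct item assignment
    res = {}
    for a, b, c, d, e in data:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            res[a][1][1][d] = e
    return res

def with_unpacking_update(data):
    # Tuple unpacking, but still building a throwaway dict for .update
    res = {}
    for a, b, c, d, e in data:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            res[a][1][1].update({d: e})
    return res

data = make_data()
for fn in (with_indexing, with_unpacking, with_unpacking_update):
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(data), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per loop")
```

All three variants build the same dictionary, so any difference in the printed timings comes from how the loop body is written.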
This kind of loop is generally slow:
res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]})
because the key is looked up in the dictionary twice (once for the membership test, once for the assignment or retrieval), and there is the if/else statement on top of that.
I would use the late binding of variables in a lambda, together with unpacking (borrowed from juanpa's answer):
import collections

res = collections.defaultdict(lambda : (b, (c, {d: e})))
for a,b,c,d,e in l:
    res[a][1][1][d] = e
If the key isn't in the dictionary, defaultdict creates the entry using the current values of b, c, d and e (thanks to the lambda evaluating those names when it is called, not when it is declared), which saves the membership test and creates the proper value each time. For a newly created key the assignment res[a][1][1][d] = e is a bit redundant, but it should still be faster because there's no if/else test.
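A minimal sketch of that late-binding behaviour on a made-up three-row input follows. The wrapper name build_with_defaultdict and the final conversion to a plain dict are my own assumptions, not part of the answer above; the conversion costs a little time but means that looking up an unknown key afterwards raises KeyError instead of silently creating an entry from whatever b, c, d and e last held.

```python
import collections

# Hypothetical input; the ids are shortened placeholders.
rows = [
    ("c1", "t1", "Duck", "canFly", 1),
    ("c2", "t2", "Dog", "isHungry", 0),
    ("c1", "t1", "Duck", "isHungry", 0),
]

def build_with_defaultdict(rows):
    # The lambda body reads b, c, d and e only when it is called, i.e. when
    # res[a] hits a missing key, so it sees the values of the current row.
    res = collections.defaultdict(lambda: (b, (c, {d: e})))
    for a, b, c, d, e in rows:
        # Redundant for a freshly created key, needed for an existing one
        res[a][1][1][d] = e
    # Assumption: return a plain dict so later lookups cannot create entries
    return dict(res)

result = build_with_defaultdict(rows)
print(result)
```

Here the loop variables are locals of the function and the lambda closes over them; at module level, as in the snippet above, the lambda would read them as globals with the same effect.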
This solution is faster than juanpa's (already good) answer on my machine (0.23 seconds vs 0.27 seconds). I would call that a good collaborative effort, since my first version was slower.