Speeding up the conversion of a list of tuples to a dict

Question

I have a list l consisting of tuples of length 5. The first four entries are strings, the last one is an integer. A dummy function to create such a list may look as follows:

import numpy as np
import uuid
def get_dummy_data(n=10000):
    l = []
    for i in range(n):
        name = np.random.choice(["Cat", "Dog", "Duck"], 1)[0]
        c_id = uuid.uuid4().hex
        t_id = uuid.uuid4().hex
        l.append((c_id, t_id, name, "canFly", 1))
        if np.random.random() < 0.8:
            l.append((c_id, t_id, name, "isHungry", 0))
    return l

Now this list l contains tuples which have identical first three elements but differ in the last two. This is exemplified here by appending, with 80% probability, a second tuple that shares the first three elements but has different last two elements.

The goal is to convert this list of length-5 tuples into a dictionary in which the key is the first entry of the tuple (c_id) and the value is structured like this: (t_id, (name, {"isHungry": 0})) or this: (t_id, (name, {"canFly": 1, "isHungry": 0})).

This can be achieved by the following loop:

res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]})

The question is now: can I make this faster? There might be more than two tuples in the list l with the same c_id (in contrast to the get_dummy_data function) and we cannot assume any order in l. I always have a bad feeling when writing an explicit for loop to fill a dict, so I bet there is a good way to make this faster.
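To make the target structure concrete, here is a small sketch that wraps the loop above in a function and runs it on three hand-written rows. The function name build_dict and the ids "c1", "t1", and so on are made up for illustration only; the rows are deliberately out of order, since no ordering of l can be assumed.

```python
def build_dict(rows):
    """Group 5-tuples by their first element, using the loop from the question."""
    res = {}
    for y in rows:
        if y[0] not in res:
            # First time this c_id is seen: create the nested value.
            res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
        else:
            # c_id already present: add the attribute to the inner dict.
            res[y[0]][1][1].update({y[3]: y[4]})
    return res


# Hypothetical ids, chosen only for illustration.
sample = [
    ("c1", "t1", "Duck", "canFly", 1),
    ("c2", "t2", "Cat", "isHungry", 0),
    ("c1", "t1", "Duck", "isHungry", 0),
]

result = build_dict(sample)
print(result["c1"])  # ('t1', ('Duck', {'canFly': 1, 'isHungry': 0}))
print(result["c2"])  # ('t2', ('Cat', {'isHungry': 0}))
```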

Answer 1

You can do basic micro-optimizations that also make your code more readable. A big one is to avoid some_dict.update({x: y}) and use some_dict[x] = y instead. Here are some timing differences:

In [12]: %%timeit
    ...: res = {}
    ...: for y in data:
    ...:     if y[0] not in res:
    ...:         res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    ...:     else:
    ...:         res[y[0]][1][1].update({y[3]: y[4]})
    ...:
100 loops, best of 3: 15.3 ms per loop

In [13]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1][d] = e
    ...:
100 loops, best of 3: 11 ms per loop

For comparison, here is the unpacked version still using .update. Note that each y[...] is effectively a method call (an index lookup), which slows things down. But the biggest component of the time savings was avoiding .update({...}), because that approach requires the creation of a whole dict object for no good reason:

In [18]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1].update({d:e})
    ...:
100 loops, best of 3: 13.8 ms per loop
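Outside IPython, a similar comparison can be sketched with the standard timeit module. The function names and the synthetic data below are assumptions made for this example, not part of the original answer, and the absolute timings will differ by machine.

```python
import timeit


def with_update(data):
    res = {}
    for a, b, c, d, e in data:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            # Builds a throwaway one-item dict on every call.
            res[a][1][1].update({d: e})
    return res


def with_assignment(data):
    res = {}
    for a, b, c, d, e in data:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            # Plain item assignment, no temporary dict.
            res[a][1][1][d] = e
    return res


# Synthetic data (an assumption): 10000 ids, each with two attribute rows.
data = []
for i in range(10000):
    data.append((f"c{i}", f"t{i}", "Duck", "canFly", 1))
    data.append((f"c{i}", f"t{i}", "Duck", "isHungry", 0))

# Both variants must build the same dictionary.
assert with_update(data) == with_assignment(data)

for fn in (with_update, with_assignment):
    best = min(timeit.repeat(lambda: fn(data), number=20, repeat=3))
    print(f"{fn.__name__}: {best / 20 * 1e3:.2f} ms per call")
```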

Answer 2

This kind of loop is generally slow:

res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]})

because the key is looked up in the dictionary twice (once for the membership test and once for the access), and there's the if/else statement.

I would use the late-binding property of variables in a lambda, plus unpacking (borrowed from juanpa's answer):

import collections
res = collections.defaultdict(lambda : (b, (c, {d: e})))

for a,b,c,d,e in l:
    res[a][1][1][d] = e

If the key isn't in the dictionary, defaultdict creates the entry using the current values of b, c, d and e (thanks to the lambda evaluating the variables when it is executed, not when it is declared), saving the test and creating the proper entry each time. Now the update part is a bit redundant for a newly created entry, but it should still be faster because there's no if/else test.
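The late-binding behaviour can be checked with a small sketch that compares the defaultdict version against the explicit loop. The function names and sample rows are hypothetical; converting the result back to a plain dict is an extra step added here so that later lookups of missing keys do not silently create entries.

```python
import collections


def build_with_defaultdict(rows):
    # The lambda reads b, c, d, e only when it is called, so it sees the
    # values of the row being processed at that moment (late binding).
    res = collections.defaultdict(lambda: (b, (c, {d: e})))
    for a, b, c, d, e in rows:
        res[a][1][1][d] = e
    return dict(res)


def build_with_loop(rows):
    # Reference implementation with the explicit membership test.
    res = {}
    for a, b, c, d, e in rows:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            res[a][1][1][d] = e
    return res


# Hypothetical ids, chosen only for illustration.
sample = [
    ("c1", "t1", "Duck", "canFly", 1),
    ("c2", "t2", "Cat", "isHungry", 0),
    ("c1", "t1", "Duck", "isHungry", 0),
]

assert build_with_defaultdict(sample) == build_with_loop(sample)
print(build_with_defaultdict(sample)["c1"])
# ('t1', ('Duck', {'canFly': 1, 'isHungry': 0}))
```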

This solution is faster than juanpa's (already good) answer on my machine (0.23 seconds vs 0.27 seconds). I would call that a good collaborative effort, since my first version was slower.

Answer 1: score 2, accepted, 2018-04-26 20:49:39

Answer 2: score 1, 2018-04-26 20:50:24
