
Improve conversion speed of list of tuples to dict

I have a list l consisting of tuples of length 5. The first four entries are strings, the last one is an integer. A dummy function to create such a list may look as follows:

import numpy as np
import uuid
def get_dummy_data(n=10000):
    l = []
    for i in range(n):
        name = np.random.choice(["Cat", "Dog", "Duck"], 1)[0]
        c_id = uuid.uuid4().hex
        t_id = uuid.uuid4().hex
        l.append((c_id, t_id, name, "canFly", 1))
        if np.random.random() < 0.8:
            l.append((c_id, t_id, name, "isHungry", 0))
    return l

The list l contains tuples that share their first three elements but differ in the last two. The dummy function illustrates this by appending, with 80% probability, a second tuple that has the same first three elements but different last two.

The goal is to convert this list of length-5 tuples into a dictionary in which the key is the first entry of the tuple (c_id) and the value is structured like this: (t_id, (name, {"isHungry": 0})) or this: (t_id, (name, {"canFly": 1, "isHungry": 0})).

This can be achieved by the following loop:

res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]}) 
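
To make the target structure concrete, here is a small sketch that wraps the loop above in a function and runs it on three hand-written rows. The function name tuples_to_dict and the shortened ids ("c1", "t1", ...) are made up for illustration; they are not part of the original data.

```python
def tuples_to_dict(rows):
    """Group 5-tuples by their first element (c_id), exactly as the loop above does."""
    res = {}
    for y in rows:
        if y[0] not in res:
            # First time this c_id is seen: create the nested value.
            res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
        else:
            # c_id already present: add the new attribute to the inner dict.
            res[y[0]][1][1].update({y[3]: y[4]})
    return res


# Hypothetical sample rows, deliberately not sorted by c_id.
rows = [
    ("c1", "t1", "Duck", "canFly", 1),
    ("c2", "t2", "Dog", "isHungry", 0),
    ("c1", "t1", "Duck", "isHungry", 0),
]

result = tuples_to_dict(rows)
print(result)
# {'c1': ('t1', ('Duck', {'canFly': 1, 'isHungry': 0})), 'c2': ('t2', ('Dog', {'isHungry': 0}))}
```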

The question is now: can I make this faster? There might be more than two tuples in the list l with the same c_id (in contrast to the get_dummy_data function), and we cannot assume any order in l. I always have a bad feeling when writing an explicit for loop to fill a dict, so I bet there is a good way to make this faster.

You can do basic micro-optimizations that also make your code more readable. A big one is to use some_dict[x] = y instead of some_dict.update({x: y}). Here are some timing differences:

In [12]: %%timeit
    ...: res = {}
    ...: for y in data:
    ...:     if y[0] not in res:
    ...:         res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    ...:     else:
    ...:         res[y[0]][1][1].update({y[3]: y[4]})
    ...:
100 loops, best of 3: 15.3 ms per loop

In [13]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1][d] = e
    ...:
100 loops, best of 3: 11 ms per loop

Note that each y[...] is a method call, which slows things down. But the biggest part of the time saved came from avoiding .update({...}), which requires creating a whole dict object for no good reason. Here is the unpacking version with .update kept:

In [18]: %%timeit
    ...: res = {}
    ...: for a,b,c,d,e in data:
    ...:     if a not in res:
    ...:         res[a] = (b, (c, {d: e}))
    ...:     else:
    ...:         res[a][1][1].update({d:e})
    ...:
100 loops, best of 3: 13.8 ms per loop
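
Outside IPython, the same comparison can be sketched with the standard timeit module. This is only an isolated micro-benchmark of the two ways of storing one key; the function names and sizes are chosen for illustration, and absolute numbers will differ between machines and Python versions.

```python
import timeit


def with_update(n=10_000):
    d = {}
    for i in range(n):
        d.update({i: i})  # builds a throwaway one-item dict on every iteration
    return d


def with_setitem(n=10_000):
    d = {}
    for i in range(n):
        d[i] = i  # stores the key directly, no temporary dict
    return d


# Both variants must produce the same dictionary.
assert with_update(1000) == with_setitem(1000)

t_update = timeit.timeit(with_update, number=20)
t_setitem = timeit.timeit(with_setitem, number=20)
print(f"update: {t_update:.4f}s   setitem: {t_setitem:.4f}s")
```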

This kind of loop is generally slow:

res = {}
for y in l:
    if y[0] not in res:
        res[y[0]] = (y[1], (y[2], {y[3]: y[4]}))
    else:
        res[y[0]][1][1].update({y[3]: y[4]}) 

because the key is looked up in the dictionary twice (once in the membership test, and once more when the entry is created or updated), on top of the if/else statement.

I would use the late binding of variables in a lambda, together with unpacking (borrowed from juanpa's answer):

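import collections
# b, c, d and e are looked up when the lambda is called (on a missing key), not when it is defined
res = collections.defaultdict(lambda: (b, (c, {d: e})))

for a, b, c, d, e in l:
    res[a][1][1][d] = e  # redundant for a freshly created entry, but avoids the if/else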

If the key isn't in the dictionary, defaultdict creates the entry using the current values of b, c, d and e (thanks to the lambda evaluating the variables when it is executed, not when it is declared), which saves the test and creates the proper value each time. The update part is then a bit redundant for a new entry, but it should still be faster because there's no if/else test.
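
As a sanity check, the following sketch compares the defaultdict version with the explicit if/else loop on a few hand-written rows. The function names and sample ids are hypothetical. Converting the result back to a plain dict is an addition made here, not part of the answer: it keeps later lookups of a missing key from silently creating an entry out of whatever b, c, d and e held last.

```python
import collections


def convert_loop(rows):
    """Reference version with the explicit membership test."""
    res = {}
    for a, b, c, d, e in rows:
        if a not in res:
            res[a] = (b, (c, {d: e}))
        else:
            res[a][1][1][d] = e
    return res


def convert_defaultdict(rows):
    """defaultdict version: the factory reads b, c, d, e at call time."""
    # The lambda closes over the loop variables below; they are only
    # looked up when a missing key triggers the factory inside the loop.
    res = collections.defaultdict(lambda: (b, (c, {d: e})))
    for a, b, c, d, e in rows:
        res[a][1][1][d] = e
    # Assumption of this sketch: return a plain dict so that missing keys
    # raise KeyError afterwards instead of being created by the factory.
    return dict(res)


# Hypothetical rows: three tuples share "c1", in no particular order.
rows = [
    ("c1", "t1", "Cat", "canFly", 1),
    ("c2", "t2", "Dog", "canFly", 1),
    ("c1", "t1", "Cat", "isHungry", 0),
    ("c1", "t1", "Cat", "isAsleep", 1),
]

assert convert_defaultdict(rows) == convert_loop(rows)
```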

This solution is faster than juanpa's (already good) answer on my machine (0.23 seconds vs 0.27 seconds). I would call that a good collaborative effort, since my first version was slower.
