Sub-dictionary from dictionary: create another dictionary with a dict comprehension, or delete keys from a copy of the original dictionary

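I have a big dictionary (3000+ key-value pairs), and I need to constantly generate sub-dictionaries from it (there is no way around that). Most of the time, each sub-dictionary contains around 300-600 key-value pairs.

I would like to know which approach is more time efficient (memory efficiency is not important here):

  • Option 1: create a copy of the original dictionary, and delete the unnecessary keys

  • Option 2: create an empty dictionary, and fill it with the necessary keys using a dict comprehension.

I guess option 2 makes more sense, since we are selecting a few keys (300-600) rather than deleting a lot of them (2000+), but I don't know what happens behind the scenes in either case, and there are some counterintuitive cases.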
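
For concreteness, the two options could look roughly like the sketch below. The names `big`, `wanted`, `option_1` and `option_2` are placeholders made up for illustration, as is the sample data; the real dictionary and selection will differ.

```python
def option_1(big, wanted):
    """Option 1: copy the whole dict, then delete every unwanted key."""
    result = big.copy()
    for key in big:              # iterate the original, mutate only the copy
        if key not in wanted:
            del result[key]
    return result


def option_2(big, wanted):
    """Option 2: build a new dict that holds only the wanted keys."""
    return {key: value for key, value in big.items() if key in wanted}


# Hypothetical data: 3000 keys, of which every 7th one is selected.
big = {i: str(i) for i in range(3000)}
wanted = set(range(0, 3000, 7))

# Both options produce the same sub-dictionary.
assert option_1(big, wanted) == option_2(big, wanted)
```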

Thanks

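A bit of feedback on the two other answers: IMHO neither of them is accurate, because of the way they select keys.

They effectively treat the dictionary as a list and keep the first n items. If your data is laid out like that, you probably don't need a dictionary at all (a sorted list would do).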

About option 2

I bet you are performing the lookup like this:

new_dict = {k:v for k,v in parent_dict.items() if k in selections}

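In that case, performance degrades as len(selections) grows. With a list of selections this is effectively O(m*n) (where m is the size of parent_dict and n is the number of selections); one way to mitigate it somewhat is to use a set instead of a list for the selections.

Because the scan over the n selections does not always have to run over all n entries (it stops as soon as the key is found), the cost may be closer to O((n**2-n)/2), where (n**2-n)/2 is the sum of the integers from 1 to n-1. A key that is absent from the selections still costs a full scan of n.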
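
To see the list-versus-set effect in isolation, something like the following rough sketch can be run. The sizes, the `pick` helper and the variable names are assumptions chosen for illustration; absolute timings will vary by machine.

```python
import timeit

# Hypothetical sizes in the range mentioned in the question.
parent_dict = {i: 0 for i in range(3000)}
selections_list = list(range(0, 3000, 6))   # 500 selected keys
selections_set = set(selections_list)


def pick(selections):
    # `k in selections` is a linear scan for a list, a hash lookup for a set.
    return {k: v for k, v in parent_dict.items() if k in selections}


# Same result either way; only the cost of the membership test differs.
assert pick(selections_list) == pick(selections_set)

t_list = timeit.timeit(lambda: pick(selections_list), number=20)
t_set = timeit.timeit(lambda: pick(selections_set), number=20)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```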

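On the other hand, the copy-and-delete loop in option 1 is O(2m-n):

O(m) for copying the dict and O(m-n) for deleting the keys (assuming the membership test itself is O(1), as it is with a set).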
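
One way to make the deletion step touch exactly the m-n unwanted keys is to compute them up front with a set difference on the keys view. This is only a sketch; `copy_and_delete` is a name made up here, and it is not the code that was benchmarked.

```python
def copy_and_delete(parent_dict, selections):
    keep = set(selections)                  # O(n) to build
    result = parent_dict.copy()             # O(m) copy
    # Keys views support set operations; this yields the m - n keys to drop.
    for key in parent_dict.keys() - keep:
        del result[key]
    return result


sample = {"a": 1, "b": 2, "c": 3, "d": 4}   # hypothetical data
print(copy_and_delete(sample, ["a", "c"]))
```

The surviving keys keep their original insertion order, because they were never removed from the copy.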

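There are some non-vanilla solutions (lodash comes to mind) that are highly optimized and might be faster, but looking strictly at time complexity, with a realistic key selection it seems that deleting should be faster.

Which option is best probably depends on the sizes of m and n.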


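I just had a blonde moment, but I want to leave my ignorance above for any future passers-by.

You should do this: new_dict = {k: parent_dict[k] for k in selections}

You don't pay the O(m) penalty for copying the dict, adding an item to a dict is O(1), and looping over the selections is O(n), so you are sitting at O(2n), i.e. O(n) -- I don't think anything can be faster than that.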
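
One caveat worth sketching: unlike the filtering comprehension, the key comprehension raises KeyError if a selected key is missing from parent_dict. The data below is made up for illustration, and whether the tolerant variant is appropriate depends on whether missing keys are expected in your case.

```python
parent_dict = {"a": 1, "b": 2, "c": 3}
selections = ["c", "a", "zzz"]        # "zzz" is deliberately absent

# Strict version: fails on the first key that parent_dict does not contain.
missing = None
try:
    strict = {k: parent_dict[k] for k in selections}
except KeyError as exc:
    missing = exc.args[0]

# Tolerant version: one extra O(1) membership test per selected key.
tolerant = {k: parent_dict[k] for k in selections if k in parent_dict}
print(missing, tolerant)
```

Note that the result follows the order of selections, not the order of parent_dict.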


OK, here is the code:

import random
import timeit

m = 3_000
n0 = 300
n1 = 600

trials = 10_000

def remove(parent_dict, kept_keys):
    result = parent_dict.copy()
    for key in parent_dict.keys():
        if key not in kept_keys:
            del result[key]
    return result

def add_dict_comprehension(parent_dict, kept_keys):
    return {k:v for k,v in parent_dict.items() if k in kept_keys}

def add_key_comprehension(parent_dict, kept_keys):
    return {k:parent_dict[k] for k in kept_keys}

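# Same body as add_dict_comprehension. The set-based variant in test_functions
# is produced by the preprocessor, so this function is never actually called.
def add_dict_comprehension_set(parent_dict, kept_keys):
    return {k:v for k,v in parent_dict.items() if k in kept_keys}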

def generate_test_data(m, n0, n1):
    parent_dict = {}
    for key in range(m):
        parent_dict[key] = 0

    target_keys = random.sample(range(m), random.randint(n0, n1))

    return parent_dict, target_keys

test_functions = {
    'remove': (remove, lambda x: x),
    'add_dict_comprehension': (add_dict_comprehension, lambda x: x),
    'add_key_comprehension': (add_key_comprehension, lambda x: x),
    'add_dict_comprehension_set': (add_dict_comprehension, lambda keys: set(keys)),
    'remove_set': (remove, lambda keys: set(keys)),
}

results = {k:[] for k in test_functions.keys()}

for _ in range(trials):
    parent_dict, target_keys = generate_test_data(m, n0, n1)


    for k,(func, preprocessor) in test_functions.items():

        # Not timing the preprocessing, assuming this is done on construction
        processed_keys = preprocessor(target_keys)
        
        starttime = timeit.default_timer()
        func(parent_dict, processed_keys)
        results[k].append(timeit.default_timer() - starttime)


def avg(lst):
    return sum(lst)/len(lst)

def stddev(lst):
    mean = avg(lst)
    r2 = sum([(i-mean)**2 for i in lst])
    bias = len(lst) - 1
    return (r2/bias)**.5

result_data = {k: {'avg': avg(v), 'stddev': stddev(v)} for k,v in results.items()}

title_len = max(len(i) for i in result_data.keys())
for (name, data) in sorted(result_data.items(), key = lambda item: item[1]['avg']):
    print(f"{name}: {' ' * (title_len - len(name))} Avg. {data['avg']} [Stddev: {data['stddev']}]")

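As expected, the key comprehension is the fastest, and for both option 1 and option 2 using a set instead of a list is faster: sets have an average lookup time of O(1), so you avoid most of the search cost of a list.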

add_key_comprehension:       Avg. 8.616499952040612e-05 [Stddev: 4.227672839185718e-05]
add_dict_comprehension_set:  Avg. 0.0001553589993272908 [Stddev: 3.535997011243502e-05]
remove_set:                  Avg. 0.00021654199925251306 [Stddev: 0.00017905888806226246]
add_dict_comprehension:      Avg. 0.013669110001646913 [Stddev: 0.0031769500128693085]
remove:                      Avg. 0.0138856839996879 [Stddev: 0.004191406076651088]
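
If you want to reproduce the comparison with less per-call timer noise, timeit.repeat can run each candidate many times and report the best batch. This is a reduced sketch rather than the script above; the fixed n of 450, the batch sizes and the candidate names are assumptions picked for illustration.

```python
import random
import timeit

m, n = 3_000, 450                       # assumed sizes, within the 300-600 range
parent_dict = dict.fromkeys(range(m), 0)
kept = random.sample(range(m), n)
kept_set = set(kept)

candidates = {
    "key_comprehension": lambda: {k: parent_dict[k] for k in kept},
    "dict_comprehension_set": lambda: {
        k: v for k, v in parent_dict.items() if k in kept_set
    },
}

calls = 200
# Best of 5 batches, divided by the batch size, gives seconds per call.
best = {
    name: min(timeit.repeat(fn, number=calls, repeat=5)) / calls
    for name, fn in candidates.items()
}

for name, seconds in sorted(best.items(), key=lambda item: item[1]):
    print(f"{name:<24} {seconds:.3e} s per call")
```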

Option 2 looks fast enough - see below

import random
import string
import timeit

letters = string.ascii_lowercase


def _get_word():
    return ''.join(random.choice(letters) for i in range(10))

N = 10000
CHUNK = 400
ITERATIONS = 30
# dummy dict with N entries
data = {_get_word(): _get_word() for _ in range(0, N)}


for _ in range(0,ITERATIONS):
    starttime = timeit.default_timer()

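    sub_dicts = []
    temp = {}
    # count from 1 so that every chunk holds exactly CHUNK items
    for idx,(k,v) in enumerate(data.items(), start=1):
        temp[k] = v
        if idx % CHUNK == 0:
            sub_dicts.append(temp)
            temp = {}
    # keep the final, possibly shorter, chunk
    if temp:
        sub_dicts.append(temp)
    print("Option 2: The time difference is :", timeit.default_timer() - starttime)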

output (elapsed time, in seconds)

Option 2: The time difference is : 0.003928793012164533
Option 2: The time difference is : 0.006472171982750297
Option 2: The time difference is : 0.006867899966891855
Option 2: The time difference is : 0.007105670985765755
Option 2: The time difference is : 0.004680399026256055
Option 2: The time difference is : 0.0034195330226793885
Option 2: The time difference is : 0.0034254349884577096
Option 2: The time difference is : 0.003625455021392554
Option 2: The time difference is : 0.003444572037551552
Option 2: The time difference is : 0.003441892040427774
Option 2: The time difference is : 0.0038558689993806183
Option 2: The time difference is : 0.00340640899958089
Option 2: The time difference is : 0.004021176020614803
Option 2: The time difference is : 0.004761706979479641
Option 2: The time difference is : 0.0039043189608491957
Option 2: The time difference is : 0.0035160399856977165
Option 2: The time difference is : 0.003446961985900998
Option 2: The time difference is : 0.0038189300103113055
Option 2: The time difference is : 0.003589348983950913
Option 2: The time difference is : 0.0033721639774739742
Option 2: The time difference is : 0.0033731560106389225
Option 2: The time difference is : 0.0033843390410766006
Option 2: The time difference is : 0.003322889970149845
Option 2: The time difference is : 0.0034315589582547545
Option 2: The time difference is : 0.003400936024263501
Option 2: The time difference is : 0.0032910890295170248
Option 2: The time difference is : 0.003319227951578796
Option 2: The time difference is : 0.003392132988665253
Option 2: The time difference is : 0.0032848399714566767
Option 2: The time difference is : 0.003325468976981938
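
If the goal really is to cut a dict into consecutive chunks, itertools.islice is one way to write it without manual index bookkeeping. The sketch below is an alternative illustration, not the code that produced the timings above; `chunked` is a hypothetical helper and the sample data is made up.

```python
from itertools import islice


def chunked(d, size):
    """Yield consecutive sub-dicts of at most `size` items each."""
    items = iter(d.items())
    while True:
        chunk = dict(islice(items, size))   # take the next `size` items
        if not chunk:                       # iterator exhausted
            return
        yield chunk


data = {i: i * i for i in range(10)}        # hypothetical data
parts = list(chunked(data, 4))
print([len(p) for p in parts])
```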

Option 2 is faster:

import timeit

# create a dummy dict
dict1 = {}
for key in range(30000):
    dict1[key] = 0


# both options keep the first 4500 keys

# option 1 :
starttime = timeit.default_timer()
dict_op1 = dict1.copy()
for key, value in dict1.items():
    if key >= 4500:
        del dict_op1[key]
op1_time = timeit.default_timer() - starttime

# option 2 :
starttime = timeit.default_timer()
dict_op2 = {key: value for key, value in dict1.items() if key < 4500}
op2_time = timeit.default_timer() - starttime

print("option1 time:" + str(op1_time))
print("option2 time:" + str(op2_time))

output:

option1 time:0.0025474
option2 time:0.0009779

