
Create dict from a string or list

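Background

I want to build a hash table from a given string or a given list. The hash table uses each element as a key and its number of occurrences as the value. For example: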

s = 'ababcd'
# or, as a list:
s = ['a', 'b', 'a', 'b', 'c', 'd']
dict_I_want = {'a': 2, 'b': 2, 'c': 1, 'd': 1}

My attempts

# method 1
from collections import Counter
s = 'ababcd' 
hash_table1 = Counter(s)

# method 2
s = 'ababdc' 
hash_table2 = dict()
for i in s:
    if hash_table2.get(i) is None:
        hash_table2[i] = 1
    else:
        hash_table2[i] += 1
hash_table1 == hash_table2

True

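I usually use one of the two methods above. The first comes from the standard library, which some coding-practice sites do not allow. The second is written from scratch, but I think it is too long. Using a dict comprehension, I came up with two more methods: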

{i:s.count(i) for i in set(s)}
{i:s.count(i) for i in s}

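I would like to know whether there are other clear or efficient ways to initialize a hash table from a list or string.

Speed comparison of the 4 methods I mentioned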

from collections import Counter
import random,string,numpy,perfplot

def from_set(s):
    return {i:s.count(i) for i in set(s)}

def from_string(s):
    return {i:s.count(i) for i in s}

def handy(s):
    hash_table2 = dict()
    for i in s:
        if hash_table2.get(i) is None:
            hash_table2[i] = 1
        else:
            hash_table2[i] += 1
    return hash_table2

def counter(s):
    return Counter(s)

perfplot.show(
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set,from_string,handy,counter],
    labels=['set','string','handy','counter'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

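(Figure: speed comparison of the methods)

The best approach is to use the built-in Counter. Otherwise, you can use defaultdict, which is very similar to your second attempt: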

from collections import defaultdict

s = 'ababcd'
d = defaultdict(int)  # missing keys start at 0 by default
for letter in s:
    d[letter] += 1
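If imports are off-limits, as on the practice sites mentioned in the question, the same loop can be written with the built-in dict.get and a default of 0. This is only a minimal sketch; the function name get_dic is made up for illustration and is not one of the benchmarked methods.

```python
def get_dic(s):
    """Count occurrences with dict.get; get_dic is a hypothetical name."""
    d = {}
    for item in s:
        # d.get(item, 0) returns 0 the first time an item is seen
        d[item] = d.get(item, 0) + 1
    return d

print(get_dic('ababcd'))                        # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
print(get_dic(['a', 'b', 'a', 'b', 'c', 'd']))  # same counts for the list form
```

It works for both strings and lists because it only relies on iteration.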

I usually use Counter or defaultdict to count occurrences.

Surprisingly, the poster's from_set method outperforms both in most cases.

Observations

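  1. from_set (labelled "set") has the best performance overall; in the timing table it is the fastest for n >= 1024.
  2. The various dictionary methods are only competitive at small string lengths (i.e. < 100); in the timing table they are the fastest only for n <= 16.
  3. The Counter method is the fastest only over a narrow range of string lengths (n from 32 to 512 in the timing table).
  4. For large strings, from_set is 2.3 times faster than defaultdict and 1.5 times faster than Counter.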
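One caveat worth keeping in mind when reading these observations: from_set calls s.count once per distinct element, and the benchmark strings are drawn from only 36 symbols. The sketch below only counts how many full scans from_set would make; the helper count_passes and the all-distinct input are assumptions added for illustration and were not part of the measurements.

```python
import random
import string

def count_passes(s):
    """Number of s.count() scans from_set performs: one per distinct element."""
    return len(set(s))

# Same alphabet as the benchmark setup: 26 uppercase letters + 10 digits
alphabet = string.ascii_uppercase + string.digits
benchmark_like = ''.join(random.choices(alphabet, k=65536))

# Hypothetical worst case: every element is unique
all_distinct = list(range(65536))

print(count_passes(benchmark_like))  # at most 36 scans of the input
print(count_passes(all_distinct))    # 65536 scans of the input
```

With few distinct elements the scans run in fast built-in code, which is consistent with the timings here; with many distinct elements the single-pass methods (Counter, defaultdict) may well come out ahead.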

Algorithms

from collections import Counter
from collections import defaultdict

import random,string,numpy,perfplot

def from_set(s):
    " Use builtin count function for each item in set "
    return {i:s.count(i) for i in set(s)}

def counter(s):
    " Uses counter module "
    return Counter(s)

def normal_dic(s):
  " Update dictionary by checking if item in it or not "
  d = {}
  for i in s:
    if i in d:
      d[i] += 1
    else:
      d[i] = 1

  return d

def setdefault_dic(s):
  " Use setdefault to preset unknown keys "
  d = {}
  for i in s:
    d.setdefault(i, 0)
    d[i] += 1

  return d

def default_dic(s):
    " Used defaultdict from collections module "
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d

def try_dic(s):
    " Use try/except to check if item in dictionary "
    d = {}
    for i in s:
        try:
            d[i] += 1
        except KeyError:  # first time this key is seen
            d[i] = 1

    return d

Test code

Using the perfplot module

out = perfplot.bench(
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set, counter, setdefault_dic, default_dic, try_dic],
    labels=['set', 'counter', 'setdefault', 'defaultdict', 'try_dic'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
    )
out.show()
#out.save("perf.png")
out
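If perfplot is not installed, a rough version of the same comparison can be made with the standard-library timeit module. This is a sketch under assumptions, not the code used to produce the numbers in this answer: the helper time_kernels and its repeat/number settings are made up for illustration, and only three of the kernels are included.

```python
import random
import string
import timeit
from collections import Counter, defaultdict

def from_set(s):
    return {i: s.count(i) for i in set(s)}

def default_dic(s):
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d

def time_kernels(n, repeat=3, number=5):
    """Return the best per-call time in seconds for each kernel (hypothetical helper)."""
    s = ''.join(random.choices(string.ascii_uppercase + string.digits, k=n))
    kernels = [('set', from_set), ('counter', Counter), ('defaultdict', default_dic)]
    results = {}
    for name, fn in kernels:
        # Take the minimum over several repeats to reduce noise
        best = min(timeit.repeat(lambda: fn(s), repeat=repeat, number=number))
        results[name] = best / number
    return results

for n in (16, 1024, 16384):
    print(n, time_kernels(n))
```

Absolute numbers will differ by machine and Python version, so only the relative ordering at each n is meaningful.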

Plots

Absolute values

from_set is labelled "set" in the plot. Performance is easier to compare on the relative plot below than on this absolute plot.

(Figure: absolute timings)

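Relative values

from_set is labelled "set" in the plot.

The from_set method is the horizontal line. For larger values, all other methods, including Counter and defaultdict, lie above it (they take more time).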

(Figure: timings relative to from_set)

Table

Actual timings

       n  setdefault     try_dic  defaultdict    counter    from_set
     1.0       799.0       899.0       1299.0     6099.0     1399.0
     2.0      1099.0      1199.0       1599.0     6299.0     1699.0
     4.0      1699.0      1699.0       2199.0     6299.0     2399.0
     8.0      3199.0      3099.0       3499.0     6899.0     3699.0
    16.0      6099.0      5499.0       5899.0     7899.0     5900.0
    32.0     10899.0      9299.0       9899.0     8999.0    10299.0
    64.0     20799.0     15599.0      15999.0    11899.0    15099.0
   128.0     38499.0     25499.0      25899.0    16599.0    21899.0
   256.0     73100.0     44099.0      42700.0    26299.0    30299.0
   512.0    137999.0     77099.0      72699.0    43199.0    46699.0
  1024.0    286599.0    154500.0     144099.0    85700.0    79699.0
  2048.0    549700.0    289999.0     266799.0   157499.0   145699.0
  4096.0   1103899.0    577399.0     535499.0   309399.0   278999.0
  8192.0   2200099.0   1151500.0    1051799.0   606999.0   542499.0
 16384.0   4658199.0   2534399.0    2295300.0  1414199.0  1087799.0
 32768.0   9509200.0   5270200.0    4838000.0  3066600.0  2177200.0
 65536.0  19539500.0  10806300.0    9942100.0  6503299.0  4337599.0
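The 2.3 and 1.5 speedup figures quoted in the observations can be checked against the last row of this table (n = 65536). The short sketch below just copies those three values and divides; the variable names are chosen for illustration.

```python
# Timings copied from the n = 65536 row of the table above
defaultdict_time = 9942100.0
counter_time = 6503299.0
from_set_time = 4337599.0

# Ratio > 1 means from_set is faster by that factor
print(round(defaultdict_time / from_set_time, 1))  # 2.3
print(round(counter_time / from_set_time, 1))      # 1.5
```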

