Create dict from a string or list

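Background

I want to generate a hash table for a given string or a given list. The hash table uses each element as a key and the number of times it occurs as the value. For example: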

s = 'ababcd'
s = ['a', 'b', 'a', 'b', 'c', 'd']
dict_I_want = {'a':2,'b':2, 'c':1, 'd':1}

My attempts

# method 1
from collections import Counter
s = 'ababcd' 
hash_table1 = Counter(s)

# method 2
s = 'ababdc' 
hash_table2 = dict()
for i in s:
    if hash_table2.get(i) is None:
        hash_table2[i] = 1
    else:
        hash_table2[i] += 1
hash_table1 == hash_table2

True
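
Method 2 can be written more compactly with the default argument of `dict.get`, which returns the second argument when the key is missing. A minimal sketch (the function name `count_with_get` is only illustrative and does not come from the question):

```python
def count_with_get(s):
    """Count occurrences of each element of a string or list."""
    counts = {}
    for item in s:
        # get() returns 0 for unseen items, so no if/else is needed
        counts[item] = counts.get(item, 0) + 1
    return counts


print(count_with_get('ababcd'))
print(count_with_get(['a', 'b', 'a', 'b', 'c', 'd']))
```

This still makes a single pass over the input and uses no imports, which may matter on sites that restrict the standard library.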

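I usually use one of the two methods above. The first comes from the standard library, but it is not allowed on some coding-practice sites. The second is written from scratch, but I think it is too long. Using a dict comprehension, I came up with two more methods: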

{i:s.count(i) for i in set(s)}
{i:s.count(i) for i in s}
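
The two comprehensions build the same dictionary, but they do not do the same amount of work: iterating over `s` calls `count` once per element, repeats included, while iterating over `set(s)` calls it once per distinct element. The sketch below makes this visible with a small wrapper class (`CountingStr` and `calls_made` are hypothetical helpers, not part of the question):

```python
class CountingStr(str):
    """A str that records how many times count() is called (illustrative helper)."""

    calls = 0

    def count(self, *args):
        CountingStr.calls += 1
        return super().count(*args)


def calls_made(build, text):
    """Run build(s) on a CountingStr and return (result, number of count() calls)."""
    CountingStr.calls = 0
    result = build(CountingStr(text))
    return result, CountingStr.calls


from_set_result, from_set_calls = calls_made(
    lambda s: {i: s.count(i) for i in set(s)}, 'ababcd')
from_str_result, from_str_calls = calls_made(
    lambda s: {i: s.count(i) for i in s}, 'ababcd')

# 4 distinct characters versus 6 characters in total
print(from_set_calls, from_str_calls)
print(from_set_result == from_str_result)
```

Since every `count` call scans the whole input, the version that iterates over `s` repeats work for each duplicate.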

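I would like to know whether there are other clear or efficient ways to initialize the hash table from a list or string.

Speed comparison of the 4 methods I mentioned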

from collections import Counter
import random,string,numpy,perfplot

def from_set(s):
    return {i:s.count(i) for i in set(s)}

def from_string(s):
    return {i:s.count(i) for i in s}

def handy(s):
    hash_table2 = dict()
    for i in s:
        if hash_table2.get(i) is None:
            hash_table2[i] = 1
        else:
            hash_table2[i] += 1
    return hash_table2

def counter(s):
    return Counter(s)

perfplot.show(
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set,from_string,handy,counter],
    labels=['set','string','handy','counter'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

[Figure: speed comparison of the methods]

The best way is to use Counter from the standard library. Otherwise, you can use defaultdict, which is very similar to your second attempt:

from collections import defaultdict

s = 'ababcd'
d = defaultdict(int)  # missing keys start at 0 by default
for letter in s:
    d[letter] += 1
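
To illustrate the comment above: `defaultdict(int)` calls `int()` (which returns 0) the first time a missing key is accessed, so `+= 1` works without any existence check. A small sketch using the question's example string, which also compares the result with `Counter`:

```python
from collections import Counter, defaultdict

s = 'ababcd'

d = defaultdict(int)
print(d['x'])        # 0: the missing key is created with int() as its value
del d['x']           # remove the key that the lookup just created

for letter in s:
    d[letter] += 1   # first access yields 0, then 1 is added

print(dict(d))
# Compare as plain dicts so only keys and values matter
print(dict(d) == dict(Counter(s)))
```

Note that merely reading a missing key inserts it, which is why the sketch deletes `'x'` before counting.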

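I usually use Counter or defaultdict to build occurrence frequencies.

Surprisingly, the poster's from_set method outperforms both in most cases.

Observations

  1. from_set (labeled "set") performs best overall
  2. The various plain-dictionary methods are the fastest only for small string lengths (well under 100 characters)
  3. The Counter method is the fastest only over a narrow range of string lengths
  4. For large strings, from_set is 2.3 times faster than defaultdict and 1.5 times faster than Counter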
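
One caveat, offered as a hedged sketch rather than a result from the original benchmark: that benchmark draws its characters from only 36 symbols (uppercase letters and digits), and from_set scans the whole input once per distinct element. With many distinct elements its cost grows accordingly. The sketch below times from_set and Counter on two inputs of equal length; the input size and the helper name `time_both` are assumptions chosen for illustration, and the timings will vary by machine.

```python
import random
import string
import timeit
from collections import Counter


def from_set(s):
    # One full scan of s per distinct element
    return {i: s.count(i) for i in set(s)}


def time_both(data, repeats=3):
    """Return (from_set seconds, Counter seconds), best of `repeats` single runs."""
    t_set = min(timeit.repeat(lambda: from_set(data), number=1, repeat=repeats))
    t_counter = min(timeit.repeat(lambda: Counter(data), number=1, repeat=repeats))
    return t_set, t_counter


random.seed(0)
n = 2000
# At most 36 distinct values, as in the original benchmark setup
few_distinct = random.choices(string.ascii_uppercase + string.digits, k=n)
# Every value distinct: the worst case for from_set
many_distinct = list(range(n))

for name, data in [('few distinct', few_distinct), ('many distinct', many_distinct)]:
    assert from_set(data) == dict(Counter(data))   # same counts either way
    t_set, t_counter = time_both(data)
    print(f'{name}: from_set {t_set:.6f}s, Counter {t_counter:.6f}s')
```

The ranking reported above should therefore be read as specific to inputs with a small alphabet.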

Algorithms

from collections import Counter
from collections import defaultdict

import random,string,numpy,perfplot

def from_set(s):
    " Use builtin count function for each item in set "
    return {i:s.count(i) for i in set(s)}

def counter(s):
    " Use Counter from the collections module "
    return Counter(s)

def normal_dic(s):
    " Update dictionary by checking if item in it or not "
    d = {}
    for i in s:
        if i in d:
            d[i] += 1
        else:
            d[i] = 1

    return d

def setdefault_dic(s):
    " Use setdefault to preset unknown keys "
    d = {}
    for i in s:
        d.setdefault(i, 0)
        d[i] += 1

    return d

def default_dic(s):
    " Used defaultdict from collections module "
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d

def try_dic(s):
    " Use try/except to check if item in dictionary "
    d = {}
    for i in s:
        try:
            d[i] += 1
        except KeyError:  # only a missing key should start a new count
            d[i] = 1

    return d

Test code

Using the perfplot module

out = perfplot.bench(
   setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set, counter, setdefault_dic, default_dic, try_dic],
    labels=['set', 'counter', 'setdefault', 'defaultdict', 'try_dic'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
    )
out.show()
#out.save("perf.png")
out
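
Because the benchmark passes `equality_check=None`, perfplot does not verify that the kernels return the same result. A minimal, separate sanity check might look like the sketch below; it re-declares three of the kernels so that it runs on its own, and `same_counts` is a hypothetical helper name.

```python
import random
import string
from collections import Counter, defaultdict


def from_set(s):
    return {i: s.count(i) for i in set(s)}


def default_dic(s):
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d


def counter(s):
    return Counter(s)


def same_counts(kernels, s):
    """Return True if every kernel yields the same element counts for s."""
    # Convert to plain dicts so that only keys and values are compared
    results = [dict(kernel(s)) for kernel in kernels]
    return all(result == results[0] for result in results)


random.seed(0)
sample = ''.join(random.choices(string.ascii_uppercase + string.digits, k=1000))
print(same_counts([from_set, default_dic, counter], sample))
```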

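Plots

Absolute values

from_set is labeled "set" in the plot. Performance is easier to compare in the relative plot below than in this absolute plot.

[Figure: absolute timings]

Relative values

from_set is labeled "set" in the plot.

The from_set method is the horizontal line. For larger values, all other methods, including Counter and defaultdict, lie above it (they take more time).

[Figure: timings relative to from_set]

Table

Actual timings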

       n  setdefault     try_dic  defaultdict    counter    from_set
     1.0       799.0       899.0       1299.0     6099.0     1399.0
     2.0      1099.0      1199.0       1599.0     6299.0     1699.0
     4.0      1699.0      1699.0       2199.0     6299.0     2399.0
     8.0      3199.0      3099.0       3499.0     6899.0     3699.0
    16.0      6099.0      5499.0       5899.0     7899.0     5900.0
    32.0     10899.0      9299.0       9899.0     8999.0    10299.0
    64.0     20799.0     15599.0      15999.0    11899.0    15099.0
   128.0     38499.0     25499.0      25899.0    16599.0    21899.0
   256.0     73100.0     44099.0      42700.0    26299.0    30299.0
   512.0    137999.0     77099.0      72699.0    43199.0    46699.0
  1024.0    286599.0    154500.0     144099.0    85700.0    79699.0
  2048.0    549700.0    289999.0     266799.0   157499.0   145699.0
  4096.0   1103899.0    577399.0     535499.0   309399.0   278999.0
  8192.0   2200099.0   1151500.0    1051799.0   606999.0   542499.0
 16384.0   4658199.0   2534399.0    2295300.0  1414199.0  1087799.0
 32768.0   9509200.0   5270200.0    4838000.0  3066600.0  2177200.0
 65536.0  19539500.0  10806300.0    9942100.0  6503299.0  4337599.0
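
The ratios quoted in the observations can be checked against the last row of the table (n = 65536). The numbers below are copied from that row; the table does not state a time unit, so only ratios are computed:

```python
# Timings for n = 65536, copied from the last row of the table above
timings = {
    'setdefault': 19539500.0,
    'try_dic': 10806300.0,
    'defaultdict': 9942100.0,
    'counter': 6503299.0,
    'from_set': 4337599.0,
}

baseline = timings['from_set']
# How many times slower each method is than from_set, rounded to one decimal
ratios = {name: round(t / baseline, 1) for name, t in timings.items()}
print(ratios)
```

For this row the sketch gives about 2.3 for defaultdict and 1.5 for Counter, in line with the figures stated in the observations.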

