Adding new keys to a dictionary while incrementing existing values

Question

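I'm processing a CSV file and counting the unique values of column 4. So far I have coded this three ways: one uses "if key in dictionary", the second traps the KeyError, and the third uses a default dictionary (collections.defaultdict). For example (where x[3] is the value from the file and "a" is a dictionary):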

First way:

if x[3] in a:
    a[x[3]] += 1
else:
    a[x[3]] = 1

Second way:

try:
    b[x[3]] += 1
except KeyError:
    b[x[3]] = 1

Third way:

from collections import defaultdict
c = defaultdict(int)
c[x[3]] += 1

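My question is: which way is more efficient... cleaner... better... etc.? Or is there a better way? All three ways work and give the same answer, but I thought I'd tap the hive mind as a learning case.

Thanks -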

Answer 1

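Use collections.Counter. Counter is syntactic sugar for defaultdict(int), but what's cool about it is that it accepts an iterable in the constructor, which saves an extra step (I assume all of your examples above are wrapped in a for loop).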

from collections import Counter
count = Counter(x[3] for x in my_csv_reader)
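
As a self-contained illustration of the one-liner above, here is a minimal sketch that counts the fourth column of a small in-memory CSV. The sample data and the name `sample_csv` are made up for the example; `my_csv_reader` stands in for whatever `csv.reader` object the real script creates.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; a real script would open a file instead.
sample_csv = io.StringIO(
    "1,a,x,red\n"
    "2,b,y,blue\n"
    "3,c,z,red\n"
)

my_csv_reader = csv.reader(sample_csv)

# Counter consumes the generator and tallies the fourth column (index 3).
count = Counter(row[3] for row in my_csv_reader)

print(count["red"])    # 2
print(count["blue"])   # 1
print(count["green"])  # 0: a missing key reads as zero and is not inserted
```

Because a missing key simply reads as zero, no membership test or try/except is needed when looking counts up afterwards.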

Before collections.Counter was introduced, collections.defaultdict was the most idiomatic choice for this task, so users of Python < 2.7 should use defaultdict.

from collections import defaultdict
count = defaultdict(int)
for x in my_csv_reader:
    count[x[3]] += 1

Answer 2

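You asked which is more efficient. Assuming that you are talking about execution speed: if your data is small, it doesn't matter. If it is large and typical, the "already exists" case will happen much more often than the "not in dict" case. This observation explains some of the results.

Below is some code that can be used with the timeit module to explore speed without the file-reading overhead. I have taken the liberty of adding a 5th method, which is not uncompetitive and will run on any Python from at least 1.5.2 [tested] onwards.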

from collections import defaultdict, Counter

def tally0(iterable):
    # DOESN'T WORK -- common base case for timing
    d = {}
    for item in iterable:
        d[item] = 1
    return d

def tally1(iterable):
    d = {}
    for item in iterable:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
    return d

def tally2(iterable):
    d = {}
    for item in iterable:
        try:
            d[item] += 1
        except KeyError:
            d[item] = 1
    return d

def tally3(iterable):
    d = defaultdict(int)
    for item in iterable:
        d[item] += 1
    return d

def tally4(iterable):
    d = Counter()
    for item in iterable:
        d[item] += 1
    return d

def tally5(iterable):
    d = {}
    dg = d.get
    for item in iterable:
        d[item] = dg(item, 0) + 1
    return d

Typical run (in a Windows XP "Command Prompt" window):

prompt>\python27\python -mtimeit -s"t=1000*'now is the winter of our discontent made glorious summer by this son of york';import tally_bench as tb" "tb.tally1(t)"
10 loops, best of 3: 29.5 msec per loop

Here are the results (msec per loop):

0 base case   13.6
1 if k in d   29.5
2 try/except  26.1
3 defaultdict 23.4
4 Counter     79.4
5 d.get(k, 0) 29.2

Another timing trial:

prompt>\python27\python -mtimeit -s"from collections import defaultdict;d=defaultdict(int)" "d[1]+=1"
1000000 loops, best of 3: 0.309 usec per loop

prompt>\python27\python -mtimeit -s"from collections import Counter;d=Counter()" "d[1]+=1"
1000000 loops, best of 3: 1.02 usec per loop

Counter's slowness may be due to it being implemented partly in Python code, whereas defaultdict is entirely in C (in 2.7, at least).
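
The figures above come from Python 2.7 on one machine, so they are worth re-measuring on your own interpreter. The following is a rough sketch of how the comparison could be driven from inside Python with `timeit.repeat` instead of the command line. The function names are illustrative stand-ins for the `tally` functions above, and the timings it prints will differ from the table; the ranking may not carry over to other Python versions.

```python
import timeit
from collections import Counter, defaultdict

def tally_in(iterable):
    d = {}
    for item in iterable:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
    return d

def tally_try(iterable):
    d = {}
    for item in iterable:
        try:
            d[item] += 1
        except KeyError:
            d[item] = 1
    return d

def tally_defaultdict(iterable):
    d = defaultdict(int)
    for item in iterable:
        d[item] += 1
    return d

def tally_get(iterable):
    d = {}
    dg = d.get
    for item in iterable:
        d[item] = dg(item, 0) + 1
    return d

def tally_counter(iterable):
    return Counter(iterable)

# Same test string as the command-line run above: many repeats of few keys.
text = 1000 * "now is the winter of our discontent made glorious summer by this son of york"
candidates = [tally_in, tally_try, tally_defaultdict, tally_get, tally_counter]

# All approaches must agree before their speeds are worth comparing.
expected = tally_in(text)
for func in candidates:
    assert dict(func(text)) == expected

for func in candidates:
    # Best of 3 repeats of 10 calls each, reported as msec per call.
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3)) / 10
    print(f"{func.__name__:18s} {best * 1000:8.2f} msec")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that the command-line `timeit` run reports.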

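Note that Counter() is not just "syntactic sugar" for defaultdict(int): it implements a full bag (also known as multiset) object. See the docs for details; they may save you from reinventing the wheel if you need some fancy post-processing. If all you want to do is count things, use defaultdict.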
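
To give a flavour of what the bag/multiset behaviour buys you, here is a small sketch using a few standard Counter operations. The sample strings are arbitrary and chosen only so the results are easy to check by hand.

```python
from collections import Counter

letters = Counter("banana")            # b: 1, a: 3, n: 2

# most_common(n) returns the n highest counts as (element, count) pairs.
print(letters.most_common(2))          # [('a', 3), ('n', 2)]

# Counters also support multiset arithmetic.
a = Counter("aab")                     # a: 2, b: 1
b = Counter("abc")                     # a: 1, b: 1, c: 1

print(a + b == Counter(a=3, b=2, c=1)) # True: counts are added
print(a - b == Counter(a=1))           # True: only positive counts are kept
print(a & b == Counter(a=1, b=1))      # True: minimum of each count
print(a | b == Counter(a=2, b=1, c=1)) # True: maximum of each count
```

With a plain dict or defaultdict, each of these operations would need a hand-written loop.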

Update in response to a question from @Steven Rumbalski: """I'm curious, what happens if you move the iterable into the Counter constructor: d = Counter(iterable)? (I have python 2.6 and cannot test it.)"""

tally6: just does d = Counter(iterable); return d, takes 60.0 msec

You could look at the source (collections.py in the SVN repository) ... here's what my Python27\Lib\collections.py does when iterable is not a Mapping instance:

            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1

Seen that code anywhere before? There's a whole lot of carry-on just to call code that's runnable in Python 1.5.2 :-O

Answer 3

from collections import Counter
Counter(a)

Answer 4

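Since you don't have access to Counter, your best bet is your third approach. It's much cleaner and easier to read. In addition, it doesn't have the perpetual testing (and branching) that the first two approaches have, which makes it more efficient.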
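
The reason no explicit test is needed can be seen in a short sketch of how defaultdict(int) treats missing keys. The colour names are arbitrary sample values.

```python
from collections import defaultdict

c = defaultdict(int)

# Reading a missing key calls the factory (int() returns 0) and stores the
# result, so "+= 1" works on first sight without a membership test.
c["red"] += 1
c["red"] += 1
c["blue"] += 1
print(dict(c))            # {'red': 2, 'blue': 1}

# Side effect worth knowing: merely looking up a key inserts it.
_ = c["green"]
print("green" in c)       # True

# Membership tests and .get() do not trigger the factory.
print("purple" in c)      # False
print(c.get("purple"))    # None
```

If the silent insertion on lookup is unwanted once counting is finished, converting the result with dict(c) gives an ordinary dictionary that raises KeyError again.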

Answer 5

Use setdefault.

a[x[3]] = a.setdefault(x[3], 0) + 1

setdefault gets the value of the specified key (x[3] in this case) or, if the key does not exist, the specified default value (0 in this case).
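
A minimal sketch of the line above inside a loop, with made-up sample values in place of the CSV column. It also shows one detail the description leaves out: when the key is missing, setdefault stores the default in the dictionary as well as returning it.

```python
a = {}
for value in ["red", "blue", "red"]:
    # setdefault returns the existing count, or stores and returns 0 for a
    # new key; the assignment then writes back the incremented count.
    a[value] = a.setdefault(value, 0) + 1

print(a)  # {'red': 2, 'blue': 1}

# Unlike get(), setdefault inserts the default when the key is missing.
b = {}
b.get("x", 0)
print(b)  # {}
b.setdefault("x", 0)
print(b)  # {'x': 0}
```

Since the result is assigned back to the same key anyway, a.get(value, 0) + 1 gives the same counts here without the intermediate insert.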

5 solutions

Solution 1: 6, accepted, 2010-10-27 18:42:50
Solution 2: 6, 2010-10-27 20:37:34
Solution 3: 1, 2010-10-27 18:34:44
Solution 4: 0, 2010-10-27 18:39:14
Solution 5: 0, 2010-10-27 23:00:13
