How to efficiently count co-occurrences in a Python list of lists

Question

I have a relatively large (about 3 GB, 3 million entries) list of sublists, where each sublist contains a set of tags. Here is a very simplified example:

from collections import Counter

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

unique_tags = ['dog', 'cat', 'fish']
co_occurences = {key: Counter() for key in unique_tags}

for tags in tag_corpus:
    tallies = Counter(tags)
    for key in tags:
        co_occurences[key] = co_occurences[key] + tallies

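This works like a charm, but it is very slow on the real data set, which has large sublists (around 30K unique tags in total). Do any Python pros know how to speed this up?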

Answer 1

This might be faster. You will have to measure.

from collections import Counter
from collections import defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

co_occurences = defaultdict(Counter)
for tags in tag_corpus:
    for key in tags:
        co_occurences[key].update(tags)
unique_tags = sorted(co_occurences)

print(co_occurences)
print(unique_tags)
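One plausible reason the loop in the question is slow is that `co_occurences[key] + tallies` builds a brand-new Counter on every iteration, copying everything accumulated so far, whereas `update` adds into the existing Counter in place. The sketch below only checks, on the toy corpus, that both approaches produce the same counts; the function names `with_addition` and `with_update` are made up for illustration.

```python
from collections import Counter, defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

def with_addition(corpus):
    # Approach from the question: Counter + Counter allocates a new Counter each time.
    unique_tags = {tag for tags in corpus for tag in tags}
    result = {key: Counter() for key in unique_tags}
    for tags in corpus:
        tallies = Counter(tags)
        for key in tags:
            result[key] = result[key] + tallies
    return result

def with_update(corpus):
    # Approach from this answer: update() adds counts into the existing Counter.
    result = defaultdict(Counter)
    for tags in corpus:
        for key in tags:
            result[key].update(tags)
    return result

# Both strategies should agree on every tag.
assert with_addition(tag_corpus) == with_update(tag_corpus)
print(with_update(tag_corpus)['cat'])
```

Note that each tag also counts itself here (for example `'cat'` co-occurs with `'cat'` three times), exactly as in the original loop.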

Answer 2

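I was just fiddling around, not expecting to end up with something more efficient, but with 100000 cats, dogs and fish this is considerably faster, timing in at 0.07 seconds instead of 1.25 seconds.

I tried to end up with a shorter solution, but this turned out to be the fastest, even though it looks very simple :)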

occurances = {}
for tags in tag_corpus:
    for key in tags:
        for key2 in tags:
            try:
                occurances[key][key2] += 1
            except KeyError:
                try:
                    occurances[key][key2] = 1
                except KeyError:
                    occurances[key] = {key2: 1}
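To reproduce a comparison like this on your own machine, something along the following lines could work. It is only a sketch: the synthetic corpus (100000 identical three-tag sublists) and the function names `counter_update`, `nested_dicts` and `to_plain` are assumptions for illustration, and the timings will differ from the figures quoted above.

```python
import timeit
from collections import Counter, defaultdict

# Synthetic corpus (an assumption): 100000 sublists of cats, dogs and fish.
tag_corpus = [['cat', 'dog', 'fish']] * 100000

def counter_update(corpus):
    # Counter-based version from Answer 1.
    result = defaultdict(Counter)
    for tags in corpus:
        for key in tags:
            result[key].update(tags)
    return result

def nested_dicts(corpus):
    # Plain-dict version from this answer.
    occurances = {}
    for tags in corpus:
        for key in tags:
            for key2 in tags:
                try:
                    occurances[key][key2] += 1
                except KeyError:
                    try:
                        occurances[key][key2] = 1
                    except KeyError:
                        occurances[key] = {key2: 1}
    return occurances

def to_plain(result):
    # Normalise to plain dicts so the two results can be compared directly.
    return {key: dict(counts) for key, counts in result.items()}

# Check that both versions agree before timing them.
assert to_plain(counter_update(tag_corpus)) == nested_dicts(tag_corpus)

for func in (counter_update, nested_dicts):
    seconds = min(timeit.repeat(lambda: func(tag_corpus), number=1, repeat=3))
    print('%s: %.3f s' % (func.__name__, seconds))
```

Taking the minimum of a few repeats reduces noise from other processes; the relative ordering may change with the Python version and with the shape of the real data.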

Answer 3

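You could try using a defaultdict inside a dict comprehension, or a nested defaultdict that avoids initializing anything up front, while keeping the logic of Peter's answer. The runtime is significantly faster: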

In [35]: %%timeit
   ....: co_occurences = defaultdict(Counter)
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         co_occurences[key].update(tags)
   ....: 
1 loop, best of 3: 513 ms per loop

In [36]: %%timeit
   ....: occurances = {k1: defaultdict(int) for k1 in unique_tags}
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         for key2 in tags:
   ....:             occurances[key][key2] += 1
   ....: 
10 loops, best of 3: 65.7 ms per loop

In [37]: %%timeit
   ....: co_occurences = {key:Counter() for key in unique_tags}
   ....: for tags in tag_corpus: 
   ....:     tallies = Counter(tags)
   ....:     for key in tags: 
   ....:         co_occurences[key] = co_occurences[key] + tallies
   ....: 
1 loop, best of 3: 1.13 s per loop

In [38]: %%timeit
   ....: occurances = defaultdict(lambda: defaultdict(int))
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         for key2 in tags:
   ....:             occurances[key][key2] += 1
   ....: 
10 loops, best of 3: 66.5 ms per loop
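One thing to keep in mind with the nested defaultdict is that merely reading a missing key inserts it. A minimal sketch, assuming you want a read-only result once counting is done; the name `frozen` is hypothetical:

```python
from collections import defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

occurances = defaultdict(lambda: defaultdict(int))
for tags in tag_corpus:
    for key in tags:
        for key2 in tags:
            occurances[key][key2] += 1

# Convert to plain dicts so later lookups cannot create entries.
frozen = {key: dict(counts) for key, counts in occurances.items()}
print(frozen['cat'])

# Missing tags read as 0 without inserting anything.
print(frozen.get('bird', {}).get('cat', 0))

# By contrast, the same lookup on the defaultdict adds a 'bird' entry.
before = len(occurances)
occurances['bird']['cat']
assert len(occurances) == before + 1
```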

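At least in Python 2, a Counter is not actually the fastest way to get counts, whereas a defaultdict is fast even when its factory is a lambda.

Even rolling your own count function will be faster: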

def count(x):
    d = defaultdict(int)
    for ele in x:
        d[ele] += 1
    return d
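As a quick sanity check, the hand-rolled `count` should agree with `collections.Counter` on the same input. A minimal sketch, repeating the function so the block runs on its own; the sample list `tags` is made up:

```python
from collections import Counter, defaultdict

def count(x):
    # Tally each element with a defaultdict instead of a Counter.
    d = defaultdict(int)
    for ele in x:
        d[ele] += 1
    return d

tags = ['fish', 'dog', 'cat', 'fish']
tallies = count(tags)

# Compare as plain dicts to sidestep any subclass-specific equality rules.
assert dict(tallies) == dict(Counter(tags))
print(dict(tallies))
```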

Not the fastest, but still pretty fast:

In [42]: %%timeit
   ....: co_occurences = {key: defaultdict(int) for key in unique_tags}
   ....: for tags in tag_corpus:
   ....:     tallies = count(tags)
   ....:     for key in tags:
   ....:         for k, v in tallies.items():
   ....:             co_occurences[key][k] += v
   ....: 

10 loops, best of 3: 164 ms per loop

If you want more of a speedup, a little Cython would probably go a long way.

3 answers

Answer 1
1 accepted 2016-04-15 22:26:13

Answer 2
1 2016-04-15 22:49:59

Answer 3
1 2016-04-15 23:54:04
