Fastest way to get the probability of each element occurring in a list

Question

I wrote a function that returns a dictionary mapping each element of the input list to the probability of picking that element from the list:

from collections import Counter

def proba(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}

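Is there a faster solution? I don't need the output to be key:value pairs if the order of the output probabilities corresponds to the order of the input elements.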

Answer 1

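In a comment on Eelco's answer, you wrote

If the input is np.random.randint(low=0, high=100, size=50000) ...

numpy_indexed has some powerful tools, but for data like this (small non-negative integers) you can get better performance with numpy.bincount: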

In [11]: import numpy as np

In [12]: import numpy_indexed as npi

In [13]: x = np.random.randint(low=0, high=100, size=50000)

Here is the calculation using numpy.bincount. The result is an array of length x.max()+1, where index i holds the probability of the value i.

In [14]: np.bincount(x)/len(x)
Out[14]: 
array([ 0.01066,  0.01022,  0.01048,  0.00994,  0.01026,  0.00972,
        0.0107 ,  0.00962,  0.0098 ,  0.00922,  0.00996,  0.01038,
        0.01024,  0.01118,  0.01012,  0.01098,  0.00988,  0.00996,
        0.00974,  0.0097 ,  0.00994,  0.01004,  0.0099 ,  0.01034,
        0.01066,  0.01032,  0.01042,  0.00896,  0.00958,  0.01008,
        0.01038,  0.00974,  0.01068,  0.00952,  0.00998,  0.00932,
        0.00964,  0.0103 ,  0.0099 ,  0.0093 ,  0.0101 ,  0.01012,
        0.0097 ,  0.00988,  0.0099 ,  0.01076,  0.01008,  0.0097 ,
        0.00986,  0.00998,  0.00976,  0.00984,  0.01008,  0.01008,
        0.00938,  0.00998,  0.00976,  0.0093 ,  0.00974,  0.00958,
        0.00984,  0.01032,  0.00988,  0.01014,  0.01088,  0.01006,
        0.0097 ,  0.01026,  0.00952,  0.01002,  0.00938,  0.01024,
        0.01024,  0.00984,  0.00922,  0.01044,  0.0101 ,  0.01052,
        0.01002,  0.00996,  0.0101 ,  0.00976,  0.00986,  0.01062,
        0.01064,  0.01008,  0.00992,  0.00972,  0.01006,  0.01026,
        0.01018,  0.01044,  0.0092 ,  0.00982,  0.00994,  0.00958,
        0.00958,  0.01012,  0.01024,  0.00996])
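To make the length claim concrete, here is a minimal sketch on a tiny hand-made array (the name `small` is introduced only for illustration). A value that never occurs still gets a slot, with probability 0, and np.bincount only accepts non-negative integers.

```python
import numpy as np

# Illustrative input: the value 2 never occurs, and the maximum is 3.
small = np.array([0, 1, 1, 3])

counts = np.bincount(small)   # one slot per integer in 0..small.max()
probs = counts / len(small)   # relative frequency of each integer

print(len(probs))  # equals small.max() + 1
print(probs)       # index i holds the probability of the value i
```

If the data contains a few very large integers, the result array becomes correspondingly long, so this approach fits best when the values are densely packed in a small range.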

Here is a timing comparison; note the change in units between the two results (ms versus µs):

In [24]: %timeit npi.count(x)[1]/len(x)
1.35 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [25]: %timeit np.bincount(x)/len(x)
76.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
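The question also mentions that key:value pairs are not needed as long as the probabilities line up with the input order. One way to get that from the bincount result is to index it with x itself. This is only a sketch under the same assumption as above (x holds small non-negative integers); `per_element` is a name introduced here.

```python
import numpy as np

x = np.random.randint(low=0, high=100, size=50000)

probs = np.bincount(x) / len(x)  # probs[v] is the probability of the value v
per_element = probs[x]           # per_element[i] is the probability of x[i]
```

The fancy-indexing step is a single vectorized lookup, so it adds little on top of the bincount call.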

Answer 2

This method beat yours 97.6% of the time:

def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d

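Not by a large margin, though (an average difference of about 0.006 over 1000 runs). Essentially, your code is already algorithmically optimal, since it is O(n); all that is left is micro-optimization.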
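One thing to keep in mind: proba_2 adds 1/n once per occurrence, whereas the original divides the final count by n, so the two results can differ in the last floating-point digits. Below is a small sketch for checking that they agree up to rounding; the `sample` list and the `math.isclose` comparison are additions for illustration.

```python
from collections import Counter
import math

def proba_1(x):
    n = len(x)
    return {key: val / n for key, val in dict(Counter(x)).items()}

def proba_2(x):
    n = len(x)
    single_prob = 1 / n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d

# Illustrative data; any list of hashable items works.
sample = [1, 2, 2, 3, 3, 3, 3]
a, b = proba_1(sample), proba_2(sample)

same_keys = a.keys() == b.keys()
# Compare with a tolerance rather than ==, because of accumulated rounding.
all_close = all(math.isclose(a[k], b[k]) for k in a)
print(same_keys, all_close)
```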

Full test code:

from collections import Counter
from timeit import Timer
import random

def proba_1(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}

def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d


t = Timer(lambda: proba_1(l))
t_2 = Timer(lambda: proba_2(l))

p1 = 0
p2 = 0

total_diff = 0.0

for i in range(1,1001):
    l = [random.randrange(1,101,1) for _ in range (100)]
    if i % 2 == 0:
        proba_1_time = t.timeit(number=10000)
        proba_2_time = t_2.timeit(number=10000)
    else:
        proba_2_time = t_2.timeit(number=10000)
        proba_1_time = t.timeit(number=10000)

    print(proba_1(l),proba_1_time, proba_2(l), proba_2_time)
    if proba_1_time < proba_2_time:
        print("Proba_1 wins: " + str(proba_1_time))
        p1 += 1
    else:
        print("Proba_2 wins: " + str(proba_2_time))
        p2 += 1
    total_diff += abs(proba_1_time - proba_2_time)

    print(p1,p2, total_diff/i)

Answer 3

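The numpy_indexed package (disclaimer: I am its author) provides a generalization of the numpy arraysetops module, including utilities that solve your problem in an elegant and vectorized way: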

import numpy_indexed as npi
keys, counts = npi.count(x)
proba = counts / len(x)

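I'm not sure how Counter stacks up performance-wise; I believe it is pretty well optimized. However, in cases where the elements of x can themselves be represented as an ndarray, I would expect this method to pull ahead.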
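To illustrate the "elements representable as ndarray" case, suppose each element of x is a row of a 2-D array. The sketch below uses plain NumPy (`np.unique` with `axis=0`), not `npi.count`, so it is an analogue of the idea rather than the numpy_indexed implementation; the `rows` array is made up for the example.

```python
import numpy as np

# Each row is treated as one "element" of the list.
rows = np.array([[0, 1],
                 [0, 1],
                 [2, 3],
                 [0, 1]])

# Unique rows (sorted lexicographically) and how often each one occurs.
keys, counts = np.unique(rows, axis=0, return_counts=True)
proba = counts / len(rows)

print(keys)
print(proba)
```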

3 solutions

Solution 1
2 2017-05-09 19:20:27

Solution 2
1 2017-05-09 03:58:26

Solution 3
1 accepted 2017-05-09 06:43:02