Fastest way to get the probabilities with which an element is present in a list
I wrote a function that returns a dictionary mapping each element of the input list to the probability of picking that element from the list:
from collections import Counter
def proba(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}
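Is there a faster solution? If the output order of the probabilities corresponds to the input order of the elements, I don't need the output to be key:value pairs.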
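One way to read that last sentence is that a plain sequence of probabilities, aligned with the input elements, would be enough. Below is a minimal sketch of that reading using only the standard library. The name `proba_aligned` is hypothetical and does not appear in the question, and nothing here is claimed to be faster than the original function.

```python
from collections import Counter

def proba_aligned(x):
    # Hypothetical helper: count each distinct element once,
    # then look up the count for every input element in order.
    counts = Counter(x)
    n = len(x)
    return [counts[item] / n for item in x]

print(proba_aligned(["a", "b", "a", "c"]))
```

The result has one entry per input element, so repeated elements repeat their probability.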
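In a comment on Eelco's answer, you wrote:
"If the input is np.random.randint(low=0, high=100, size=50000)..."
numpy_indexed has some powerful tools, but for data like this (non-negative integers in a small range) you can get better performance with numpy.bincount: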
In [11]: import numpy as np
In [12]: import numpy_indexed as npi
In [13]: x = np.random.randint(low=0, high=100, size=50000)
Here is the calculation using numpy.bincount. The result is an array of length x.max()+1, where entry i is the fraction of elements equal to i.
In [14]: np.bincount(x)/len(x)
Out[14]:
array([ 0.01066, 0.01022, 0.01048, 0.00994, 0.01026, 0.00972,
0.0107 , 0.00962, 0.0098 , 0.00922, 0.00996, 0.01038,
0.01024, 0.01118, 0.01012, 0.01098, 0.00988, 0.00996,
0.00974, 0.0097 , 0.00994, 0.01004, 0.0099 , 0.01034,
0.01066, 0.01032, 0.01042, 0.00896, 0.00958, 0.01008,
0.01038, 0.00974, 0.01068, 0.00952, 0.00998, 0.00932,
0.00964, 0.0103 , 0.0099 , 0.0093 , 0.0101 , 0.01012,
0.0097 , 0.00988, 0.0099 , 0.01076, 0.01008, 0.0097 ,
0.00986, 0.00998, 0.00976, 0.00984, 0.01008, 0.01008,
0.00938, 0.00998, 0.00976, 0.0093 , 0.00974, 0.00958,
0.00984, 0.01032, 0.00988, 0.01014, 0.01088, 0.01006,
0.0097 , 0.01026, 0.00952, 0.01002, 0.00938, 0.01024,
0.01024, 0.00984, 0.00922, 0.01044, 0.0101 , 0.01052,
0.01002, 0.00996, 0.0101 , 0.00976, 0.00986, 0.01062,
0.01064, 0.01008, 0.00992, 0.00972, 0.01006, 0.01026,
0.01018, 0.01044, 0.0092 , 0.00982, 0.00994, 0.00958,
0.00958, 0.01012, 0.01024, 0.00996])
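To make the shape of that result concrete, here is a small sketch on a hand-made array (the data and the variable names are illustrative assumptions, not part of the answer). Values that never occur still get a slot, with probability 0, which is why the length is x.max()+1.

```python
import numpy as np

x = np.array([0, 2, 2, 5])           # small illustrative input
probs = np.bincount(x) / len(x)      # probs[i] is the share of elements equal to i

# Values 1, 3 and 4 never occur, so their entries are 0.
print(probs.tolist())

# If a dict is still wanted, keep only the values that actually occur.
as_dict = {int(v): float(probs[v]) for v in np.flatnonzero(probs)}
print(as_dict)
```

Note that np.bincount only accepts non-negative integers, so this applies to inputs like the randint example above.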
Here is a timing comparison; note the change of units in the results:
In [24]: %timeit npi.count(x)[1]/len(x)
1.35 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: %timeit np.bincount(x)/len(x)
76.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
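np.bincount only works on non-negative integers. For inputs that contain negative numbers (or floats or strings), np.unique with return_counts=True is a pure-NumPy fallback. The sketch below uses made-up data and makes no claim about how its speed compares with the timings above.

```python
import numpy as np

x = np.array([-3, 7, 7, -3, 7, 10])  # illustrative data that bincount would reject

# keys are the sorted distinct values; counts[i] is how often keys[i] occurs.
keys, counts = np.unique(x, return_counts=True)
probs = counts / len(x)

print(keys.tolist(), counts.tolist())
```

Unlike the bincount result, probs here has one entry per distinct value and has to be read together with keys.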
This method beat yours in 97.6% of the runs:
def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d
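Although not by a large margin (an average difference of 0.006 over 1000 runs). Essentially, your code is already algorithmically optimal (it is O(n)); what remains is micro-optimization.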
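One caveat when swapping the two approaches: the loop version adds 1/n repeatedly, while the Counter version divides once per key, so the floating-point results need not be bit-for-bit identical. A minimal sketch of a tolerance-based check follows. The function names `proba_counter` and `proba_loop` are hypothetical stand-ins for the two approaches, not the code from this answer.

```python
import math
import random
from collections import Counter

def proba_counter(x):
    # One division per distinct key.
    n = len(x)
    return {key: val / n for key, val in Counter(x).items()}

def proba_loop(x):
    # Repeated addition of 1/n, one step per element.
    single_prob = 1 / len(x)
    d = {}
    for item in x:
        d[item] = d.get(item, 0.0) + single_prob
    return d

random.seed(0)
data = [random.randrange(1, 101) for _ in range(100)]
a, b = proba_counter(data), proba_loop(data)

same_keys = a.keys() == b.keys()
all_close = all(math.isclose(a[k], b[k]) for k in a)
print(same_keys, all_close)
```

Comparing with math.isclose rather than == avoids false mismatches caused by rounding in the accumulated sums.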
Full test code:
from collections import Counter
from timeit import Timer
import random
def proba_1(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}

def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d

t = Timer(lambda: proba_1(l))
t_2 = Timer(lambda: proba_2(l))
p1 = 0
p2 = 0
total_diff = 0.0
for i in range(1, 1001):
    l = [random.randrange(1, 101, 1) for _ in range(100)]
    # Alternate which function is timed first.
    if i % 2 == 0:
        proba_1_time = t.timeit(number=10000)
        proba_2_time = t_2.timeit(number=10000)
    else:
        proba_2_time = t_2.timeit(number=10000)
        proba_1_time = t.timeit(number=10000)
    print(proba_1(l), proba_1_time, proba_2(l), proba_2_time)
    if proba_1_time < proba_2_time:
        print("Proba_1 wins: " + str(proba_1_time))
        p1 += 1
    else:
        print("Proba_2 wins: " + str(proba_2_time))
        p2 += 1
    total_diff += abs(proba_1_time - proba_2_time)
print(p1, p2, total_diff/i)
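The numpy_indexed package (disclaimer: I am its author) provides a generalization of the numpy arraysetops module, including utilities that solve your problem in an elegant and vectorized way: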
import numpy_indexed as npi
keys, counts = npi.count(x)
proba = counts / len(x)
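I'm not sure how Counter stacks up in terms of performance; I believe it is quite well optimized. However, when the elements of x can themselves be represented as an ndarray, I would expect this approach to come out ahead.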