Fast way to get the number of items in a list below a certain threshold

Question

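I'm trying to quickly check how many items in a list fall below a series of thresholds, similar to doing what is described here but many times over. The point is to run some diagnostics on machine learning models that go deeper than what is built into scikit-learn (ROC curves, etc.).

Imagine preds is a list of predictions (probabilities between 0 and 1). In reality I will have over 1 million of them, which is why I'm trying to speed this up.

This creates some fake scores, normally distributed and rescaled to lie between 0 and 1.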

import numpy as np

fake_preds = [np.random.normal(0, 1) for i in range(1000)]
fake_preds = [(pred + np.abs(min(fake_preds)))/max(fake_preds + np.abs(min(fake_preds))) for pred in fake_preds]

Right now, the way I do this is to loop over about 100 threshold levels and check how many predictions are lower than each threshold:

thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]

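This takes about 1.5 seconds for 10k predictions (less time than generating the fake predictions), but you can imagine it takes much longer with more predictions. And I have to do it a few thousand times to compare a bunch of different models.

Any ideas on how to make that second code block faster? I think there must be a way to order the predictions so that it's easier for the computer to check the thresholds (similar to an index in a SQL-like scenario), but I can't figure out any way to check them other than sum(fake_preds < thresh), and that doesn't take advantage of any indexing or ordering.

Thanks in advance for the help!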

Answer 1

One way is to use numpy.histogram.

thresh_cov = np.histogram(fake_preds, len(thresholds))[0].cumsum()

From timeit, I'm getting:

%timeit my_cov = np.histogram(fake_preds, len(thresholds))[0].cumsum()
169 µs ± 6.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
172 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
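
One caveat: when given an integer bin count, np.histogram spaces the bin edges evenly between the data's minimum and maximum, so the cumulative counts only approximately line up with the thresholds. A minimal sketch, assuming the thresholds themselves are passed as bin edges (the 0.0 lower edge and the names edges, counts and reference are my own choices for illustration):

```python
import numpy as np

np.random.seed(0)
fake_preds = np.random.rand(10000)                     # stand-in predictions in [0, 1)
thresholds = np.round(np.arange(0.01, 1.0, 0.01), 2)

# Use the thresholds as explicit bin edges; 0.0 is an assumed lower bound.
edges = np.concatenate(([0.0], thresholds))
counts = np.histogram(fake_preds, bins=edges)[0].cumsum()

# Brute-force reference: predictions strictly below each threshold.
reference = (fake_preds[None, :] < thresholds[:, None]).sum(axis=1)
print(np.array_equal(counts, reference))
```

Note that the last bin is closed on the right, so the final count would also include values exactly equal to the last threshold.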

Answer 2

Approach #1

You could sort the predictions array and then use searchsorted or np.digitize, like so -

np.searchsorted(np.sort(fake_preds), thresholds, 'right')

np.digitize(thresholds, np.sort(fake_preds))

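If you don't mind mutating the predictions array, sort in place with fake_preds.sort() and then use fake_preds in place of np.sort(fake_preds). This should be more efficient, as we avoid using any extra memory there.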
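
One detail worth checking: side='right' counts values less than or equal to each threshold, while the question's sum(fake_preds < thresh) is strict. For continuous scores the two rarely differ, but ties with a threshold would show up. A small sketch with made-up values, sorting in place and comparing both sides -

```python
import numpy as np

preds = np.array([0.30, 0.10, 0.20, 0.20, 0.05])   # made-up scores with ties at 0.20
thresholds = np.array([0.10, 0.20, 0.30])

preds.sort()   # in-place sort, no extra copy

strict = np.searchsorted(preds, thresholds, side='left')      # count of preds <  t
inclusive = np.searchsorted(preds, thresholds, side='right')  # count of preds <= t

print(strict)      # [1 2 4]
print(inclusive)   # [2 4 5]
```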

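Approach #2

Now, with the thresholds being about 100 evenly spaced values from 0 to 1, those thresholds are multiples of 0.01. Thus, we can simply scale each number by 100 and convert it to an int, which can then be fed quite directly as bin indices to np.bincount. Then, to get our desired result, use cumsum, like so -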

np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()
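
To make the bin arithmetic concrete, here is a tiny worked sketch with made-up scores. It assumes the scores lie in [0, 1] and the thresholds are exactly 0.01, 0.02, ..., 0.99; floating-point rounding right at a threshold could push a value into the neighboring bin.

```python
import numpy as np

preds = np.array([0.005, 0.013, 0.018, 0.5])    # made-up scores

bins = (preds * 100).astype(int)                # truncate to a bin index: [0, 1, 1, 50]
per_bin = np.bincount(bins, minlength=99)[:99]  # per_bin[k] counts scores in [k/100, (k+1)/100)
below = per_bin.cumsum()                        # below[k] counts scores < (k+1)/100

# Counts below 0.01, 0.02, 0.50 and 0.51 respectively.
print(below[0], below[1], below[49], below[50])   # 1 3 3 4
```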

Benchmarking

Approaches -

def searchsorted_app(fake_preds, thresholds):
    return np.searchsorted(np.sort(fake_preds), thresholds, 'right')

def digitize_app(fake_preds, thresholds):
    return np.digitize(thresholds, np.sort(fake_preds) )

def bincount_app(fake_preds, thresholds):
    return np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()

Runtime test and verification on 10000 elements -

In [210]: np.random.seed(0)
     ...: fake_preds = np.random.rand(10000)
     ...: thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
     ...: thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
     ...: 

In [211]: print(np.allclose(thresh_cov, searchsorted_app(fake_preds, thresholds)))
     ...: print(np.allclose(thresh_cov, digitize_app(fake_preds, thresholds)))
     ...: print(np.allclose(thresh_cov, bincount_app(fake_preds, thresholds)))
     ...: 
True
True
True

In [214]: %timeit [sum(fake_preds < thresh) for thresh in thresholds]
1 loop, best of 3: 1.43 s per loop

In [215]: %timeit searchsorted_app(fake_preds, thresholds)
     ...: %timeit digitize_app(fake_preds, thresholds)
     ...: %timeit bincount_app(fake_preds, thresholds)
     ...: 
1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 535 µs per loop
10000 loops, best of 3: 24.9 µs per loop

For searchsorted, that's a 2,700x+ speedup, and for bincount it's 57,000x+! With larger datasets, the gap between bincount and searchsorted is bound to increase, as bincount doesn't need to sort.

Answer 3

You can reshape thresholds here to enable broadcasting. First, here are a few possible changes to how you create fake_preds and thresholds that get rid of the loops.

np.random.seed(123)
fake_preds = np.random.normal(size=1000)
fake_preds = (fake_preds + np.abs(fake_preds.min())) \
           / (np.max(fake_preds + np.abs((fake_preds.min()))))
thresholds = np.linspace(.01, 1, 100)

Then what you want to do can be accomplished in one line:

print(np.sum(np.less(fake_preds, np.tile(thresholds, (1000,1)).T), axis=1))
[  2   2   2   2   2   2   5   5   6   7   7  11  11  11  15  18  21  26
  28  34  40  48  54  63  71  77  90 100 114 129 143 165 176 191 206 222
 240 268 288 312 329 361 392 417 444 479 503 532 560 598 615 648 671 696
 710 726 747 768 787 800 818 840 860 877 891 902 912 919 928 942 947 960
 965 970 978 981 986 987 988 991 993 994 995 995 995 997 997 997 998 998
 999 999 999 999 999 999 999 999 999 999]

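Walkthrough:

fake_preds has shape (1000,). You need to manipulate thresholds into a shape that is compatible with this for broadcasting. (See the general broadcasting rules.)

A second shape that is broadcastable would be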

print(np.tile(thresholds, (1000,1)).T.shape)
# (100, 1000)
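
The same broadcast can probably be written without np.tile by adding an axis to thresholds. A short sketch, using uniform stand-in scores rather than the normalized ones above (the names counts and tiled are just for illustration):

```python
import numpy as np

np.random.seed(123)
fake_preds = np.random.rand(1000)          # stand-in scores in [0, 1)
thresholds = np.linspace(.01, 1, 100)

# (100, 1) against (1000,) broadcasts to (100, 1000); no tiled copy of thresholds is built.
counts = (fake_preds < thresholds[:, None]).sum(axis=1)

# Same computation with the explicit tile, for comparison.
tiled = np.sum(np.less(fake_preds, np.tile(thresholds, (1000, 1)).T), axis=1)
print(counts.shape, np.array_equal(counts, tiled))
```

The (100, 1000) boolean array is still materialized either way, so memory grows with the number of predictions times the number of thresholds.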

Answer 4

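Option 1:

from scipy.stats import percentileofscore
thresh_cov = [percentileofscore(fake_preds, thresh) for thresh in thresholds]

Option 2: Same as above, but sort the list first.

Option 3: Insert the thresholds into the list, sort the list, and find the indices of the thresholds. Note that if you have a quicksort implementation, you can optimize it by using the thresholds as pivots and terminating the sort once everything has been partitioned around the thresholds.

Option 4: Building on the above: put the thresholds in a binary tree, then, for each item in the list, compare it against the thresholds with a binary search. You can do this item by item, or split the list into subsets at each step.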
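
Option 4 can be approximated in NumPy without building a tree: a sorted threshold array plus np.searchsorted performs the same kind of binary search for each item. A rough sketch under that assumption (the names slots, per_slot and counts_below are illustrative, not part of the answer):

```python
import numpy as np

np.random.seed(0)
fake_preds = np.random.rand(10000)                     # stand-in predictions in [0, 1)
thresholds = np.round(np.arange(0.01, 1.0, 0.01), 2)   # already sorted

# Binary-search each prediction in the thresholds:
# slot i means thresholds[i-1] <= pred < thresholds[i].
slots = np.searchsorted(thresholds, fake_preds, side='right')

# Count predictions per slot, then accumulate: a prediction is below
# thresholds[k] exactly when its slot is <= k.
per_slot = np.bincount(slots, minlength=len(thresholds) + 1)
counts_below = per_slot.cumsum()[:-1]

# Brute-force check against the direct comparison.
reference = (fake_preds[None, :] < thresholds[:, None]).sum(axis=1)
print(np.array_equal(counts_below, reference))
```

This avoids sorting the predictions; each one costs a binary search over the (much shorter) threshold array.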
