Fast way to count the items in a list that fall below a certain threshold

Question

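I'm trying to quickly check how many items in a list are below a series of thresholds, similar to doing what's described here, but many times. The point is to run some diagnostics on machine learning models that go a bit deeper than what is built into scikit-learn (ROC curves, etc.).

Imagine preds is a list of predictions (probabilities between 0 and 1). In reality I will have more than 1 million of them, which is why I'm trying to speed this up.

This creates some fake scores, normally distributed and rescaled to lie between 0 and 1.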

import numpy as np

# 1000 draws from a standard normal, then min-max scaled into [0, 1]
fake_preds = [np.random.normal(0, 1) for i in range(1000)]
fake_preds = [(pred + np.abs(min(fake_preds)))/max(fake_preds + np.abs(min(fake_preds))) for pred in fake_preds]

Right now, the way I do this is to loop through roughly 100 threshold levels and check how many predictions fall below each one:

thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]

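This takes about 1.5 seconds for 10k predictions (less time than it takes to generate the fake predictions), but you can imagine it takes much longer with more predictions. I also have to do this several thousand times to compare a bunch of different models.

Any thoughts on a way to make that second code block faster? I think there must be a way to order the predictions so that it is easier for the computer to check the thresholds (similar to an index in a SQL-like scenario), but I can't figure out any way to check them other than sum(fake_preds < thresh), and that doesn't take advantage of any indexing or ordering.

Thanks in advance for the help!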

Answer 1

One way is to use numpy.histogram.

thresh_cov = np.histogram(fake_preds, len(thresholds))[0].cumsum()

From timeit, I'm getting:

%timeit my_cov = np.histogram(fake_preds, len(thresholds))[0].cumsum()
169 µs ± 6.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
172 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
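One caveat worth spelling out: when np.histogram is given an integer, it places that many equal-width bins between the minimum and maximum of the data, so the cumulative counts line up with the thresholds only approximately. The following is a minimal sketch that passes explicit bin edges instead. The helper name hist_counts_below and the padding edges lo and hi are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def hist_counts_below(preds, thresholds):
    """Count, for each ascending threshold t, how many preds are < t."""
    preds = np.asarray(preds, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Pad with one edge below and one edge above everything, so that every
    # prediction lands in some bin and no threshold is the final edge
    # (np.histogram treats only the final edge as inclusive).
    lo = min(preds.min(), thresholds[0]) - 1.0
    hi = max(preds.max(), thresholds[-1]) + 1.0
    edges = np.concatenate(([lo], thresholds, [hi]))
    counts, _ = np.histogram(preds, bins=edges)
    # Bin k holds values in [edges[k], edges[k+1]), so the running total up
    # to bin k is the number of values below thresholds[k].
    return counts.cumsum()[:-1]

rng = np.random.default_rng(0)
preds = rng.random(10_000)
thresholds = np.round(np.arange(0.01, 1.0, 0.01), 2)
result = hist_counts_below(preds, thresholds)
expected = np.array([(preds < t).sum() for t in thresholds])
```

Here result and expected should agree element by element, since the bin edges are the thresholds themselves.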

Answer 2

Approach #1

You can sort the predictions array and then use searchsorted or np.digitize, like so:

np.searchsorted(np.sort(fake_preds), thresholds, 'right')

np.digitize(thresholds, np.sort(fake_preds))

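If you don't mind mutating the predictions array, sort it in place with fake_preds.sort() and then use fake_preds in place of np.sort(fake_preds). This should be more efficient, as it avoids allocating extra memory for a sorted copy.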
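A small illustration of the side argument may help here. With side='right', searchsorted counts predictions less than or equal to each threshold, while side='left' counts predictions strictly below it, which matches the fake_preds < thresh comparison in the question. For continuous scores the two rarely differ, but they do when a prediction equals a threshold exactly. The tiny arrays below are made up for the example.

```python
import numpy as np

preds = np.array([0.10, 0.25, 0.25, 0.40, 0.90])
thresholds = np.array([0.25, 0.50, 1.00])

sorted_preds = np.sort(preds)
# side='left': number of predictions strictly below each threshold
strictly_below = np.searchsorted(sorted_preds, thresholds, side='left')
# side='right': number of predictions less than or equal to each threshold
at_or_below = np.searchsorted(sorted_preds, thresholds, side='right')

print(strictly_below)  # [1 4 5]
print(at_or_below)     # [3 4 5]
```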

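Approach #2

Now, with the thresholds being about 100 values from 0 to 1, those thresholds are multiples of 0.01. So we can simply scale each number up by 100 and convert it to an int, which can then be fed pretty directly as bin indices to np.bincount. Then, to get the desired result, use cumsum, like so: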

np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()
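To see why this works, here is a tiny worked example on made-up values. It assumes the predictions lie in [0, 1] and the thresholds are exactly 0.01, 0.02, ..., 0.99, so that counts[k] is the number of predictions below (k + 1) / 100.

```python
import numpy as np

preds = np.array([0.005, 0.013, 0.017, 0.5, 0.999])

# Truncate each prediction to its 0.01-wide bucket: 0, 1, 1, 50, 99
buckets = (preds * 100).astype(int)
# Count per bucket, keep buckets 0..98, then accumulate
counts = np.bincount(buckets, minlength=99)[:99].cumsum()

# Predictions below 0.01, 0.02, 0.50, 0.51 and 0.99 respectively
print(counts[0], counts[1], counts[49], counts[50], counts[98])  # 1 3 3 4 4
```

The slice [:99] drops bucket 99 (and bucket 100, if a prediction equals 1.0), which is what we want because those values are not below any threshold up to 0.99.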

Benchmarking

Approaches:

def searchsorted_app(fake_preds, thresholds):
    return np.searchsorted(np.sort(fake_preds), thresholds, 'right')

def digitize_app(fake_preds, thresholds):
    return np.digitize(thresholds, np.sort(fake_preds) )

def bincount_app(fake_preds, thresholds):
    return np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()

Runtime test and verification on 10,000 elements:

In [210]: np.random.seed(0)
     ...: fake_preds = np.random.rand(10000)
     ...: thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
     ...: thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
     ...: 

In [211]: print(np.allclose(thresh_cov, searchsorted_app(fake_preds, thresholds)))
     ...: print(np.allclose(thresh_cov, digitize_app(fake_preds, thresholds)))
     ...: print(np.allclose(thresh_cov, bincount_app(fake_preds, thresholds)))
     ...: 
True
True
True

In [214]: %timeit [sum(fake_preds < thresh) for thresh in thresholds]
1 loop, best of 3: 1.43 s per loop

In [215]: %timeit searchsorted_app(fake_preds, thresholds)
     ...: %timeit digitize_app(fake_preds, thresholds)
     ...: %timeit bincount_app(fake_preds, thresholds)
     ...: 
1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 535 µs per loop
10000 loops, best of 3: 24.9 µs per loop

That's a 2,700x+ speedup for searchsorted and 57,000x+ for bincount! With larger datasets, the gap between bincount and searchsorted is bound to increase, as bincount doesn't need to sort.

Answer 3

You can reshape thresholds here to enable broadcasting. First, here are a few possible changes to how you create fake_preds and thresholds that get rid of the loops.

np.random.seed(123)
fake_preds = np.random.normal(size=1000)
fake_preds = (fake_preds + np.abs(fake_preds.min())) \
           / (np.max(fake_preds + np.abs((fake_preds.min()))))
thresholds = np.linspace(.01, 1, 100)

Then what you want to do can be accomplished in one line:

print(np.sum(np.less(fake_preds, np.tile(thresholds, (1000,1)).T), axis=1))
[  2   2   2   2   2   2   5   5   6   7   7  11  11  11  15  18  21  26
  28  34  40  48  54  63  71  77  90 100 114 129 143 165 176 191 206 222
 240 268 288 312 329 361 392 417 444 479 503 532 560 598 615 648 671 696
 710 726 747 768 787 800 818 840 860 877 891 902 912 919 928 942 947 960
 965 970 978 981 986 987 988 991 993 994 995 995 995 997 997 997 998 998
 999 999 999 999 999 999 999 999 999 999]

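Walkthrough:

fake_preds has shape (1000,). You need to manipulate thresholds into a shape that is compatible with this for broadcasting. (See the general broadcasting rules.)

A broadcastable second shape would be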

print(np.tile(thresholds, (1000,1)).T.shape)
# (100, 1000)
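As a possible variation, the same broadcast can be triggered without np.tile by adding a trailing axis to thresholds. This is a sketch rather than the answer's own code; the random data and the names below and tiled are illustrative assumptions.

```python
import numpy as np

np.random.seed(123)
fake_preds = np.random.rand(1000)        # shape (1000,)
thresholds = np.linspace(0.01, 1, 100)   # shape (100,)

# thresholds[:, None] has shape (100, 1); comparing it with the (1000,)
# array broadcasts to a (100, 1000) boolean matrix without first building
# a tiled copy of thresholds.
below = fake_preds < thresholds[:, None]
counts = below.sum(axis=1)               # one count per threshold

# Same result as the np.tile version above
tiled = np.sum(np.less(fake_preds, np.tile(thresholds, (1000, 1)).T), axis=1)
```

Note that either form still materializes a thresholds-by-predictions boolean matrix, so memory grows with the product of the two lengths.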

Answer 4

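Option 1:

from scipy.stats import percentileofscore
# Note: percentileofscore returns a percentage (0 to 100), not a raw count
thresh_cov = [percentileofscore(fake_preds, thresh) for thresh in thresholds]

Option 2: Same as above, but sort the list first.

Option 3: Insert your thresholds into the list, sort the list, and find the indices of your thresholds. Note that if you have a quicksort algorithm, you could optimize it by using the thresholds as pivots and terminating the sort once everything has been partitioned according to the thresholds.

Option 4: Building on the above: put your thresholds in a binary tree, then for each item in the list, compare it against the thresholds with a binary search. You can either do this item by item, or split the list into subsets at each step.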
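A minimal sketch of the item-by-item variant of Option 4, using a sorted list and the standard-library bisect module in place of an explicit binary tree. The function name count_below is a hypothetical helper, and the thresholds are assumed to be sorted in ascending order.

```python
from bisect import bisect_right
from itertools import accumulate

def count_below(values, thresholds):
    """For each threshold t (sorted ascending), count values v with v < t."""
    buckets = [0] * (len(thresholds) + 1)
    for v in values:
        # Binary search: index of the first threshold strictly greater than v
        buckets[bisect_right(thresholds, v)] += 1
    # Values in buckets 0..k are exactly those below thresholds[k];
    # the final running total (all values) is dropped.
    return list(accumulate(buckets))[:-1]

print(count_below([0.1, 0.5, 0.5, 0.9], [0.2, 0.5, 1.0]))  # [1, 1, 4]
```

This does one binary search per item over the thresholds, so it needs no sort of the predictions themselves. In pure Python it will likely be slower than the NumPy approaches above for large inputs.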

Solution 1
2 2017-10-30 20:13:24

Solution 2
2 Accepted 2017-10-30 20:24:23

Solution 3
1 2017-10-30 20:10:34

Solution 4
0 2017-10-30 20:17:47
