How can I speed up my NumPy loop with numpy.where()?

Question

I recently wrote a function for an ordered logit model.
But it takes a long time to run on large data sets.
So I want to rewrite the code, replacing the if statements with numpy.where.
My new code has some problems, and I don't know how to fix them.
If you know how, please help me. Thank you very much!

This is my original function.

import numpy as np
from scipy.stats import logistic

def func(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
           ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
           ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
           for i in range(1, len(thresholds)):
               if row[0] == i:
                   diff_prob = logistic.cdf(thresholds[i] - row[1]) - logistic.cdf(thresholds[i - 1] - row[1])
                   if diff_prob <= 10 ** -5:
                       ll += np.log(10 ** -5)
                   else:
                       ll += np.log(diff_prob)
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func(y, X, thresholds))

This is the new, but still imperfect, code.

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
          np.where(y == len(thresholds), logistic.logcdf(X - thresholds[-1]),
                   np.log(logistic.cdf(thresholds[1] - X) - logistic.cdf(thresholds[0] - X))))
print(ll.sum())

The problem is that I don't know how to rewrite the inner loop, the one that iterates over i from 1 to len(thresholds) - 1.

Answer 1

I think asking how to implement this using only np.where is a bit of an XY problem.

So I'll try to explain how I would approach optimizing this function.

My first instinct is to get rid of the inner for loop, which was the pain point anyway:

import numpy as np
from scipy.stats import logistic

def func1(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
                         logistic.cdf(thresholds[row[0] - 1] - row[1])
            diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
            ll += np.log(diff_prob)
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))

I have just replaced i with row[0], without changing the semantics of the loop. So that's one fewer for loop.

Next, I would like the statements in the different branches of the if-else to have the same form. To that end:

import numpy as np
from scipy.stats import logistic

def func2(y, X, thresholds):
    ll = 0.0

    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            ll += np.log(
                np.maximum(
                    10 ** -5, 
                    logistic.cdf(thresholds[row[0]] - row[1]) -
                     logistic.cdf(thresholds[row[0] - 1] - row[1])
                )
            )
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))

Now the statement in each branch has the form ll += expr.

At this point there are a couple of different paths the optimization can take. You can try to optimize the loop away by writing it as a comprehension, but I suspect that it will not give you much of a speedup.
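
For reference, the comprehension route might look like the sketch below. The names row_loglik and func_comprehension are hypothetical helpers introduced only for this illustration. Each row still costs one Python-level call into scipy, which is why a large speedup seems unlikely.

```python
import numpy as np
from scipy.stats import logistic

def row_loglik(label, x, thresholds):
    """Log-likelihood contribution of a single row (hypothetical helper)."""
    if label == 0:
        return logistic.logcdf(thresholds[0] - x)
    if label == len(thresholds):
        return logistic.logcdf(x - thresholds[-1])
    diff_prob = (logistic.cdf(thresholds[label] - x)
                 - logistic.cdf(thresholds[label - 1] - x))
    # Same 10 ** -5 floor as in the loop versions.
    return np.log(np.maximum(10 ** -5, diff_prob))

def func_comprehension(y, X, thresholds):
    # A generator expression replaces the explicit accumulation loop,
    # but the work is still done one row at a time.
    return sum(row_loglik(label, x, thresholds) for label, x in zip(y, X))

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func_comprehension(y, X, thresholds))
```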

An alternative path is to pull the if conditions out of the loop. That is what you intended with np.where as well:

import numpy as np
from scipy.stats import logistic

def func3(y, X, thresholds):
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)

    ll_1 = logistic.logcdf(thresholds[0] - X[ y_0 ])
    ll_2 = logistic.logcdf(X[ y_end ] - thresholds[-1])
    ll_3 = np.log(
        np.maximum(
            10 ** -5, 
            logistic.cdf(thresholds[y[ y_rest ]] - X[ y_rest ]) -
              logistic.cdf(thresholds[ y[y_rest] - 1 ] - X[ y_rest])
        )
    )
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)

y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))

Note that I turned X into an np.array so that I can use fancy indexing on it.
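
A small sketch of why the conversion matters: a plain Python list rejects a boolean array as an index, while an ndarray accepts both boolean masks and integer index arrays. The variable names here are made up for the illustration.

```python
import numpy as np

X_list = [2, 2, 2]
mask = np.array([True, False, True])

failed = False
try:
    X_list[mask]              # a plain list cannot be indexed with an array
except TypeError as exc:
    failed = True
    print("list indexing failed:", exc)

X_arr = np.asarray(X_list)
print(X_arr[mask])            # keeps the elements where mask is True

# Integer arrays work as indices too, which is how thresholds[y[y_rest]]
# looks up one threshold per row.
thresholds = np.array([2, 3])
labels = np.array([1, 0, 1])
print(thresholds[labels])
```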

At this point, I'd wager that it is fast enough for my purposes. However, you can stop earlier or go beyond this point, depending on your requirements.
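
If the nested np.where form from the question is still wanted, one possible way to complete it is sketched below. The function name func_where and the np.clip trick are my own assumptions, not part of the code above. np.where evaluates every branch for every row, so the threshold indices are clipped into range first and the out-of-range rows are discarded by the outer conditions.

```python
import numpy as np
from scipy.stats import logistic

def func_where(y, X, thresholds):
    y = np.asarray(y)
    X = np.asarray(X, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    k = len(thresholds)

    # Clip so that rows with y == 0 or y == k still index validly;
    # their "middle" value is a placeholder that np.where never selects.
    upper = thresholds[np.clip(y, 0, k - 1)]
    lower = thresholds[np.clip(y - 1, 0, k - 1)]
    diff_prob = logistic.cdf(upper - X) - logistic.cdf(lower - X)
    middle = np.log(np.maximum(10 ** -5, diff_prob))

    ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
                  np.where(y == k, logistic.logcdf(X - thresholds[-1]),
                           middle))
    return ll.sum()

y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func_where(y, X, thresholds))
```

Compared with the masked version in func3, this computes all three expressions for every row, so it does somewhat more work, but it keeps the shape of the original attempt.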

On my computer, I get the following results:

y = np.random.randint(0, 11, size=(10000,))  # integers in [0, 10], inclusive
X = np.random.randint(0, 11, size=(10000,))
thresholds = np.cumsum(np.random.rand(10))

%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop

%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop

%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop

%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop
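
The timings are only meaningful if the versions return the same value, so a quick agreement check is worth running first. The sketch below is an assumption-laden stand-in: func_loop and func_vectorized are renamed copies of the loop and masked versions above, and the seed and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import logistic

def func_loop(y, X, thresholds):
    # Reference implementation: one Python-level iteration per row.
    ll = 0.0
    for label, x in zip(y, X):
        if label == 0:
            ll += logistic.logcdf(thresholds[0] - x)
        elif label == len(thresholds):
            ll += logistic.logcdf(x - thresholds[-1])
        else:
            diff_prob = (logistic.cdf(thresholds[label] - x)
                         - logistic.cdf(thresholds[label - 1] - x))
            ll += np.log(np.maximum(10 ** -5, diff_prob))
    return ll

def func_vectorized(y, X, thresholds):
    # Same masks and expressions as the masked version above.
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)
    ll_1 = logistic.logcdf(thresholds[0] - X[y_0])
    ll_2 = logistic.logcdf(X[y_end] - thresholds[-1])
    diff_prob = (logistic.cdf(thresholds[y[y_rest]] - X[y_rest])
                 - logistic.cdf(thresholds[y[y_rest] - 1] - X[y_rest]))
    ll_3 = np.log(np.maximum(10 ** -5, diff_prob))
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
n = 500                                 # small sample keeps the loop quick
y = rng.integers(0, 11, size=n)         # labels in [0, 10]
X = rng.integers(0, 11, size=n)
thresholds = np.cumsum(rng.random(10))  # 10 increasing thresholds

loop_ll = func_loop(y, X, thresholds)
vec_ll = func_vectorized(y, X, thresholds)
print(loop_ll, vec_ll)
assert np.isclose(loop_ll, vec_ll)
```

np.isclose is used instead of == because the two versions add the terms in a different order, which can change the last few floating-point digits.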

Answer 1
4 · accepted · 2015-08-03 20:44:05