
Improve performance of the np.irr function through vectorization

Is it possible to improve the performance of the np.irr function so that it can be applied to a 2-dimensional array of cash flows without a for-loop, either by vectorizing np.irr itself or by using an alternative algorithm?

The irr function in the numpy library calculates the periodically compounded rate of return that gives a net present value of 0 for an array of cash flows. This function can only be applied to a 1-dimensional array:

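import numpy as np

# Note: np.irr was removed in NumPy 1.20; numpy_financial.irr is its replacement
x = np.array([-100,50,50,50])
r = np.irr(x)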
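To make the definition concrete, here is a minimal sketch that evaluates the net present value at a candidate rate. The helper name `npv` is illustrative only and is not a NumPy function; the IRR is the rate at which this value crosses zero.

```python
import numpy as np

def npv(rate, cash_flows):
    """Net present value of cash flows at periods 0, 1, 2, ... (illustrative helper)."""
    cash_flows = np.asarray(cash_flows, dtype=float)
    periods = np.arange(cash_flows.size)
    return float(np.sum(cash_flows / (1.0 + rate) ** periods))

x = np.array([-100, 50, 50, 50])

# The NPV is positive at 23% and negative at 24%, so the IRR of x lies in between
print(npv(0.23, x), npv(0.24, x))
```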

np.irr will not work on a 2-dimensional array of cash flows, such as:

cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50

where each row represents a series of cash flows and each column represents a time period. A slow implementation would therefore loop over the rows and apply np.irr to each one:

out = []
for x in cfs:
    out.append(np.irr(x))

For large arrays, this loop is a performance bottleneck. Looking at the source code of the np.irr function, I believe the main obstacle is vectorizing the np.roots function:

def irr(values):
    res = np.roots(values[::-1])
    mask = (res.imag == 0) & (res.real > 0)
    if res.size == 0:
        return np.nan
    res = res[mask].real
    # NPV(rate) = 0 can have more than one solution so we return
    # only the solution closest to zero.
    rate = 1.0/res - 1
    rate = rate.item(np.argmin(np.abs(rate)))
    return rate
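The reason np.roots shows up in this source can be written out explicitly. With cash flows C_0, ..., C_n (notation introduced here for illustration) and the substitution x = 1/(1+r), the zero-NPV condition becomes a polynomial equation in x:

```latex
\mathrm{NPV}(r) = \sum_{t=0}^{n} \frac{C_t}{(1+r)^{t}} = 0
\quad\Longleftrightarrow\quad
\sum_{t=0}^{n} C_t\, x^{t} = 0,
\qquad x = \frac{1}{1+r},
\qquad r = \frac{1}{x} - 1 .
```

np.roots expects coefficients ordered from the highest power down, which is why the cash flows are reversed with values[::-1]. Each real, positive root x then maps back to a rate via 1.0/res - 1.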

I have found a similar implementation in R (Fast loan rate calculation for a big number of loans), but I don't know how to port it to Python. Also, I don't consider np.apply_along_axis or np.vectorize to be solutions to this issue, since my main concern is performance and I understand both are wrappers around a for-loop.

Thanks!

Looking at the source of np.roots,

import inspect
print(inspect.getsource(np.roots))

We see that it works by finding the eigenvalues of the "companion matrix". It also does some special handling of coefficients that are zero. I don't really understand the mathematical background, but I do know that np.linalg.eigvals can calculate eigenvalues for multiple matrices in a vectorized way.
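As a small check of that idea, the sketch below builds a stack of companion matrices for two made-up cubic polynomials and gets all their roots from a single np.linalg.eigvals call. The example polynomials and variable names are assumptions for illustration; the matrix layout (ones on the subdiagonal, -p[1:]/p[0] in the first row) follows what np.roots does for a single polynomial.

```python
import numpy as np

# Two cubic polynomials, coefficients ordered from highest power to lowest
polys = np.array([[50.0, 50.0, 50.0, -100.0],
                  [1.0, -6.0, 11.0, -6.0]])   # second one has roots 1, 2, 3

n = polys.shape[1] - 1                        # polynomial degree

# Stack of companion matrices, shape (2, 3, 3)
A = np.zeros((polys.shape[0], n, n))
A[:, np.arange(1, n), np.arange(n - 1)] = 1.0   # ones on the subdiagonal
A[:, 0, :] = -polys[:, 1:] / polys[:, :1]       # first row: -p[1:] / p[0]

# One call handles every matrix in the stack; result has shape (2, 3)
batched_roots = np.linalg.eigvals(A)

# Each computed root should make its polynomial (numerically) zero
residuals = [np.polyval(p, r) for p, r in zip(polys, batched_roots)]
print(np.sort(batched_roots[1].real))
```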

Merging it with the source of np.irr resulted in the following "Frankencode":

def irr_vec(cfs):
    # Create companion matrix for every row in `cfs`
    M, N = cfs.shape
    A = np.zeros((M, (N-1)**2))
    A[:,N-1::N] = 1
    A = A.reshape((M,N-1,N-1))
    A[:,0,:] = cfs[:,-2::-1] / -cfs[:,-1:]  # slice [-1:] to keep dims

    # Calculate roots; `eigvals` is a gufunc
    res = np.linalg.eigvals(A)

    # Find the solution that makes the most sense...
    mask = (res.imag == 0) & (res.real > 0)
    res = np.ma.array(res.real, mask=~mask, fill_value=np.nan)
    rate = 1.0/res - 1
    idx = np.argmin(np.abs(rate), axis=1)
    irr = rate[np.arange(M), idx].filled()
    return irr

This does not handle zero coefficients and will certainly fail when any(cfs[:,-1] == 0). Some input argument checking wouldn't hurt either, and there may be other problems. But for the supplied example data it achieves what we wanted (at the cost of increased memory use):

In [487]: cfs = np.zeros((10000,4))
     ...: cfs[:,0] = -100
     ...: cfs[:,1:] = 50

In [488]: %timeit [np.irr(x) for x in cfs]
1 loops, best of 3: 2.96 s per loop

In [489]: %timeit irr_vec(cfs)
10 loops, best of 3: 77.8 ms per loop
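One possible way around the failure on trailing zeros, offered only as a sketch and not as part of the answer's code, is to group the rows by the position of their last nonzero cash flow. Each group can then be truncated to that length before the companion matrices are built. The helper name `split_by_last_nonzero` is hypothetical.

```python
import numpy as np

def split_by_last_nonzero(cfs):
    """Map 'index of last nonzero cash flow' -> row indices (hypothetical helper).

    Rows that are entirely zero are collected under the key -1.
    """
    cfs = np.asarray(cfs, dtype=float)
    nonzero = cfs != 0
    # argmax on the column-reversed mask finds the last nonzero column per row
    last = np.where(nonzero.any(axis=1),
                    cfs.shape[1] - 1 - np.argmax(nonzero[:, ::-1], axis=1),
                    -1)
    return {int(k): np.flatnonzero(last == k) for k in np.unique(last)}

demo = np.array([[-100.0, 50.0, 50.0, 0.0],
                 [-100.0, 60.0, 60.0, 60.0],
                 [0.0, 0.0, 0.0, 0.0]])
groups = split_by_last_nonzero(demo)
print(groups)
```

For every key k >= 1, the rows in that group could be passed on as cfs[rows, :k + 1], so the last column is never zero; rows under -1 (or with k == 0) have no meaningful IRR and would need separate handling.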

If you have the special case of loans with a fixed payback amount (as in the question you linked), you may be able to do it faster using interpolation...
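A rough sketch of what such an interpolation could look like, assuming every loan is repaid in the same number of equal payments: the ratio principal/payment equals the annuity factor (1 - (1 + r)^-n) / r, which can be tabulated once on a rate grid and inverted with np.interp. The function name `annuity_irr` and the default grid from roughly 0% to 100% are assumptions, not something taken from the linked R code.

```python
import numpy as np

def annuity_irr(principal, payment, n_periods, rate_grid=None):
    """IRR of loans repaid in n_periods equal payments, by table lookup (sketch)."""
    if rate_grid is None:
        # Assumed search range; rates outside it are clamped to the grid ends
        rate_grid = np.linspace(1e-6, 1.0, 20001)
    # Present value of one unit paid per period, at each grid rate
    factor = (1.0 - (1.0 + rate_grid) ** -n_periods) / rate_grid
    ratio = np.asarray(principal, dtype=float) / np.asarray(payment, dtype=float)
    # factor falls as the rate rises; np.interp needs increasing x, so reverse both
    return np.interp(ratio, factor[::-1], rate_grid[::-1])

rates = annuity_irr(principal=np.full(10000, 100.0),
                    payment=np.full(10000, 50.0),
                    n_periods=3)
print(rates[0])
```

The accuracy depends on the grid spacing, and this only covers the fixed-payment case; irregular cash flows still need one of the general approaches.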

After posting this question I kept working on it and came up with a vectorized solution that uses a different algorithm:

def virr(cfs, precision = 0.005, rmin = 0, rmax1 = 0.3, rmax2 = 0.5):
    ''' 
    Vectorized IRR calculator. First calculate a 3D array of the discounted
    cash flows along cash flow series, time period, and discount rate. Sum over time to 
    collapse to a 2D array which gives the NPV along a range of discount rates 
    for each cash flow series. Next, find crossover where NPV is zero--corresponds
    to the lowest real IRR value. For performance, negative IRRs are not calculated
    -- returns "-1", and values are only calculated to an acceptable precision.

    IN:
        cfs - numpy 2d array - rows are cash flow series, cols are time periods
        precision - step size of the rate grid in the inner IRR band, e.g. 0.005 (0.5%)
        rmin - lower bound of the inner IRR band eg 0%
        rmax1 - upper bound of the inner IRR band eg 30%
        rmax2 - upper bound of the outer IRR band. eg 50% Values in the outer 
                band are calculated to 1% precision, IRRs outside the upper band 
                return the rmax2 value
    OUT:
        r - numpy column array of IRRs for cash flow series
    '''

    if cfs.ndim == 1: 
        cfs = cfs.reshape(1,len(cfs))

    # Range of time periods
    years = np.arange(0,cfs.shape[1])

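    # Range of the discount rates: a fine grid on [rmin, rmax1] followed by
    # a coarser 1% grid up to rmax2 (uses the arguments, not hard-coded values)
    rates_length1 = int(round((rmax1 - rmin)/precision)) + 1
    rates_length2 = int(round((rmax2 - rmax1)/0.01))
    rates = np.zeros((rates_length1 + rates_length2,))
    rates[:rates_length1] = np.linspace(rmin, rmax1, rates_length1)
    rates[rates_length1:] = np.linspace(rmax1 + 0.01, rmax2, rates_length2)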

    # Discount rate multiplier rows are years, cols are rates
    drm = (1+rates)**-years[:,np.newaxis]

    # Calculate discounted cfs   
    discounted_cfs = cfs[:,:,np.newaxis] * drm

    # Calculate NPV array by summing over discounted cashflows
    npv = discounted_cfs.sum(axis = 1)

    ## Find where the NPV changes sign, implies an IRR solution
    signs = npv < 0

    # Find the pairwise differences in boolean values when sign crosses over, the
    # pairwise diff will be True
    crossovers = np.diff(signs,1,1)

    # Extract the irr from the first crossover for each row
    irr = np.min(np.ma.masked_equal(rates[1:]* crossovers,0),1)

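    # Error handling: series whose undiscounted cash flows sum to less than
    # zero are returned as -1 (negative IRRs are not calculated); rows with
    # no crossover on the grid (IRR above rmax2) are returned as rmax2
    negative_irrs = cfs.sum(1) < 0
    r = np.where(negative_irrs, -1, irr)
    r = np.where(np.ma.getmaskarray(irr) & ~negative_irrs, rmax2, r)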

    return r

Performance:

import numpy as np
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50

%timeit [np.irr(x) for x in cfs]
10 loops, best of 3: 1.06 s per loop

%timeit virr(cfs)
10 loops, best of 3: 29.5 ms per loop
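The crossover step is the least obvious part of virr, so here is a stripped-down sketch of the same grid idea on two made-up cash flow series. It uses argmax on the boolean crossover array instead of the masked minimum used in virr, and a single 0.5% grid from 0% to 30%; both are simplifications chosen for illustration.

```python
import numpy as np

cfs = np.array([[-100.0, 50.0, 50.0, 50.0],
                [-100.0, 40.0, 40.0, 40.0]])
rates = np.linspace(0.0, 0.3, 61)          # 0.5% steps
periods = np.arange(cfs.shape[1])

# discount[t, k] = (1 + rates[k]) ** -t, shape (periods, rates)
discount = (1.0 + rates) ** -periods[:, np.newaxis]

# Broadcast to (series, periods, rates), then sum over periods -> (series, rates)
npv = (cfs[:, :, np.newaxis] * discount).sum(axis=1)

# True wherever the NPV changes sign between neighbouring grid rates
crossed = np.diff(npv < 0, axis=1)

# First crossover per series; report the upper grid rate of that interval
first = crossed.argmax(axis=1)
irr = np.where(crossed.any(axis=1), rates[1:][first], np.nan)
print(irr)
```

The reported value is the grid rate just above the true IRR, so the error is bounded by the grid step. The intermediate 3D array grows with series x periods x rates, which is the memory cost of this approach.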

pyxirr is very fast, and np.irr is deprecated, so I'd use this now:

https://pypi.org/project/pyxirr/

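import numpy as np
import pandas as pd
import pyxirr

cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50

# Transpose so each column holds one cash flow series; apply() then calls
# pyxirr.irr once per column
df = pd.DataFrame(cfs).T
df.apply(pyxirr.irr)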
