[英]Speed up an integration function in Python
我有一個函數,它是一些大問題的內在循環。 因此,這將被稱為數百萬次。 我已經嘗試對其進行優化。 但是,由於這是我的第一個數字項目,所以我想知道是否還有其他方法可以提高速度。
cython似乎無濟於事。 也許numpy已經接近c了。 否則我無法有效編寫cython代碼。
import numpy as np
import math
import numexpr as ne
par_mu_rho = 0.8
par_alpha_rho = 0.7
# ' the first two are mean of mus and the '
# ' last two are the mean of alphas.'
cov_epsilon = [[1, par_mu_rho], [par_mu_rho, 1]]
cov_nu = [[1, par_alpha_rho], [par_alpha_rho, 1]]
nrows = 10000
np.random.seed(123)
epsilon_sim = np.random.multivariate_normal([0, 0], cov_epsilon, nrows)
nu_sim = np.random.multivariate_normal([0, 0], cov_nu, nrows)
errors = np.concatenate((epsilon_sim, nu_sim), axis=1)
errors = np.exp(errors)
### the function to be optimized
def mktout(mean_mu_alpha, errors, par_gamma):
mu10 = errors[:, 0] * math.exp(mean_mu_alpha[0])
mu11 = math.exp(par_gamma) * mu10 # mu with gamma
mu20 = errors[:, 1] * math.exp(mean_mu_alpha[1])
mu21 = math.exp(par_gamma) * mu20
alpha1 = errors[:, 2] * math.exp(mean_mu_alpha[2])
alpha2 = errors[:, 3] * math.exp(mean_mu_alpha[3])
j_is_larger = (mu10 > mu20)
# useneither1 = (mu10 < 1/168)
threshold2 = (1 + mu10 * alpha1) / (168 + alpha1)
# useboth1 = (mu21 >= threshold2)
j_is_smaller = ~j_is_larger
# useneither2 = (mu20 < 1/168)
threshold3 = (1 + mu20 * alpha2) / (168 + alpha2)
# useboth2 = (mu11 >= threshold3)
case1 = j_is_larger * (mu10 < 1 / 168)
case2 = j_is_larger * (mu21 >= threshold2)
# case3 = j_is_larger * (1 - (useneither1 | useboth1))
case3 = j_is_larger ^ (case1 | case2)
case4 = j_is_smaller * (mu20 < 1 / 168)
case5 = j_is_smaller * (mu11 >= threshold3)
# case6 = j_is_smaller * (1 - (useneither2 | useboth2))
case6 = j_is_smaller ^ (case4 | case5)
t0 = ne.evaluate(
"case1*168+case2 * (168 + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) +case3 / threshold2 +case4 * 168 +case5 * (168 + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case6 / threshold3")
# for some cases, t1 would be 0 anyway, so they are omitted here.
t1 = ne.evaluate(
"case2 * (t0 * alpha1 * mu11 - alpha1) +case3 * (t0 * alpha1 * mu10 - alpha1) +case5 * (t0 * alpha1 * mu11 - alpha1)")
# t2 = (j_is_larger*useboth1*(t0*alpha2*mu21- alpha2) +
# j_is_smaller*useboth2*(t0*alpha2*mu21- alpha2) +
# j_is_smaller*(1- (useneither2|useboth2))*(t0*alpha2*mu20 - alpha2)
# )
t2 = 168 - t0 - t1
p12 = case2 + case5
p1 = case3 + p12
p2 = case6 + p12
return t1.sum()/10000, t2.sum()/10000, p1.sum()/10000, p2.sum()/10000
timeit mktout([-6,-6,-1,-1], errors, -0.7)
在裝有2.2GHz i7的舊Mac上。 該功能的運行時間約為200µs。
更新:
根據@CodeSurgeon和@ GZ0的建議和代碼,我決定使用以下代碼
def mktout_full(double[:] mean_mu_alpha, double[:, ::1] errors, double par_gamma):
cdef:
size_t i, n
double[4] exp
double exp_par_gamma
double mu10, mu11, mu20, mu21
double alpha1, alpha2
double threshold2, threshold3
double t0, t1, t2
double t1_sum, t2_sum, p1_sum, p2_sum, p12_sum
double c
#compute the exp outside of the loop
n = errors.shape[0]
exp[0] = cmath.exp(<double>mean_mu_alpha[0])
exp[1] = cmath.exp(<double>mean_mu_alpha[1])
exp[2] = cmath.exp(<double>mean_mu_alpha[2])
exp[3] = cmath.exp(<double>mean_mu_alpha[3])
exp_par_gamma = cmath.exp(par_gamma)
c = 168.0
t1_sum = 0.0
t2_sum = 0.0
p1_sum = 0.0
p2_sum = 0.0
p12_sum = 0.0
for i in range(n) :
mu10 = errors[i, 0] * exp[0]
# mu11 = exp_par_gamma * mu10
mu20 = errors[i, 1] * exp[1]
# mu21 = exp_par_gamma * mu20
# alpha1 = errors[i, 2] * exp[2]
# alpha2 = errors[i, 3] * exp[3]
# j_is_larger = mu10 > mu20
# j_is_smaller = not j_is_larger
if (mu10 >= mu20):
if (mu10 >= 1/c) :
mu21 = exp_par_gamma * mu20
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
threshold2 = (1 + mu10 * alpha1) / (c + alpha1)
if (mu21 >= threshold2):
mu11 = exp_par_gamma * mu10
t0 = (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2)
t1 = (t0 * alpha1 * mu11 - alpha1)
t1_sum += t1
t2_sum += c - t0 - t1
p1_sum += 1
p2_sum += 1
p12_sum += 1
else :
t1_sum += ((1/threshold2) * alpha1 * mu10 - alpha1)
p1_sum += 1
else :
if (mu20 >= 1/c) :
mu11 = exp_par_gamma * mu10
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
threshold3 = (1 + mu20 * alpha2) / (c + alpha2)
if (mu11 >= threshold3):
mu21 = exp_par_gamma * mu20
t0 = (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2)
t1 = (t0 * alpha1 * mu11 - alpha1)
t1_sum += t1
t2_sum += c - t0 - t1
p1_sum += 1
p2_sum += 1
p12_sum += 1
else :
t2_sum += ((1/threshold3) * alpha2 * mu20 - alpha2)
p2_sum += 1
return t1_sum/n, t2_sum/n, p1_sum/n, p2_sum/n, p12_sum/n
我的原始代碼運行時間為650µs。 mktout
和mktout_if
運行時間約為220µs和120µs。 上面的mktout_full
運行大約68 µs。 我在做mktout_full
是優化中的if-else邏輯mktout_if
。 或許令人驚訝的,並行out_loop
通過codesurgeon結合的if-else在邏輯mktout_full
是較慢的方式(121ms)。
簡要地看一下代碼並嘗試對其進行cythonize處理,僅將ndarray類型添加到所有參數和變量中並不會顯着改變性能。 如果在這個緊迫的內部循環中為該功能節省微秒,我將考慮進行以下修改:
numpy
或numexpr
。 雖然這些操作本身都是有效的,但它們都會增加一些python開銷 (如果您查看cython可以生成的帶注釋的.html
文件,則可以看到)。 mktout
成為cdef
函數來節省一些時間。 Python函數調用的開銷很大。 math
模塊中的任何函數。 您可以from libc cimport math as cmath
替換from libc cimport math as cmath
而改用cmath.exp
。 mktout
函數接收一個python列表mean_mu_alpha
。 您可以考慮使用cdef class
對象替換此參數,然后鍵入此參數。 如果您選擇使mktout
成為cdef
函數,則它可以只是一個struct或double *
數組。 無論哪種方式,索引到python列表(可以包含需要取消裝箱成對應的c型的任意python對象)的索引都將很慢。 mktout
調用,您正在為許多數組分配內存(對於每個mu
, alpha
, threshold
, case
, t-
和p-
數組)。 然后,您可以在函數末尾(通過python的gc)釋放所有這些內存,只是可能在下一次調用時再次使用所有這些空間。 如果可以更改mktout
的簽名, mktout
可以將所有這些數組作為參數傳遞,以便可以在函數調用之間重用和覆蓋內存 。 對此情況更好的另一種選擇是遍歷數組並一次執行一個元素的所有計算。 prange
函數對代碼進行多線程prange
。 在完成所有上述更改之后,我會解決這個問題,並且我會在mktout
函數本身之外進行多線程處理。 也就是說,您將對mktout進行多線程調用,而不是對mktout
本身進行多線程mktout
。 進行上述更改將需要大量工作,並且您可能必須自己重新實現numpy和numexpr提供的許多功能,以避免與每次相關的python開銷。 如果有任何不清楚的地方,請告訴我。
更新#1:實施#1,#3和#5點后,我得到了11倍的提速 。 這是這段代碼的樣子。 我相信,如果您放棄def
函數, list mean_mu_alpha
輸入和tuple
輸出,它的運行速度將會更快。 注意:與原始代碼相比,我在最后一位小數位得到的結果略有不同,這可能是由於某些我不理解的浮點規則所致。
from libc cimport math as cmath
from libc.stdint cimport *
from libc.stdlib cimport *
def mktout(list mean_mu_alpha, double[:, ::1] errors, double par_gamma):
cdef:
size_t i, n
double[4] exp
double exp_par_gamma
double mu10, mu11, mu20, mu21
double alpha1, alpha2
bint j_is_larger, j_is_smaller
double threshold2, threshold3
bint case1, case2, case3, case4, case5, case6
double t0, t1, t2
double p12, p1, p2
double t1_sum, t2_sum, p1_sum, p2_sum
double c
#compute the exp outside of the loop
n = errors.shape[0]
exp[0] = cmath.exp(<double>mean_mu_alpha[0])
exp[1] = cmath.exp(<double>mean_mu_alpha[1])
exp[2] = cmath.exp(<double>mean_mu_alpha[2])
exp[3] = cmath.exp(<double>mean_mu_alpha[3])
exp_par_gamma = cmath.exp(par_gamma)
c = 168.0
t1_sum = 0.0
t2_sum = 0.0
p1_sum = 0.0
p2_sum = 0.0
for i in range(n):
mu10 = errors[i, 0] * exp[0]
mu11 = exp_par_gamma * mu10
mu20 = errors[i, 1] * exp[1]
mu21 = exp_par_gamma * mu20
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
j_is_larger = mu10 > mu20
j_is_smaller = not j_is_larger
threshold2 = (1 + mu10 * alpha1) / (c + alpha1)
threshold3 = (1 + mu20 * alpha2) / (c + alpha2)
case1 = j_is_larger * (mu10 < 1 / c)
case2 = j_is_larger * (mu21 >= threshold2)
case3 = j_is_larger ^ (case1 | case2)
case4 = j_is_smaller * (mu20 < 1 / c)
case5 = j_is_smaller * (mu11 >= threshold3)
case6 = j_is_smaller ^ (case4 | case5)
t0 = case1*c+case2 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) +case3 / threshold2 +case4 * c +case5 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case6 / threshold3
t1 = case2 * (t0 * alpha1 * mu11 - alpha1) +case3 * (t0 * alpha1 * mu10 - alpha1) +case5 * (t0 * alpha1 * mu11 - alpha1)
t2 = c - t0 - t1
p12 = case2 + case5
p1 = case3 + p12
p2 = case6 + p12
t1_sum += t1
t2_sum += t2
p1_sum += p1
p2_sum += p2
return t1_sum/n, t2_sum/n, p1_sum/n, p2_sum/n
更新#2:實現了cdef
(#2),python對象消除(#4)和多線程(#6)想法。 #2和#4的好處微不足道,但對於#6卻是必需的,因為無法在OpenMP prange
循環中訪問GIL。 有了多線程,您的四核筆記本電腦將獲得2.5倍的額外速度提速,相當於比原始速度快27.5倍的代碼。 我的outer_loop
函數雖然不能完全准確,因為它會outer_loop
重新計算相同的結果,但是對於一個測試用例來說應該足夠了。 完整的代碼如下:
from libc cimport math as cmath
from libc.stdint cimport *
from libc.stdlib cimport *
from cython.parallel cimport prange
def mktout(list mean_mu_alpha, double[:, ::1] errors, double par_gamma):
cdef:
size_t i, n
double[4] exp
double exp_par_gamma
double mu10, mu11, mu20, mu21
double alpha1, alpha2
bint j_is_larger, j_is_smaller
double threshold2, threshold3
bint case1, case2, case3, case4, case5, case6
double t0, t1, t2
double p12, p1, p2
double t1_sum, t2_sum, p1_sum, p2_sum
double c
#compute the exp outside of the loop
n = errors.shape[0]
exp[0] = cmath.exp(<double>mean_mu_alpha[0])
exp[1] = cmath.exp(<double>mean_mu_alpha[1])
exp[2] = cmath.exp(<double>mean_mu_alpha[2])
exp[3] = cmath.exp(<double>mean_mu_alpha[3])
exp_par_gamma = cmath.exp(par_gamma)
c = 168.0
t1_sum = 0.0
t2_sum = 0.0
p1_sum = 0.0
p2_sum = 0.0
for i in range(n):
mu10 = errors[i, 0] * exp[0]
mu11 = exp_par_gamma * mu10
mu20 = errors[i, 1] * exp[1]
mu21 = exp_par_gamma * mu20
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
j_is_larger = mu10 > mu20
j_is_smaller = not j_is_larger
threshold2 = (1 + mu10 * alpha1) / (c + alpha1)
threshold3 = (1 + mu20 * alpha2) / (c + alpha2)
case1 = j_is_larger * (mu10 < 1 / c)
case2 = j_is_larger * (mu21 >= threshold2)
case3 = j_is_larger ^ (case1 | case2)
case4 = j_is_smaller * (mu20 < 1 / c)
case5 = j_is_smaller * (mu11 >= threshold3)
case6 = j_is_smaller ^ (case4 | case5)
t0 = case1*c+case2 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) +case3 / threshold2 +case4 * c +case5 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case6 / threshold3
t1 = case2 * (t0 * alpha1 * mu11 - alpha1) +case3 * (t0 * alpha1 * mu10 - alpha1) +case5 * (t0 * alpha1 * mu11 - alpha1)
t2 = c - t0 - t1
p12 = case2 + case5
p1 = case3 + p12
p2 = case6 + p12
t1_sum += t1
t2_sum += t2
p1_sum += p1
p2_sum += p2
return t1_sum/n, t2_sum/n, p1_sum/n, p2_sum/n
ctypedef struct Vec4:
double a
double b
double c
double d
def outer_loop(list mean_mu_alpha, double[:, ::1] errors, double par_gamma, size_t n):
cdef:
size_t i
Vec4 mean_vec
Vec4 out
mean_vec.a = <double>(mean_mu_alpha[0])
mean_vec.b = <double>(mean_mu_alpha[1])
mean_vec.c = <double>(mean_mu_alpha[2])
mean_vec.d = <double>(mean_mu_alpha[3])
with nogil:
for i in prange(n):
cy_mktout(&out, &mean_vec, errors, par_gamma)
return out
cdef void cy_mktout(Vec4 *out, Vec4 *mean_mu_alpha, double[:, ::1] errors, double par_gamma) nogil:
cdef:
size_t i, n
double[4] exp
double exp_par_gamma
double mu10, mu11, mu20, mu21
double alpha1, alpha2
bint j_is_larger, j_is_smaller
double threshold2, threshold3
bint case1, case2, case3, case4, case5, case6
double t0, t1, t2
double p12, p1, p2
double t1_sum, t2_sum, p1_sum, p2_sum
double c
#compute the exp outside of the loop
n = errors.shape[0]
exp[0] = cmath.exp(mean_mu_alpha.a)
exp[1] = cmath.exp(mean_mu_alpha.b)
exp[2] = cmath.exp(mean_mu_alpha.c)
exp[3] = cmath.exp(mean_mu_alpha.d)
exp_par_gamma = cmath.exp(par_gamma)
c = 168.0
t1_sum = 0.0
t2_sum = 0.0
p1_sum = 0.0
p2_sum = 0.0
for i in range(n):
mu10 = errors[i, 0] * exp[0]
mu11 = exp_par_gamma * mu10
mu20 = errors[i, 1] * exp[1]
mu21 = exp_par_gamma * mu20
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
j_is_larger = mu10 > mu20
j_is_smaller = not j_is_larger
threshold2 = (1 + mu10 * alpha1) / (c + alpha1)
threshold3 = (1 + mu20 * alpha2) / (c + alpha2)
case1 = j_is_larger * (mu10 < 1 / c)
case2 = j_is_larger * (mu21 >= threshold2)
case3 = j_is_larger ^ (case1 | case2)
case4 = j_is_smaller * (mu20 < 1 / c)
case5 = j_is_smaller * (mu11 >= threshold3)
case6 = j_is_smaller ^ (case4 | case5)
t0 = case1*c+case2 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) +case3 / threshold2 +case4 * c +case5 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case6 / threshold3
t1 = case2 * (t0 * alpha1 * mu11 - alpha1) +case3 * (t0 * alpha1 * mu10 - alpha1) +case5 * (t0 * alpha1 * mu11 - alpha1)
t2 = c - t0 - t1
p12 = case2 + case5
p1 = case3 + p12
p2 = case6 + p12
t1_sum += t1
t2_sum += t2
p1_sum += p1
p2_sum += p2
out.a = t1_sum/n
out.b = t2_sum/n
out.c = p1_sum/n
out.d = p2_sum/n
我使用的setup.py
文件如下(具有所有優化和OpenMP標志):
from distutils.core import setup
from Cython.Build import cythonize
from distutils.core import Extension
import numpy as np
import os
import shutil
import platform
libraries = {
"Linux": [],
"Windows": [],
}
language = "c"
args = ["-w", "-std=c11", "-O3", "-ffast-math", "-march=native", "-fopenmp"]
link_args = ["-std=c11", "-fopenmp"]
annotate = True
directives = {
"binding": True,
"boundscheck": False,
"wraparound": False,
"initializedcheck": False,
"cdivision": True,
"nonecheck": False,
"language_level": "3",
#"c_string_type": "unicode",
#"c_string_encoding": "utf-8",
}
if __name__ == "__main__":
system = platform.system()
libs = libraries[system]
extensions = []
ext_modules = []
#create extensions
for path, dirs, file_names in os.walk("."):
for file_name in file_names:
if file_name.endswith("pyx"):
ext_path = "{0}/{1}".format(path, file_name)
ext_name = ext_path \
.replace("./", "") \
.replace("/", ".") \
.replace(".pyx", "")
ext = Extension(
name=ext_name,
sources=[ext_path],
libraries=libs,
language=language,
extra_compile_args=args,
extra_link_args=link_args,
include_dirs = [np.get_include()],
)
extensions.append(ext)
#setup all extensions
ext_modules = cythonize(
extensions,
annotate=annotate,
compiler_directives=directives,
)
setup(ext_modules=ext_modules)
"""
#immediately remove build directory
build_dir = "./build"
if os.path.exists(build_dir):
shutil.rmtree(build_dir)
"""
更新#3:根據@ GZ0的建議,在許多情況下,代碼中的表達式將計算為零並被浪費地計算。 我嘗試使用以下代碼消除這些區域(在修復case3
和case6
語句之后):
cdef void cy_mktout_if(Vec4 *out, Vec4 *mean_mu_alpha, double[:, ::1] errors, double par_gamma) nogil:
cdef:
size_t i, n
double[4] exp
double exp_par_gamma
double mu10, mu11, mu20, mu21
double alpha1, alpha2
bint j_is_larger
double threshold2, threshold3
bint case1, case2, case3, case4, case5, case6
double t0, t1, t2
double p12, p1, p2
double t1_sum, t2_sum, p1_sum, p2_sum
double c
#compute the exp outside of the loop
n = errors.shape[0]
exp[0] = cmath.exp(mean_mu_alpha.a)
exp[1] = cmath.exp(mean_mu_alpha.b)
exp[2] = cmath.exp(mean_mu_alpha.c)
exp[3] = cmath.exp(mean_mu_alpha.d)
exp_par_gamma = cmath.exp(par_gamma)
c = 168.0
t1_sum = 0.0
t2_sum = 0.0
p1_sum = 0.0
p2_sum = 0.0
for i in range(n):
mu10 = errors[i, 0] * exp[0]
mu11 = exp_par_gamma * mu10
mu20 = errors[i, 1] * exp[1]
mu21 = exp_par_gamma * mu20
alpha1 = errors[i, 2] * exp[2]
alpha2 = errors[i, 3] * exp[3]
j_is_larger = mu10 > mu20
j_is_smaller = not j_is_larger
threshold2 = (1 + mu10 * alpha1) / (c + alpha1)
threshold3 = (1 + mu20 * alpha2) / (c + alpha2)
if j_is_larger:
case1 = mu10 < 1 / c
case2 = mu21 >= threshold2
case3 = not (case1 | case2)
t0 = case1*c + case2 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case3 / threshold2
t1 = case2 * (t0 * alpha1 * mu11 - alpha1) + case3 * (t0 * alpha1 * mu10 - alpha1)
t2 = c - t0 - t1
t1_sum += t1
t2_sum += t2
p1_sum += case2 + case3
p2_sum += case2
else:
case4 = mu20 < 1 / c
case5 = mu11 >= threshold3
case6 = not (case4 | case5)
t0 = case4 * c + case5 * (c + alpha1 + alpha2) / (1 + mu11 * alpha1 + mu21 * alpha2) + case6 / threshold3
t1 = case5 * (t0 * alpha1 * mu11 - alpha1)
t2 = c - t0 - t1
t1_sum += t1
t2_sum += t2
p1_sum += case5
p2_sum += case5 + case6
out.a = t1_sum/n
out.b = t2_sum/n
out.c = p1_sum/n
out.d = p2_sum/n
對於10000次迭代,當前代碼執行如下:
outer_loop: 0.5116949229995953 seconds
outer_loop_if: 0.617649456995423 seconds
mktout: 0.9221872320049442 seconds
mktout_if: 1.430276553001022 seconds
python: 10.116664300003322 seconds
我認為導致結果的條件和分支錯誤預測的代價出乎意料的是使該函數變慢,但是我希望能為您清除此問題提供任何幫助。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.