I am trying to find a faster way to run numpy/sklearn over lists of data. Some books suggest using processes rather than threads for computation-heavy jobs, but in my tests the threads run faster than the processes. Why is that, and which should I choose?
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 2 10:20:19 2019
@author: Simon
"""
import time
import numpy as np
from sklearn import linear_model
# Keep exactly one of these two imports active to choose the backend:
# from concurrent.futures import ProcessPoolExecutor as Pool
from concurrent.futures import ThreadPoolExecutor as Pool

# Build a 1000x1000 grid and a noisy linear target zz = xx + 3.5*yy + noise.
xx, yy = np.meshgrid(np.linspace(0, 10, 1000), np.linspace(10, 100, 1000))
zz = 1.0 * xx + 3.5 * yy + np.random.randint(0, 100, (1000, 1000))
X, Z = np.column_stack((xx.flatten(), yy.flatten())), zz.flatten()

regr = linear_model.LinearRegression()  # note: shared by all workers

def regwork(t):
    # sklearn job: fit a linear regression on one (X, Z) pair.
    X, Z = t
    regr.fit(X, Z)
    a, b = regr.coef_, regr.intercept_
    return a

def numpywork(t):
    # numpy job: a pure array computation on one (X, Z) pair.
    X, Z = t
    r = np.sum(X, axis=1) + np.log(Z)
    return np.sum(r)

if __name__ == "__main__":
    r = regwork((X, Z))  # warm-up call (was the undefined name `regx`)
    rlist = [[X, Z]] * 500  # 500 identical tasks
    start = time.perf_counter()  # time.clock() is deprecated
    pool = Pool(max_workers=2)
    results = pool.map(numpywork, rlist)  # swap in regwork to test sklearn
    for ret in results:
        print(ret)
    print(time.perf_counter() - start)
Run on Windows 7 with a Core i5-4700 (4 physical cores) and Python 3.6. Here is the output:
Backend     | Worker job | Processes shown in Task Manager | CPU load while working | Time cost
------------|------------|---------------------------------|------------------------|----------
2 threads   | numpy      | 1 process                       | 100%                   | 9 s
2 threads   | sklearn    | 1 process                       | 100%                   | 35 s
2 processes | numpy      | 3 processes                     | 100%                   | 36 s
2 processes | sklearn    | 3 processes                     | 100%                   | 77 s
Why do the processes cost more time? Is there a better way to lower the time cost and make full use of the multi-core machine?
OK, I have figured it out. For modules that release the GIL, such as numpy, the thread backend saves time by avoiding the cost of copying (pickling) the numpy arrays from the main process to the worker processes for every task.
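To see how large that copy cost is, you can time just the serialization step: ProcessPoolExecutor pickles the arguments of every submitted task, so with rlist = [[X, Z]] * 500 the two arrays are pickled 500 times. A minimal sketch of that measurement, assuming the same array shapes as above:

import pickle
import time
import numpy as np

# Same shapes as in the question: X is (1_000_000, 2), Z is (1_000_000,).
xx, yy = np.meshgrid(np.linspace(0, 10, 1000), np.linspace(10, 100, 1000))
X = np.column_stack((xx.flatten(), yy.flatten()))
Z = (1.0 * xx + 3.5 * yy + 1.0).flatten()

start = time.perf_counter()
for _ in range(500):  # one pickle per task, mirroring rlist = [[X, Z]] * 500
    pickle.dumps((X, Z))
print("serialization alone:", time.perf_counter() - start, "s")

That is roughly 24 MB per task, 500 times, before any useful work starts, and the workers pay the matching unpickling cost on their side.

If separate processes are still wanted (for example, for code that does not release the GIL), one way to cut that cost is to ship the arrays to each worker once instead of once per task. This is a sketch under two assumptions: it needs Python 3.7+, where ProcessPoolExecutor gained the initializer/initargs parameters (the question uses 3.6), and the helper names _shared, _init_worker, and numpywork_shared are mine:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

_shared = {}

def _init_worker(X, Z):
    # Runs once per worker process: the arrays cross the process
    # boundary here once, instead of once per task.
    _shared["X"], _shared["Z"] = X, Z

def numpywork_shared(_):
    # Same computation as numpywork, but reading the worker-local copies.
    X, Z = _shared["X"], _shared["Z"]
    return np.sum(np.sum(X, axis=1) + np.log(Z))

if __name__ == "__main__":
    xx, yy = np.meshgrid(np.linspace(0, 10, 1000), np.linspace(10, 100, 1000))
    X = np.column_stack((xx.flatten(), yy.flatten()))
    Z = (1.0 * xx + 3.5 * yy + np.random.randint(0, 100, (1000, 1000))).flatten()
    with ProcessPoolExecutor(max_workers=2,
                             initializer=_init_worker,
                             initargs=(X, Z)) as pool:
        results = list(pool.map(numpywork_shared, range(500)))
    print(sum(results))

With max_workers=2 the arrays are copied twice in total rather than 500 times; whether processes then beat the thread backend depends on how much of each task actually holds the GIL.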