How to speed up the 'for' loop in a Python function?
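I have a function var. I would like to know the best possible way to run the for loop inside this function quickly (for multiple coordinates: xs and ys) through multiprocessing/parallel processing, making use of all the processors, cores, and RAM the system has.
Is it possible to use the Dask module?
The pysheds documentation can be found here.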
import numpy as np
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

for (x, y) in zip(xs, ys):
    grid = Grid.from_raster('E:/data.tif', data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch', recursionlimit=1500, xytype='label')
    ....
    ....
    results
You did not post a link to your image1.tif file, so the example code below uses pysheds/data/dem.tif from https://github.com/mdbartos/pysheds
The basic idea is to split the input parameters, xs and ys in your case, into subsets, and then give each CPU a different subset to work on.
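The strided split and the matching reassembly can be tried on their own, without pysheds or dask. This is only a minimal sketch: split_strided and reassemble are hypothetical helper names, and doubling each value stands in for the real per-CPU work.

```python
def split_strided(values, n_parts):
    """Return n_parts strided subsets: part k holds values[k], values[k + n_parts], ..."""
    return [values[k::n_parts] for k in range(n_parts)]


def reassemble(parts, total):
    """Interleave per-part results back into the original input order."""
    out = [None] * total
    n_parts = len(parts)
    for k, part in enumerate(parts):
        out[k::n_parts] = part
    return out


xs = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
x_parts = split_strided(xs, 3)
# x_parts is [(10, 40, 70, 100), (20, 50, 80), (30, 60, 90)]

# Stand-in for the work each CPU would do on its own subset.
doubled_parts = [[2 * x for x in part] for part in x_parts]
doubled = reassemble(doubled_parts, len(xs))
print(doubled)
```

A strided split keeps the subsets within one element of each other in size, and the same slice expression puts the results back in input order.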
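main() computes the solution twice, once sequentially and once in parallel, and then compares the two. The parallel solution has some inefficiency, because the image file is read by every CPU, so there is room for improvement (i.e., read the image file outside the parallel section and then hand the resulting grid object to each instance).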
import numpy as np
from pysheds.grid import Grid
from dask.distributed import Client
from dask import delayed, compute

xs = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
# ys has one more value than xs; zip() pairs only the first 10, so 125 is never used
ys = 25, 35, 45, 55, 65, 75, 85, 95, 105, 115, 125

def var(image_file, x_in, y_in):
    grid = Grid.from_raster(image_file, data_name='map')
    variable_avg = []
    for (x, y) in zip(x_in, y_in):
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=np.nan)
        variable_avg.append(np.array(variable).mean())
    return variable_avg

def var_parallel(n_cpu, image_file, x_in, y_in):
    tasks = []
    for cpu in range(n_cpu):
        # strided subset of the arguments for this CPU
        x_sub = x_in[cpu::n_cpu]  # eg, cpu = 0: x_sub = (10, 40, 70, 100)
        y_sub = y_in[cpu::n_cpu]
        tasks.append(delayed(var)(image_file, x_sub, y_sub))
    ans = compute(tasks)  # a 1-tuple holding the list of per-CPU results
    # reassemble solution in the right order
    par_avg = [None] * len(x_in)
    for cpu in range(n_cpu):
        par_avg[cpu::n_cpu] = ans[0][cpu]
    print('AVG (parallel) =', par_avg)
    return par_avg

def main():
    image_file = 'pysheds/data/dem.tif'
    # sequential solution:
    seq_avg = var(image_file, xs, ys)
    print('AVG (sequential)=', seq_avg)
    # parallel solution:
    n_cpu = 3
    dask_client = Client(n_workers=n_cpu)
    par_avg = var_parallel(n_cpu, image_file, xs, ys)
    dask_client.shutdown()
    print('max error=',
          max([abs(seq_avg[i] - par_avg[i]) for i in range(len(seq_avg))]))

if __name__ == '__main__':
    main()
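
One way to sketch the improvement mentioned above (reading the file once instead of once per task) is to wrap the loader in delayed and pass the same delayed object to every task, so that it becomes a single shared node in the task graph. In the sketch below, load_raster and subset_mean are hypothetical stand-ins for Grid.from_raster and the catchment loop, so it runs without pysheds, and it uses the local threaded scheduler rather than a distributed Client. Whether a real pysheds Grid can be shipped efficiently to distributed workers is an assumption that would need checking.

```python
from dask import delayed, compute

load_calls = []  # records each loader call, to show how often the "file" is read


def load_raster(path):
    # Hypothetical stand-in for Grid.from_raster(path, data_name='map')
    load_calls.append(path)
    return {"path": path, "offset": 1.0}


def subset_mean(raster, x_in, y_in):
    # Hypothetical stand-in for the catchment loop in var()
    return [raster["offset"] + (x + y) / 2 for x, y in zip(x_in, y_in)]


xs = 10, 20, 30, 40, 50, 60
ys = 25, 35, 45, 55, 65, 75
n_cpu = 3

# Created once and reused below, so all tasks depend on the same graph node.
raster = delayed(load_raster)("dem.tif")
tasks = [delayed(subset_mean)(raster, xs[k::n_cpu], ys[k::n_cpu])
         for k in range(n_cpu)]

# One compute() call for all tasks, so the shared loader result is reused.
parts = compute(*tasks, scheduler="threads")
print("loader calls:", len(load_calls))
print(parts)
```

Calling compute separately for each task would run the loader again each time, so the tasks are submitted together.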
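I have tried to provide reproducible code using dask below. You can add the main processing part of pysheds, or any other function from it, to iterate over the parameters in parallel and therefore faster.
The documentation for the dask module can be found here.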
import dask
from dask import delayed, compute
from dask.distributed import Client, progress
from pysheds.grid import Grid

# Choose the number of workers and threads per worker to deploy for your task.
client = Client(threads_per_worker=2, n_workers=2)

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

# First, create a function that does the work for one pair of parameters.
def var(x, y):
    grid = Grid.from_raster('data.tif', data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch', recursionlimit=1500, xytype='label')
    ...
    ...
    return (result)

# Now call the function in a 'dask' way.
lazy_results = []
for (x, y) in zip(xs, ys):
    lazy_result = dask.delayed(var)(x, y)
    lazy_results.append(lazy_result)

# Final command to execute the function var(x, y) and get the results.
dask.compute(*lazy_results)
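
dask.compute returns a tuple with one result per delayed object, in the same order as they were passed in, so the results can be paired back with their coordinates. A minimal sketch of that step, where var_stub is a hypothetical stand-in for var(x, y) and the local threaded scheduler is used so that it runs without a Client:

```python
import dask

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306


def var_stub(x, y):
    # Hypothetical stand-in for the pysheds work done in var(x, y)
    return round(x + y, 4)


lazy_results = [dask.delayed(var_stub)(x, y) for x, y in zip(xs, ys)]

# The returned tuple follows the order of lazy_results.
results = dask.compute(*lazy_results, scheduler="threads")

# Map each (x, y) coordinate to its result.
by_coord = dict(zip(zip(xs, ys), results))
for coord, value in by_coord.items():
    print(coord, value)
```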