

Is it possible to parallelize this program (Simpson's Rule) in Python?

I am new to the parallel-programming paradigm. I already have the algorithm working in its serial form, but I can't parallelize it.

I have asked around, and some people have told me that the way my program is written can't be parallelized.

Here is my serial code; any help or recommendations will be kindly appreciated.

import numpy as np  # import NumPy to use its vectorised array functions

if __name__ == "__main__":

  # the number of sub-intervals must be even
  N = 1_000_000_000  # int(input("Number of iterations/intervals (even number): "))

  # declare the function to integrate
  f = lambda x : x*0.5

  # integration limits
  a = 3   # int(input("Enter the lower limit: "))
  b = 10  # int(input("Enter the upper limit: "))

  # step size (delta x)
  dx = (b-a)/N

  # subdivision of the interval
  x = np.linspace(a,b,N+1)

  y = f(x)

  # Simpson-weighted sum over the sub-intervals
  resultado = dx/3 * np.sum(y[0:-1:2] + 4*y[1::2] + y[2::2])

  print("N = " + str(N))
  print("The value of the integral is: ")
  print(resultado)
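For reference, the serial result can be sanity-checked against the closed form: the integral of x/2 over [3, 10] is (b^2 - a^2)/4 = 22.75. A small-N re-run of the same kernel (a check added here, not part of the original question) already matches:

```python
import numpy as np

# same serial Simpson kernel as above, wrapped for a quick check
def simpson(f, a, b, N):
    """Composite Simpson's rule; N must be even."""
    dx = (b - a) / N
    y = f(np.linspace(a, b, N + 1))
    return dx / 3 * np.sum(y[0:-1:2] + 4 * y[1::2] + y[2::2])

f = lambda x: x * 0.5
approx = simpson(f, 3, 10, 1_000)
exact = (10**2 - 3**2) / 4        # closed form of the integral of x/2 over [3, 10]

print(approx, exact)              # both 22.75 (Simpson's rule is exact for linear f)
```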

Q : "Is it possible to parallelize this program (Simpson's Rule) in Python?"

A more important side of the same coin is not how to parallelise a program (or a part of it), but what add-on cost penalties will actually be paid for doing so. From the several options for how to parallelise, the best one should be selected. Here, the data/memory footprints are large: an adverse O( n )-scaling in the SpaceDOMAIN, with sizes above V * 8 * 1E9 [B] (the sum of the memory footprints of all objects used, including interim storage variables), which next, indirectly, inflates the O( n )-scaling in the TimeDOMAIN (duration), due to the memory-I/O volume and the bottlenecking of all available RAM-I/O channels. A kind of fine-grain parallelism, called vectorisation, therefore seems to fit best, as it adds almost zero overhead, yet helps reduce the RAM-I/O costs for an extremely low-complexity f(x)-lambda, whose memory pre-fetches of non-contiguous blocks (due to index-hopping) are almost impossible to latency-mask.
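To put that footprint claim into numbers (a back-of-the-envelope sketch, assuming default float64 arrays and ignoring allocator overhead): each length-(N+1) vector costs about 8 * N bytes, so at N = 1E9 the x and y arrays alone already exceed 16 GB:

```python
N = 1_000_000_000                      # the questioner's interval count
BYTES_PER_FLOAT64 = 8                  # numpy's default float dtype on most platforms

x_bytes = (N + 1) * BYTES_PER_FLOAT64  # the np.linspace(...) result
y_bytes = (N + 1) * BYTES_PER_FLOAT64  # y = f(x), a second array of equal size
total_gb = (x_bytes + y_bytes) / 1e9

print(f"~{total_gb:.1f} GB for x and y alone")  # ~16.0 GB
```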

There are at least two places where a code conversion will help, mostly on CPU micro-architectures whose native vector instructions can be harnessed by recent numpy versions for an indeed HPC-grade parallel execution of numpy-vectorised code.

A mandatory performance-tuning disclaimer:
Detailed profiling will show you, for each particular target code-execution platform { x86 | ARM | ... } and its actual UMA / NUMA memory-I/O and CPU-core cache hierarchy, where your code loses most of its time (and how successfully, or how poorly, the actual processor cache hierarchy masks the real costs of accessing vast footprints of RAM, with all the adverse effects of physical RAM memory-I/O costs, since 1E9-sized data does not fit into the CPU-core caches/registers).
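As a minimal first step towards such profiling (a sketch using only the standard library's time.perf_counter, not a full profiler run), one can time the serial kernel at growing N and watch how the run time grows:

```python
import time
import numpy as np

f = lambda x: x * 0.5
a, b = 3, 10

def simpson(N):
    """Serial composite Simpson's rule, as in the question (N must be even)."""
    dx = (b - a) / N
    y = f(np.linspace(a, b, N + 1))
    return dx / 3 * np.sum(y[0:-1:2] + 4 * y[1::2] + y[2::2])

timings = {}
for N in (10_000, 100_000, 1_000_000):   # modest sizes; N = 1E9 needs ~16 GB RAM
    t0 = time.perf_counter()
    resultado = simpson(N)
    timings[N] = time.perf_counter() - t0

print(resultado)                          # ~22.75 for f(x) = x/2 on [3, 10]
print(timings)
```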

a )
We may spend less time producing/storing interim objects by using this brave numpy-code smart-vectorised trick:

y = f( np.linspace( a,       # lo-bound
                    b,       # hi-bound
                    N + 1    #   steps
                    )        # no storage of a temporary, yet HUGE object x ~ 8[GB] in RAM
       )                     # +ask lambda to in-flight compute & create just y = f( x[:] )

If in doubt, feel free to read more about the costs of the various kinds of computing/storage-related access latencies.
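A quick check (small N; this snippet is illustrative, not from the original answer) confirms that feeding np.linspace() straight into the lambda changes nothing numerically: only the named interim array x disappears (the linspace temporary is still allocated inside the call, but can be freed as soon as f returns):

```python
import numpy as np

f = lambda x: x * 0.5
a, b, N = 3, 10, 1_000

# variant 1: materialise x first, then apply f (two named arrays alive at once)
x = np.linspace(a, b, N + 1)
y_stored = f(x)

# variant 2: feed linspace straight into f (no named x survives the call)
y_fused = f(np.linspace(a, b, N + 1))

print(np.array_equal(y_stored, y_fused))  # True
```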

b )
We may reduce at least some of the repetitive memory-access patterns that (as noted above, because ~1E9-sized arrays cannot stay in cache and have to be re-fetched from physical RAM) still feed the computation:

# proper handling of a non-even N is omitted for clarity of the vectorised form
#                  and left to the kind reader
#_________________________________________________________________ double steps
resultado = ( 4 * np.sum( f( np.linspace( a + dx,      # lo-bound
                                          b - dx,      # hi-bound
                                          N // 2       # odd-index nodes (weight 4)
                                          )
                             ) #--------- lambda on a smaller, RAM/cache-compact object
                          )  #----------- numpy controls the summation over contiguous RAM
            + 2 * np.sum( f( np.linspace( a + 2 * dx,  # lo-bound
                                          b - 2 * dx,  # hi-bound
                                          N // 2 - 1   # even interior nodes (weight 2)
                                          )
                             ) #--------- lambda on a smaller, RAM/cache-compact object
                          )  #----------- numpy controls the summation over contiguous RAM
            + f( a )
            + f( b )
              ) * dx / 3

While the mock-up code does not aspire to solve all corner cases, the core benefit comes from using RAM-contiguous memory layouts, which numpy.sum() processes very efficiently, and from avoiding replicated visits to memory areas that would otherwise be re-visited only because of imperatively dictated (non-contiguous) index-jumping ( numpy can optimise some of its own indexing so as to maximise memory-access patterns/cache hits, yet the "outer", hand-coded index-jumping is almost always beyond the reach of such smart, but hard-wired, numpy optimisation know-how (the less any silicon-based thinking or clairvoyance ;o) ).
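As a sanity check on the double-step idea (a small-N sketch; the linspace point counts are written here as the integers N // 2 odd-index nodes and N // 2 - 1 even interior nodes, so the mock-up actually runs), both formulations agree with each other and with the classic sliced form:

```python
import numpy as np

f = lambda x: x * 0.5
a, b, N = 3, 10, 1_000          # N must be even
dx = (b - a) / N

# classic sliced Simpson's rule
y = f(np.linspace(a, b, N + 1))
sliced = dx / 3 * np.sum(y[0:-1:2] + 4 * y[1::2] + y[2::2])

# double-step ("fused") form: odd-index nodes carry weight 4,
# even interior nodes carry weight 2, the two endpoints weight 1
fused = ( 4 * np.sum(f(np.linspace(a + dx,    b - dx,    N // 2)))
        + 2 * np.sum(f(np.linspace(a + 2*dx,  b - 2*dx,  N // 2 - 1)))
        + f(a) + f(b) ) * dx / 3

print(sliced, fused)
```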

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 licence. If you need to republish, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

 
Guangdong ICP filing no. 18138465  © 2020-2024 STACKOOM.COM