
Is the backend source code for numpy.multiply() set up for multiprocessing/multithreading?

I cannot find the source code, or any documentation, that describes how the np.multiply() algorithm is written.

I could not find it in the manual:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html

https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.multiply.html

Does anyone know whether the backend source code for np.multiply() is set up for multiprocessing/multithreading? The reason I ask is that I am writing my own code to compute the Kronecker product using parallel programming (joblib.Parallel), but when I time it, np.kron() (which uses np.multiply()) still runs faster than my code with parallel programming.
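(A minimal inspection sketch of what is visible from the Python side: np.multiply is a compiled ufunc, so only np.kron's Python-level wrapper can be read this way; the try/except fallback is just a guard for NumPy versions that wrap np.kron in a C-level dispatcher.)

import inspect
import numpy as np

# np.multiply is a compiled ufunc: its element-wise inner loops are generated
# C code inside NumPy itself, so there is no Python source to open for them.
print(type(np.multiply))             # <class 'numpy.ufunc'>

# np.kron is Python-level glue that ultimately calls that vectorised multiply;
# on many NumPy versions its Python source can be printed directly (newer
# releases wrap it in a C-level dispatcher, hence the defensive fallback).
try:
    print(inspect.getsource(np.kron))
except (TypeError, OSError):
    print("np.kron is wrapped on this NumPy version -- "
          "read numpy/lib in the NumPy source tree on GitHub instead")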

Edit:

Here is the code I wrote for my Kronecker product:

from itertools import product
from joblib import Parallel, delayed
from functools import reduce
from operator import mul
import numpy as np

lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr = np.array(lst)
n = 2

def test1(arr, n):
    # flatten the input and build every length-n combination of its entries
    flat = np.ravel(arr).tolist()
    gen = (list(a) for a in product(flat, repeat=n))

    # multiply each combination in a separate joblib task
    results = Parallel(n_jobs=-1)(delayed(reduce)(mul, x) for x in gen)

    nrows = arr.shape[0]
    ncols = arr.shape[1]

    # reshape the flat products into blocks and stitch them into the Kronecker layout
    arr_multi_dim = np.array(results).reshape((nrows, ncols) * n)
    arr_final = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)

    return arr_final
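A quick, machine-dependent timing sketch for the comparison described above, reusing arr, n and test1 as defined in the snippet; the two results should be identical, only the time differs:

import timeit
import numpy as np

# Sanity check: the parallel version reproduces NumPy's built-in Kronecker product.
print(np.array_equal(test1(arr, n), np.kron(arr, arr)))   # True

# Time both on the same 3x3 input (absolute numbers vary per machine).
t_joblib = timeit.timeit(lambda: test1(arr, n), number=10)
t_numpy  = timeit.timeit(lambda: np.kron(arr, arr), number=10)
print(f"joblib version: {t_joblib:.4f} s / 10 runs")
print(f"np.kron       : {t_numpy:.4f} s / 10 runs")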

The efforts here still end up spending more on add-on overheads (the process-instantiation costs plus the data-distribution costs of passing the parameters to the remote computing steps and of returning and merging the results) than they can ever win back, which goes in the very opposite direction from the numpy approach.

Efficiency is on numpy's side: thanks to a carefully crafted, GIL-free core design, it can also use vectorised processing (i.e. computing more things in a single CPU instruction step, thanks to the known use of data alignment, ILP and AVX-like processor instructions).
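A tiny, machine-dependent illustration of what that vectorisation buys (array sizes are illustrative only): one np.multiply call over a million elements versus the same arithmetic done element by element in the interpreter:

import timeit
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# One ufunc call: a single C loop over contiguous, aligned data, free to use SIMD/AVX.
t_vectorised = timeit.timeit(lambda: np.multiply(a, b), number=100)

# The same products computed one element at a time at Python level.
t_python     = timeit.timeit(lambda: [x * y for x, y in zip(a, b)], number=1)

print(f"np.multiply : {t_vectorised / 100 * 1e3:.3f} ms per call")
print(f"python loop : {t_python * 1e3:.3f} ms per call")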

Given these strong advantages, plus numpy-smart in-place / zero-copy processing (re-using data already sitting in the L1/L2/L3 caches is many orders of magnitude faster than any kind of attempt to set up and operate a set of distributed processes, which has to pay the extra costs of RAM-copy + SER/DES + IPC-marshalling + SER/DES + RAM-copy on the way out, then the computation, then RAM-copy + SER/DES + IPC-marshalling + SER/DES + RAM-copy on the way back), smart numpy-based code will beat any other attempt to do the same thing in almost all cases.
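That same recipe is what makes a pure-numpy Kronecker product so hard to beat: one broadcasted, vectorised multiply plus a reshape of an already-contiguous result. A minimal sketch of essentially the same idea np.kron is built around (kron_via_broadcast is an illustrative name, not a NumPy API):

import numpy as np

def kron_via_broadcast(a, b):
    # All pairwise products in one vectorised multiply: the broadcasted shapes
    # (ra, 1, ca, 1) * (1, rb, 1, cb) produce a (ra, rb, ca, cb) block array.
    ra, ca = a.shape
    rb, cb = b.shape
    blocks = a[:, None, :, None] * b[None, :, None, :]
    # The product is C-contiguous, so this reshape is a zero-copy view.
    return blocks.reshape(ra * rb, ca * cb)

arr = np.arange(1, 10).reshape(3, 3)
print(np.array_equal(kron_via_broadcast(arr, arr), np.kron(arr, arr)))   # True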


Never forget the add-on costs, and learn and know about the adverse effects of scaling:

             0.1 ns - NOP
             0.3 ns - XOR, ADD, SUB
             0.5 ns - CPU L1 dCACHE reference           (1st introduced in late 80-ies )
             0.9 ns - JMP SHORT
             1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
           3~4   ns - CPU L2  CACHE reference           (2020/Q1)
             5   ns - CPU L1 iCACHE Branch mispredict
             7   ns - CPU L2  CACHE reference
            10   ns - DIV
            19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
            71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
           100   ns - MUTEX lock/unlock
           100   ns - own DDR MEMORY reference
           135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
           202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
           325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
        10,000   ns - Compress 1K bytes with a Zippy PROCESS
        20,000   ns - Send     2K bytes over 1 Gbps  NETWORK
       250,000   ns - Read   1 MB sequentially from  MEMORY
       500,000   ns - Round trip within a same DataCenter
?~~~ 2,500,000   ns - Read  10 MB sequentially from  MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
    10,000,000   ns - DISK seek
    10,000,000   ns - Read   1 MB sequentially from  NETWORK
?~~ 25,000,000   ns - Read 100 MB sequentially from  MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
    30,000,000   ns - Read 1 MB sequentially from a  DISK
?~~ 36,000,000   ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
   150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|
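To get a feel for the SER/DES line above on your own machine, a rough sketch that round-trips a ~10 MB NumPy array through pickle (the payload size and timings are illustrative only; a real worker also pays the IPC transport and process-instantiation costs on top of this):

import pickle
import timeit
import numpy as np

payload = np.random.rand(1_250_000)        # ~10 MB of float64 data

# One SER + DES round trip, i.e. roughly the per-parameter price paid before
# a spawned worker process can do any useful arithmetic at all.
t = timeit.timeit(lambda: pickle.loads(pickle.dumps(payload)), number=10)
print(f"pickle round-trip of ~10 MB: {t / 10 * 1e3:.2f} ms per call")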
