Is the backend source code for numpy.multiply() set up for multiprocessing/multithreading?

I cannot find the source code or any documentation describing how the np.multiply() algorithm is implemented. I could not find it in the manual:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html
https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.multiply.html

Does anyone know whether the backend source code of np.multiply() is set up for multiprocessing/multithreading? I ask because I am writing my own code to compute the Kronecker product using parallel programming (joblib.Parallel), but when I benchmark the run times, np.kron() (which uses np.multiply()) still runs faster than my parallelized code.
Edit:

Here is the code I wrote for my Kronecker product:
from itertools import product
from joblib import Parallel, delayed
from functools import reduce
from operator import mul
import numpy as np

lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr = np.array(lst)
n = 2

def test1(arr, n):
    flat = np.ravel(arr).tolist()
    gen = (list(a) for a in product(flat, repeat=n))
    results = Parallel(n_jobs=-1)(delayed(reduce)(mul, x) for x in gen)
    nrows = arr.shape[0]
    ncols = arr.shape[1]
    arr_multi_dim = np.array(results).reshape((nrows, ncols) * n)
    arr_final = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)
    return arr_final
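For reference, the same product-based construction can be checked against np.kron() without involving joblib at all; this serial sketch (a hypothetical helper name, not from the original post) makes it easy to confirm the reshape/concatenate logic is correct before worrying about parallel speed:

```python
from itertools import product
from functools import reduce
from operator import mul
import numpy as np

def kron_serial(arr, n=2):
    # Same element-wise construction as test1, but without joblib:
    # multiply every n-tuple of flattened elements, then reassemble.
    flat = np.ravel(arr).tolist()
    results = [reduce(mul, x) for x in product(flat, repeat=n)]
    nrows, ncols = arr.shape
    arr_multi_dim = np.array(results).reshape((nrows, ncols) * n)
    return np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# The construction reproduces numpy's own result for n=2:
assert np.array_equal(kron_serial(arr), np.kron(arr, arr))
```

Comparing this serial version against the joblib one also isolates how much of the run time is the product/reduce work itself versus process-pool overhead.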
Your attempts here still spend more on add-on overheads (the costs of process instantiation, plus the data-distribution costs of parameter passing for the remote computing steps and of returning and merging the results) than they save, so they head in the very opposite direction from numpy.

numpy is efficient thanks to a carefully crafted, GIL-free core design, and it can additionally use vectorized processing (i.e. computing more elements in a single CPU instruction step, by exploiting known data alignment, ILP and AVX-like processor instructions).

Given these strong advantages, plus numpy-smart in-place / zero-copy processing (re-using L1/L2/L3-cached data is many orders of magnitude faster than any kind of attempt to set up and operate a set of distributed processes, which must pay the add-on costs of RAM-copy + SER/DES + IPC-marshalling + RAM-copy on the way out, then the computation, then RAM-copy + SER/DES + IPC-marshalling + RAM-copy on the way back), numpy-based smart code will in almost all cases beat any other attempt to do the same thing.
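As an illustration of the kind of in-core vectorized work described above, the whole 3×3 Kronecker product can be expressed as one broadcasted multiply plus a reshape. This is a sketch of the vectorized idea, not a claim about numpy's actual np.kron() implementation:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One vectorized multiply over a broadcast 4-D outer product:
# element [i, k, j, l] holds arr[i, j] * arr[k, l].
# The final reshape to (9, 9) is a view-like relabelling of the same
# contiguous buffer, i.e. no per-element Python work and no IPC.
k = (arr[:, None, :, None] * arr[None, :, None, :]).reshape(9, 9)

assert np.array_equal(k, np.kron(arr, arr))
```

Every element-wise product here is done inside numpy's compiled loop in one pass over cache-resident data, which is exactly the regime the latency table below rewards.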
0.1 ns - NOP
0.3 ns - XOR, ADD, SUB
0.5 ns - CPU L1 dCACHE reference (1st introduced in late 80-ies )
0.9 ns - JMP SHORT
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1 ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
3~4 ns - CPU L2 CACHE reference (2020/Q1)
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
10 ns - DIV
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with a Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
?~~~ 2,500,000 ns - Read 10 MB sequentially from MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
?~~ 25,000,000 ns - Read 100 MB sequentially from MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
30,000,000 ns - Read 1 MB sequentially from a DISK
?~~ 36,000,000 ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
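The Pickle.dump() row above can be made concrete with a small sketch: serializing a roughly 10 MB array produces a byte payload of comparable size, and that payload has to be copied and transported for every parameter sent to, and every result returned from, a spawned process (the array size here is an illustrative choice, not from the original answer):

```python
import pickle
import numpy as np

# ~10 MB of float64 data (10 * 1024 * 1024 bytes / 8 bytes per element).
a = np.ones(10 * 1024 * 1024 // 8)

# SER step: this byte string is what crosses the IPC boundary;
# the receiving process must then pay the matching DES step.
payload = pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
print(len(payload) / 1e6, "MB serialized")
```

A joblib-style call pays this twice per task (arguments out, results back), which is why the per-element multiply savings never recover the transport costs for problems this small.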