I'm playing with pyzmq for inter-process transfer of 4k HDR image data and noticed that this:
byt = np.array2string(np.random.randn(3840,2160,3)).encode()
while True:
    socket.send(byt)
is much much faster than:
byt = np.random.randn(3840,2160,3).asbytes()
while True:
    socket.send(byt)
Can someone explain why? I can't seem to wrap my head around it.
Q: Why is it faster sending... ? Can someone explain why ?
A:
+1 for having asked WHY:
people who understand WHY are those who strive to get to the roots of problems, so as to truly understand the core reasons & can thus design better systems next, knowing the very WHY ( taking no shortcuts by mimicking, emulating or copy/paste-following someone else )
So, let's start:
HDR is not SDR: we will have "a lot of DATA" here to acquire, store, process and .send(). So, who gets faster & WHY?
DATA:
were defined to be a 4K-HDR sized array of triple data-values of numpy's default dtype, whereas the ITU-R Recommendation BT.2100 HDR colourspace requires at least 10 bits per component for its increased colour dynamic range.
The as-is code delivers numpy.random.randn( c4K, r4K, 3 ) with the default dtype of np.float64. Just for the sake of a proper & right-sized system design, HDR ( extending the plain 8-bit sRGB triple-byte colourspace ) shall always prefer int{10|12|16|32|...}-based storage, so as not to skew any numerical image post-processing in the pipeline's later phase(s).
process:
The actual message-payload generating processes were defined to be
Case-A ) np.array2string( ) followed by an .encode() method
Case-B ) a numpy.ndarray-native (sic) .asbytes() method ( numpy.ndarray has no such method; the real name is .tobytes() )
.send():
A ZeroMQ Scalable Formal Communication Archetype pattern ( of unknown type ) finally receives the process-generated message-payload, passed into a ( blocking form of the ) .send() method.
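To put numbers on the DATA point above, here is a minimal sketch comparing the per-frame payload of the float64 default against a 16-bit integer container for 10-bit samples. The uint16 container, the variable names and the tiny synthetic frame are assumptions for illustration; the question's code uses none of them.

```python
import numpy as np

# 4K UHD frame shape as used in the question: columns, rows, colour components
shape = (3840, 2160, 3)
n_values = int(np.prod(shape))

# Bytes per frame for numpy's default float64 vs. a 16-bit integer container
bytes_float64 = n_values * np.dtype(np.float64).itemsize
bytes_uint16 = n_values * np.dtype(np.uint16).itemsize

print(bytes_float64 / 1e6, "MB as float64")
print(bytes_uint16 / 1e6, "MB as uint16")

# A tiny synthetic 10-bit frame (values 0..1023) held in uint16 (assumed layout)
rng = np.random.default_rng(0)
frame = rng.integers(0, 1024, size=(4, 4, 3), dtype=np.uint16)
```

Under these assumptions the integer container carries a quarter of the bytes per frame, before any transport-level optimisation is even considered.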
The core difference is hidden in the fact that we try to compare apples to oranges:
>>> import numpy as np
>>> c4K, r4K = 3840, 2160    # 4K UHD columns and rows, as in the question
>>> len( np.random.randn( c4K, r4K, 3 ).tobytes() ) / 1E6
199.0656 [MB]
>>> len( np.array2string( np.random.randn( c4K, r4K, 3 ) ) ) / 1E6
0.001493 [MB] ... Q.E.D.
While the (sic) .asbytes() method produces a full copy ( incl. RAM-allocation + RAM-I/O-traffic costs in both the [SPACE] and [TIME] domains ), i.e. spending some extra time before ZeroMQ starts its .send()-method ZeroCopy magic:
>>> print( np.random.randn( c4K, r4K, 3 ).tobytes.__doc__ )
a.tobytes(order='C')

Construct Python bytes containing the raw data bytes in the array.

Constructs Python bytes showing a copy of the raw contents of
data memory. The bytes object is produced in C-order by default.
This behavior is controlled by the ``order`` parameter.

.. versionadded:: 1.9.0
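The copy mentioned in the docstring can be observed directly. The following is a minimal sketch on a deliberately tiny array ( an assumption, chosen only to keep the example cheap ): it mutates the array after taking both a .tobytes() copy and a memoryview of it.

```python
import numpy as np

a = np.arange(12, dtype=np.uint16).reshape(2, 2, 3)

copied = a.tobytes()      # independent bytes object: a full copy of the data
view = memoryview(a)      # buffer-protocol view: shares the array's memory

a[0, 0, 0] = 999          # mutate the array after both were created

copy_is_stale = copied != a.tobytes()          # the copy kept the old value
view_is_live = view.tobytes() == a.tobytes()   # the view reflects the change
print(copy_is_stale, view_is_live)
```

The copy no longer matches the array, while the view does, which is why a view is the natural candidate for a ZeroCopy hand-over.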
the other case, Case-A, first throws away (!) a lot (!) of the original 4K-HDR DATA, how much depending on the actual numpy print-options ( matrix-UI-presentation configuration ) settings, even before moving the rest into the .encode() phase:
>>> print( np.array2string( np.random.randn( c4K, r4K, 3 ) ) )
[[[ 1.54482944 -0.23189048 -0.67866246]
  ...
  [ 0.13461456  1.47855833 -1.68885902]]

 [[-0.18963557 -1.1869201   1.34843493]
  ...
  [-0.3022641  -0.44158803  0.75750368]]

 [[-1.05737969  0.864752    0.36359686]
  ...
  [ 1.70240612 -0.12574642 -1.03325878]]

 ...

 [[ 0.41776933  1.73473723  0.28723299]
  ...
  [-0.47635911  0.15901325 -0.56407537]]

 [[-1.41571874  1.66735309  0.6259928 ]
  ...
  [-0.93164127  0.95708002  1.3470873 ]]

 [[ 0.16426176 -0.00317156  0.77522962]
  ...
  [ 0.32960196 -1.74369368 -0.34177759]]]
So, sending less DATA means taking less time to move it.
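A small sketch of why the two payloads are not comparable: the raw bytes can be turned back into the original array on the receiving side, while the summarised string cannot. The array shape here is an arbitrary, smaller stand-in, and np.frombuffer is just one way to decode, assuming the receiver already knows the dtype and the shape.

```python
import numpy as np

# Small stand-in frame, still above numpy's default summarisation threshold
a = np.random.randn(64, 48, 3)

# Case-B style payload: every value survives the round trip
raw = a.tobytes()
restored = np.frombuffer(raw, dtype=a.dtype).reshape(a.shape)

# Case-A style payload: default print options elide most of the array
text = np.array2string(a)

lossless = np.array_equal(restored, a)
summarised = "..." in text
print(lossless, summarised, len(raw), len(text.encode()))
```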
Tips HOW:
ZeroMQ methods & the overall performance will benefit from using the zmq.DONTWAIT flag when passing a reference to the .send() method
try to re-use as much of the great numpy tooling as possible, to minimise repetitive RAM-allocation(s) ( we may pre-allocate once & re-use the allocated variable )
try to use as compact a DATA-representation as possible, if hunting for maximum performance with minimum latency - redundancy-avoided, compact formats that match the CPU-cache-lines' hierarchy & associativity will always win in the race for ultimate performance ( using a view of the internal numpy storage area, i.e. without using any mediating methods to read-access the actual block of 4K-HDR data, may help to move the whole pipeline to become ZeroCopy down to the ZeroMQ .send() pushing the DATA-references only ( i.e. without copying or moving a single byte of DATA from / into RAM, up until loading it onto the wire... )... which is the coolest performance result of our design efforts here, isn't it? )
in any case, in all critical sections, avoid the effects of blocking the flow by calling gc.disable(), so as to at least defer a potential .collect() and keep it from happening "here"
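The ZeroCopy tip above can be sketched as follows. This is a minimal, single-process illustration under stated assumptions, not the asker's actual setup: it uses a PAIR socket pair over a hypothetical inproc://frames endpoint, a reduced uint16 frame, and two helper functions ( send_array / recv_array, names chosen here ) that send dtype and shape as a JSON header frame ahead of the array buffer, which is passed to .send() with copy=False.

```python
import numpy as np
import zmq

def send_array(sock, arr, flags=0):
    """Send dtype/shape as a JSON header frame, then the array buffer without copying."""
    arr = np.ascontiguousarray(arr)
    header = {"dtype": str(arr.dtype), "shape": arr.shape}
    sock.send_json(header, flags | zmq.SNDMORE)
    sock.send(arr, flags, copy=False)

def recv_array(sock, flags=0):
    """Receive the header frame and the data frame, then rebuild the array."""
    header = sock.recv_json(flags=flags)
    frame = sock.recv(flags=flags, copy=False)
    return np.frombuffer(frame.buffer, dtype=header["dtype"]).reshape(header["shape"])

ctx = zmq.Context()
tx = ctx.socket(zmq.PAIR)
rx = ctx.socket(zmq.PAIR)
tx.bind("inproc://frames")       # hypothetical endpoint name
rx.connect("inproc://frames")

# Reduced synthetic 10-bit frame in a uint16 container (assumed layout)
rng = np.random.default_rng(0)
frame_out = rng.integers(0, 1024, size=(216, 384, 3), dtype=np.uint16)

send_array(tx, frame_out)
frame_in = recv_array(rx)

round_trip_ok = np.array_equal(frame_in, frame_out)
print(round_trip_ok)

tx.close(linger=0)
rx.close(linger=0)
ctx.term()
```

With copy=False the sender must not modify the array until ZeroMQ is done with it; whether the DONTWAIT flag or gc.disable() adds anything on top is best measured on the real pipeline.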