
Numpy array to ctypes with FORTRAN ordering

Is there a performant way to convert a numpy array to a FORTRAN ordered ctypes array, ideally without a copy being required, and without triggering problems related to strides?

import numpy as np

# Sample data
n = 10000
A = np.zeros((n,n), dtype=np.int8)
A[0,1] = 1

def slow_conversion(A):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(A.T))

assert slow_conversion(A)[1][0] == 1

Performance analysis of just the call to as_ctypes:

%%timeit
np.ctypeslib.as_ctypes(A)

3.35 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Performance analysis of the supplied (slow) conversion

%%timeit
slow_conversion(A)

206 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The ideal outcome would be to get performance similar to that of just the as_ctypes call.

Requirements:

  • numpy array in column main order (Fortran or 'F' order)
  • fast conversion to ctypes type
  • avoid problems with strides

One possible way would be to create the array already with the internal Fortran memory layout:

A = np.zeros((n, n), dtype=np.int8, order='F')

Then the conversion could look like this:

def fast_conversion(arr):
    return np.ctypeslib.as_ctypes(arr.flatten('F').reshape(arr.shape))

You could omit .reshape(arr.shape) if you only need a one-dimensional array - but in terms of performance, there should be no difference.

How does it work?

arr.flatten('F') returns the array collapsed into one dimension. Since the array is already in F order, this is fast. The subsequent reshape then applies the original shape back without changing the data. By the way: since we are working with an F-ordered array, we could also use arr.flatten('K') , about which the documentation says:

'K' means to flatten a in the order the elements occur in memory.

see https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html

It is important that the array is created with F order. Otherwise fast_conversion would be as slow as slow_conversion.
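To sanity-check how the pieces of fast_conversion fit together, here is a small script (variable names are illustrative): flatten('F') makes one fast sequential copy, the reshape is then a free view, and the result is C-contiguous, which is what as_ctypes needs.

```python
import numpy as np

n = 100
F = np.zeros((n, n), dtype=np.int8, order='F')
F[0, 1] = 1

# flatten('F') reads elements in column order; for an F-ordered array
# this is a single sequential pass over memory, hence fast.
flat = F.flatten('F')

# reshape restores the 2-D shape without copying: it is a view.
back = flat.reshape(F.shape)
assert np.shares_memory(flat, back)

# The result is C-contiguous, which as_ctypes requires, and the
# element written at F[0, 1] now appears at [1][0], i.e. transposed.
ct = np.ctypeslib.as_ctypes(back)
assert ct[1][0] == 1
```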

Test

import timeit
import numpy as np

# Sample data
n = 10000
A = np.zeros((n, n), dtype=np.int8)
A[0, 1] = 1

B = np.zeros((n, n), dtype=np.int8, order='F')
B[0, 1] = 1


def slow_conversion(arr):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(arr.T))


def fast_conversion(arr):
    return np.ctypeslib.as_ctypes(arr.flatten('F').reshape(arr.shape))


assert slow_conversion(A)[1][0] == 1
assert fast_conversion(B)[1][0] == 1

loops = 10
slow_result = timeit.timeit(lambda: slow_conversion(A), number=loops)
print(f'slow: {slow_result / loops}')

fast_result = timeit.timeit(lambda: fast_conversion(B), number=loops)
print(f'fast: {fast_result / loops}')

The test gives as output:

slow: 0.45553940839999996
fast: 0.02264067879987124

The fast version is therefore about 20 times faster than the slow version.

As you've pointed out, np.ctypeslib.as_ctypes(...) is fast.

The bottleneck of your computation is in np.ascontiguousarray(A.T) - it is equivalent to np.asfortranarray(A) , which is equally slow on large arrays.


This leads me to believe that this can't be made faster using only numpy functions. I mean, since a whole dedicated function to do it already exists - we'd assume it has the best possible performance.

By default, Numpy creates C-ordered (row-major) arrays, because it is written in C. Taking the transpose with A.T creates a view of the array with reversed strides (i.e. no copy). However, np.ascontiguousarray then makes a copy, because the transposed view is no longer C-contiguous, and such copies are expensive. This is why slow_conversion is slow. Note that contiguity can be tested with yourarray.flags['F_CONTIGUOUS'] and by checking yourarray.strides . Note also that yourarray.flags and yourarray.__array_interface__ tell you whether the array has been copied and provide information about strides.
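These flag and stride claims are easy to verify on a small array (sizes here are illustrative):

```python
import numpy as np

n = 4
A = np.arange(n * n, dtype=np.int8).reshape(n, n)  # C-ordered

# A.T is a view: same memory, strides reversed, no copy made.
assert np.shares_memory(A, A.T)
assert A.strides == (n, 1)
assert A.T.strides == (1, n)
assert A.flags['C_CONTIGUOUS'] and not A.flags['F_CONTIGUOUS']
assert A.T.flags['F_CONTIGUOUS'] and not A.T.flags['C_CONTIGUOUS']

# ascontiguousarray must copy, because A.T is not C-contiguous.
B = np.ascontiguousarray(A.T)
assert not np.shares_memory(A, B)
```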

According to the documentation, np.asfortranarray returns an array laid out in Fortran order in memory. It performs a copy only if needed: np.asfortranarray(A) makes a copy, while np.asfortranarray(A.T) does not. You can check the C code of the function for more information about this behaviour. Since both results are FORTRAN-contiguous, it is better to use np.asfortranarray(A.T) , which makes no copy in this case.
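The copy-only-if-needed behaviour can be observed directly with np.shares_memory:

```python
import numpy as np

A = np.zeros((3, 3), dtype=np.int8)  # C-ordered

# A is not F-contiguous, so asfortranarray must copy it...
assert not np.shares_memory(A, np.asfortranarray(A))

# ...but A.T already is F-contiguous, so no copy is made.
assert np.shares_memory(A, np.asfortranarray(A.T))
```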

Regarding ctypes: it deals with C arrays, which are stored in row-major order, as opposed to FORTRAN arrays, which are stored in column-major order. Furthermore, C arrays do not natively support strides, unlike FORTRAN ones: a row is simply a view of contiguous data in memory. Since slow_conversion(A)[1][0] == 1 is required to be true, the returned object must have the first item of its second row equal to 1, which means the columns of A are necessarily stored contiguously in memory. The problem is that the initial array is C-contiguous, not FORTRAN-contiguous, so a transposition is required . Transpositions are pretty expensive in general (and the Numpy implementation is suboptimal on top of that).
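A short demonstration of the row-major/no-strides point (sizes illustrative): as_ctypes indexes the same memory row-major as numpy, and it refuses a strided view outright.

```python
import numpy as np

A = np.zeros((2, 3), dtype=np.int8)  # C-ordered
A[1, 0] = 7

# ctypes sees the same row-major memory, so indices match numpy's.
ct = np.ctypeslib.as_ctypes(A)
assert ct[1][0] == 7

# A transposed (strided, non-C-contiguous) view cannot be wrapped:
try:
    np.ctypeslib.as_ctypes(A.T)
except (TypeError, ValueError):
    rejected = True  # as_ctypes does not support strided arrays
assert rejected
```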

Assuming you do not want to pay the overhead of a copy/transposition, the problem needs to be relaxed. There are a few possible options:

  • Create a FORTRAN-ordered array directly with Numpy, for example np.zeros((n,n), dtype=np.int8, order='F') . This creates a C array with transposed strides, so it behaves like a FORTRAN array in which column-wise computations are fast (remember that Numpy is written in C, so row-major arrays are the reference). With this, the first row in ctypes is actually a column. Note that you should be very careful when mixing C- and FORTRAN-ordered arrays, for the sake of performance, as non-contiguous accesses are much slower.
  • Use a strided FORTRAN array. This basically means that basic column-based computations will be much slower, and one needs to write row-based computations, which are quite unusual in FORTRAN. You can extract the pointer to the C-contiguous array with A.ctypes.data_as(POINTER(c_double)) , the strides with A.strides and the shape with A.shape . That said, this solution does not appear to be really portable/standard. The standard way appears to be using C binding in FORTRAN. I am not very familiar with this, but you can find a complete example in this answer .
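The pointer-plus-strides extraction from the second option can be sketched as follows (a toy example: the receiving C/FORTRAN code is not shown, and the element-stride arithmetic is my own illustration of how a callee would index the buffer):

```python
import ctypes
import numpy as np

A = np.zeros((4, 4), dtype=np.float64, order='F')
A[0, 1] = 1.0

# Raw pointer to the underlying buffer: no copy, no transposition.
ptr = A.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# The callee also needs the shape and the strides (here in elements,
# not bytes, for easier pointer arithmetic).
shape = A.shape
strides_elems = tuple(s // A.itemsize for s in A.strides)

# Reading element (0, 1) through the pointer, honoring the strides:
flat_index = 0 * strides_elems[0] + 1 * strides_elems[1]
assert ptr[flat_index] == 1.0
```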

A last solution is to transpose the data in place manually, using a fast transposition algorithm. This is faster than an out-of-place transposition, but it requires a square array and cannot be done using Numpy directly. Moreover, it mutates the input array, which should then not be used later (unless it is fine to operate on a transposed array). One solution is to do it in Numba , or in C or directly in FORTRAN (using a wrapper function in all cases). This should be significantly faster than what Numpy does, but still far slower than a basic ctypes wrapping.
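For illustration only, here is a pure-NumPy sketch of an in-place blocked transpose of a square array; the function name and block size are my own choices, and a real implementation would be written in Numba, C, or FORTRAN as suggested above.

```python
import numpy as np

def transpose_inplace(a, block=64):
    """Transpose the square array `a` in place by swapping pairs of
    block-sized tiles, which keeps memory accesses cache-friendly."""
    n = a.shape[0]
    assert a.shape[0] == a.shape[1], "in-place transpose needs a square array"
    for i in range(0, n, block):
        for j in range(i, n, block):
            bi = slice(i, min(i + block, n))
            bj = slice(j, min(j + block, n))
            if i == j:
                # Diagonal tile: transpose it within itself
                # (the copy avoids aliasing during assignment).
                a[bi, bj] = a[bi, bj].T.copy()
            else:
                # Off-diagonal tiles: swap the two mirrored tiles, transposed.
                tmp = a[bi, bj].T.copy()
                a[bi, bj] = a[bj, bi].T
                a[bj, bi] = tmp

A = np.arange(25, dtype=np.int8).reshape(5, 5)
expected = A.T.copy()
transpose_inplace(A, block=2)  # mutates A
assert np.array_equal(A, expected)
```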

There is one more aspect that can be improved here.

The operation is slow not only because it makes a copy, but because it loads from and stores to main memory instead of working from cache.

Usually a processor accesses memory in blocks of multiple contiguous bytes (cache lines) and serves subsequent accesses from cache. If you run out of cache space, some old block is evicted.

For the sake of argument, let's say that your CPU works on blocks of 8 bytes and that rows are contiguous in memory. In one matrix you will access columns, and in the other you will access rows. When you write down a column, you load multiple columns but update only one. This overhead can be seen by copying a few columns

n = 2**14
A = np.random.randint(0, 100, (n,n), dtype=np.int8)
B = np.empty_like(A)

%%timeit
B[:,:1] = A[:,:1]

%%timeit
B[:,:4] = A[:,:4]

If you do the same on rows, you should notice roughly linear scaling. If you copy columns, the cost of copying one column is very close to the cost of copying two, or even 8 or 16, depending on the hardware.

I will use n=2**14 to make things easier, but the principle applies to any dimension.

  • If you have a small enough matrix, let's say 8 x 8, the entire matrix fits in cache, so you can transpose it without going back to main memory.
  • If you are copying large contiguous blocks of data, then even if the entire operation does not fit in cache, you reduce the number of times a given piece of data is loaded from or stored to memory.

Based on this, what I tried is to rearrange the matrix into a matrix of smaller contiguous blocks: first transpose the elements within each block, then the blocks within the matrix.

For the baseline

B = np.ascontiguousarray(A.T)
3.12 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using 8x8 blocks

T0 = A.reshape(2048,8,2048,8)
T1 = np.ascontiguousarray(T0.transpose(0,2,3,1))
T2 = np.ascontiguousarray(T1.transpose(1,0,2,3))
T3 = np.ascontiguousarray(T2.transpose(0,2,1,3))
B = T3.reshape(A.shape)
786 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
assert np.all(B == A.T) # 2.8s

It is still 200x slower than a simple copy, but it is already 4x faster than the original approach.
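The three-step block transposition above can be checked for correctness at a smaller size (the values below are scaled-down stand-ins for n=2**14 and 8x8 blocks):

```python
import numpy as np

n, b = 16, 4  # small stand-ins for n=2**14 and block size 8
A = np.random.randint(0, 100, (n, n), dtype=np.int8)

T0 = A.reshape(n // b, b, n // b, b)
T1 = np.ascontiguousarray(T0.transpose(0, 2, 3, 1))  # transpose within each block
T2 = np.ascontiguousarray(T1.transpose(1, 0, 2, 3))  # transpose the grid of blocks
T3 = np.ascontiguousarray(T2.transpose(0, 2, 1, 3))  # restore row-major block layout
B = T3.reshape(A.shape)

assert np.array_equal(B, A.T)
```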

Allocating only two instead of three temporary arrays helps as well

T0 = np.empty_like(A)
T1 = np.empty_like(A)
T0.reshape(2048,2048,8,8)[:] = A.reshape(2048,8,2048,8).transpose(0,2,3,1)
T1.reshape(2048,2048,8,8)[:] = T0.reshape(2048,2048,8,8).transpose(1,0,2,3)
T0.reshape(2048,8,2048,8)[:] = T1.reshape(2048,2048,8,8).transpose(0,2,1,3)
B = T0
686 ms ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

