
Faster way to convert list of objects to numpy array

I am trying to optimize my code by replacing for loops and list comprehensions with numpy arrays. Overall the code now runs faster, BUT one thing bothers me a lot: converting my list of about 110,000 elements to a numpy array takes most of the program's runtime (5 to 7 seconds, just to initialize an array!)

I have this

rec = np.array(records)

where records is a list of objects.

Is it possible to speed up the creation of this numpy array?

If your list is one-dimensional, using np.fromiter() is faster than the typical np.array(). Benchmark for an integer list of size 10000:

a = [1,2,3,4,5,6,7,8,9,10]*1000

%timeit np.array(a,dtype=np.int32)
456 µs ± 7.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.fromiter(a,dtype=np.int32,count=10000)
242 µs ± 6.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Python does not store objects (such as the items in records) the same way numpy does. To create the numpy array, each element must therefore be accessed individually and then converted.
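To see what this means in practice, here is a minimal sketch of the attribute-wise alternative: instead of building one object array from the list, pull each field into its own typed array with np.fromiter. The Record class and its x/y attributes are hypothetical stand-ins for the question's objects.

```python
import numpy as np

# Hypothetical record type standing in for the objects in `records`.
class Record:
    def __init__(self, x, y):
        self.x = x
        self.y = y

records = [Record(i, 2.0 * i) for i in range(100_000)]

# np.array(records) would produce a slow dtype=object array.
# Extracting each attribute into its own typed array avoids that;
# passing count= lets numpy allocate the result in one step.
xs = np.fromiter((r.x for r in records), dtype=np.int64, count=len(records))
ys = np.fromiter((r.y for r in records), dtype=np.float64, count=len(records))
```

The resulting arrays are contiguous, typed, and usable with vectorized numpy operations, which an object array is not.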

As @anmol_uppoal's comment suggests, you should create a numpy array from the outset. For example:

# Pre-allocate the array, then fill it in place instead of
# building a list first and converting afterwards.
rec = np.zeros((SIZE_OF_ARRAY,))
# Set values of rec in the same way you created records, for instance:
for i in range(100):
    rec[i] = i + 1

Further optimisations depend on where the data comes from: if from a file, try storing it in a numpy binary format rather than text; if from a database, consider saving the binary values (though this depends heavily on the rest of your application).
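As a sketch of the file-based suggestion: numpy's own .npy format round-trips an array without any text parsing on reload. The file path here is a temporary file chosen for the example.

```python
import os
import tempfile
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Save in numpy's binary .npy format; reloading is a raw read,
# not a parse of a text representation.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)
loaded = np.load(path)
```

Loading a .npy file restores the exact dtype and shape, so no conversion step is needed afterwards.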

When reading data, it generally does not arrive as numpy arrays. The netcdf library (which has bindings for Java and Python) handles multivariable, multifile datasets in HDF5 or variants. It uses its own optimized internal array type, not the numpy type, so a conversion is unavoidable.

As an example, converting from a netcdf dataset takes several minutes; step-by-step timing shows that the conversion alone accounts for the time:

with CodeTimer("convert np.array"): tcc_obs = np.asarray(tcc_obs)

For one array of shape (725, 759, 96), ~53 M elements: Code block 'convert np.array' took: 161 s

If anybody has a way to do this better, I'd like to hear it. Logically it should be better to subset the netcdf variable with boolean arrays and then convert the smaller array, but the code uses numpy functions such as isin to find the common indexes of two tables, and I know of no equivalent for plain Python arrays.
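The subset-first idea can be sketched as follows. The names var, all_idx, and wanted are hypothetical: var stands in for the netcdf variable (an ordinary numpy array here), and wanted for the indexes shared by the two tables.

```python
import numpy as np

# Stand-in for the netcdf variable: 4 rows x 12 columns.
var = np.arange(48).reshape(4, 12)
all_idx = np.arange(12)              # index coordinate of the last axis
wanted = np.array([2, 5, 7])         # indexes common to both tables

# np.isin builds a boolean mask of the common indexes ...
mask = np.isin(all_idx, wanted)
# ... so only the needed columns are copied/converted, not the full array.
subset = var[:, mask]
```

With a real netcdf variable, applying the mask in the slice means only the selected region is read and converted, rather than materializing all ~53 M elements first.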

