How to convert 2D arrays in dictionary into one single array?

I have the following code:

import random
import numpy as np
import pandas as pd

num_seq = 100
len_seq = 20
nts = 4
sequences = np.random.choice(nts, size = (num_seq, len_seq), replace=True)
sequences = np.unique(sequences, axis=0)  # sorts the rows and drops any duplicates

d = {}
pr = 5

for i in range(num_seq):
    globals()['seq_' + str(i)] = np.tile(sequences[i,:],(pr,1))
    d['seq_' + str(i)] = np.tile(sequences[i,:],(pr,1))

pool = np.empty((0,len_seq),dtype=int)
for i in range(num_seq):
    pool = np.concatenate((pool,eval('seq_' +str(i))))

I want to convert the dictionary d into a single NumPy array (or a dictionary with just one entry). My code works and produces pool, but at larger values of num_seq, len_seq and pr it takes a very long time.

Execution time is critical, hence my question: is there a more efficient way of doing this?

Here is a list of important points:

  • np.concatenate runs in O(n), so your second loop runs in O(n^2) time overall. Instead, append each array to a Python list and call np.vstack once at the end (O(n) in total).
  • Accessing globals() is slow and considered bad practice (it can easily break your code in nasty ways).
  • Calling eval(...) is slow too, and also unsafe, so avoid it.
  • The default CPython interpreter does not optimize duplicated expressions: it recomputes them every time.
  • You can use Cython or Numba to speed up the code further (note that dictionary support in Numba is experimental).
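To illustrate the first point, here is a minimal sketch (with small, hypothetical sizes and stand-in data) comparing the quadratic concatenate loop against building a list and stacking once; both produce the same array:

```python
import numpy as np

# small stand-in data, just for the comparison
num_seq, len_seq, pr = 100, 20, 5
rng = np.random.default_rng(0)
d = {f'seq_{i}': np.tile(rng.integers(0, 4, len_seq), (pr, 1))
     for i in range(num_seq)}

# O(n^2): every concatenate copies everything accumulated so far
pool = np.empty((0, len_seq), dtype=int)
for i in range(num_seq):
    pool = np.concatenate((pool, d[f'seq_{i}']))

# O(n): collect the blocks in a Python list, stack once at the end
pool_fast = np.vstack([d[f'seq_{i}'] for i in range(num_seq)])

assert np.array_equal(pool, pool_fast)
```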

Here is an example of faster code (in replacement of the second loop):

pool = np.vstack([d[f'seq_{i}'] for i in range(num_seq)])
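Going a step further (a sketch, not part of the answer above): since every entry of d is just one row of sequences tiled pr times, the dictionary and both loops can be skipped entirely with np.repeat, which repeats each row pr consecutive times and yields the same pool:

```python
import numpy as np

num_seq, len_seq, nts, pr = 100, 20, 4, 5
sequences = np.random.choice(nts, size=(num_seq, len_seq), replace=True)
sequences = np.unique(sequences, axis=0)  # sorts rows, drops duplicates

# each row repeated pr consecutive times == the stacked np.tile blocks
pool = np.repeat(sequences, pr, axis=0)
```

Note that np.repeat with axis=0 repeats rows consecutively (a, a, b, b, ...), which matches stacking the np.tile blocks in order, whereas np.tile on the whole array would cycle through all rows instead.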
