I have a numpy array output of shape (1000,4). It contains 1000 quadruples with no repetitions, and each quadruple is ordered (i.e. an element looks like [0,1,2,3]). I want to count how many times each possible quadruple occurs. In practice, I use the following code:
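import itertools
import numpy as np

# all ordered quadruples drawn from range(32)
comb = np.array(list(itertools.combinations(range(32), 4)))

def counting(comb, output):
    n_output = np.zeros(comb.shape[0])
    for i in range(comb.shape[0]):
        k = 0
        # compare combination i against every row of output
        for j in range(output.shape[0]):
            if (output[j] == comb[i]).all():
                k += 1
        n_output[i] = k
    return n_output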
How can I optimize the code? At the moment it takes 30 s to run.
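Your current implementation is inefficient for 2 reasons: its algorithmic complexity is O(n^2), because every combination is compared against every row of output, and it relies on slow Python-level loops. You can write a simple O(n) algorithm using Python sets (still with a loop), since output does not have any repetitions. Here is the result: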
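def countingFast(comb, output):
    n_output = np.zeros(comb.shape[0])
    # build a set of row tuples once, so each membership test is O(1) on average
    tmp = set(map(tuple, output))
    for i in range(comb.shape[0]):
        # output has no repetitions, so each count is either 0 or 1
        n_output[i] = int(tuple(comb[i]) in tmp)
    return n_output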
On my machine, using the described input sizes, the original version takes 55.2 seconds while this implementation takes 0.038 seconds. This is roughly 1400 times faster.
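If the remaining Python loop still matters, the membership test can also be done without it. The following is only a minimal sketch and was not part of the timing above. It assumes every value lies in range(32), as in the question's comb, and packs each quadruple into a single integer so that np.isin can compare whole rows at once. The names encode_rows and counting_vectorized are hypothetical.

```python
import itertools
import numpy as np

def encode_rows(rows, base=32):
    """Pack each row of small non-negative ints into one integer key (assumes values < base)."""
    rows = np.asarray(rows, dtype=np.int64)
    # positional weights, e.g. [32**3, 32**2, 32, 1] for rows of length 4
    weights = base ** np.arange(rows.shape[1] - 1, -1, -1, dtype=np.int64)
    return rows @ weights

def counting_vectorized(comb, output, base=32):
    """Return 1.0 where a row of comb appears in output, else 0.0 (output has no repeats)."""
    return np.isin(encode_rows(comb, base), encode_rows(output, base)).astype(float)

# sample data: 1000 distinct quadruples picked from all combinations
comb = np.array(list(itertools.combinations(range(32), 4)))
rng = np.random.default_rng(0)
output = comb[rng.choice(comb.shape[0], size=1000, replace=False)]

n_output = counting_vectorized(comb, output)
print(int(n_output.sum()))  # 1000, one hit per row of output
```

Like the set-based version, this relies on output containing no repeated rows; with repetitions it would still report 1 rather than the true count.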
You can generate a boolean array indicating whether each row of your array equals the sequence you want to check. Because numpy's boolean arrays can be summed, you can then use this result to get the total number of matching rows.
A basic approach could look like this (including sample data generation):
import numpy as np
# set seed value of random generator to fixed value for repeatable output
np.random.seed(1234)
# create a random array with 950x4 elements
arr = np.random.rand(950, 4)
# create a 50x4 array with sample sequence
# this is the sequence we want to count in our final array
sequence = [0, 1, 2, 3]
sample = np.array([sequence, ]*50)
# stack arrays to create sample data with 1000x4 elements
arr = np.vstack((arr, sample))
# shuffle array to get a random distribution of random sample data and known sequence
np.random.shuffle(arr)
# compare every row with the sequence; a row matches only if all four elements are equal
results = np.all(np.equal(sequence, arr), axis=1)
# sum the boolean array to get the total number of matching rows
occurences = np.sum(results)
print(occurences)
# --> 50
You need to repeat these lines for each sequence you are interested in, so it is useful to wrap them in a function like this:
def number_of_occurences(data, sequence):
    # a row matches only if all of its elements equal the sequence
    results = np.all(np.equal(sequence, data), axis=1)
    return np.sum(results)
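Calling a function like this once for each of the 35960 possible combinations repeats the full comparison every time. As a hedged alternative, np.unique with axis=0 and return_counts=True counts every distinct row in a single pass, and it also handles rows that appear more than once. The sketch below uses made-up sample data, and count_rows is a hypothetical helper name.

```python
import numpy as np

def count_rows(data):
    """Map each distinct row of data (as a tuple) to the number of times it occurs."""
    rows, counts = np.unique(np.asarray(data), axis=0, return_counts=True)
    return {tuple(int(v) for v in row): int(c) for row, c in zip(rows, counts)}

# made-up sample: the row [0, 1, 2, 3] appears twice
data = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], [0, 1, 2, 4]])
row_counts = count_rows(data)

print(row_counts[(0, 1, 2, 3)])         # 2
print(row_counts.get((9, 9, 9, 9), 0))  # 0, this row never occurs
```

Looking up a quadruple with .get(..., 0) then gives its count without another scan of the array.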