
Frequencies of elements in 2D numpy array

I have a numpy array output of shape (1000,4). It contains 1000 quadruples; within each quadruple the elements are sorted and not repeated (e.g. a row is [0,1,2,3]). I want to count how many times each possible quadruple occurs. In practice, I use the following code:

import itertools
import numpy as np

comb=np.array(list(itertools.combinations(range(32),4)))
def counting(comb, output):
    k=0
    n_output=np.zeros(comb.shape[0])
    for i in range(comb.shape[0]):
        k=0
        for j in range(output.shape[0]):
            if (output[j]==comb[i]).all():
                k+=1
        n_output[i]=k
    return n_output

How can I optimize the code? At the moment it takes 30 s to run.

Your current implementation is inefficient for 2 reasons:

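  • the complexity of the algorithm is O(n·m): each of the 35,960 combinations is compared with each of the 1000 rows of output, about 36 million comparisons;
  • it makes use of (slow) CPython loops.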

You can write a simple linear-time algorithm using Python sets (still with a loop), assuming output does not contain any repeated rows, in which case each count is either 0 or 1. Here is the result:

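def countingFast(comb, output):
    n_output=np.zeros(comb.shape[0])
    # store the rows of output in a set for O(1) average membership tests
    tmp = set(map(tuple, output))
    for i in range(comb.shape[0]):
        # 1 if the combination occurs in output, 0 otherwise
        n_output[i] = int(tuple(comb[i]) in tmp)
    return n_output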

On my machine, using the described input sizes, the original version takes 55.2 seconds while this implementation takes 0.038 seconds. This is roughly 1400 times faster.
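
If output can contain the same quadruple more than once (which the question's "how many times" suggests), a set only records presence, not multiplicity. Below is a minimal sketch that swaps the set for collections.Counter, assuming the same comb and output arrays as in the question. The function name counting_with_repeats is made up for illustration.

```python
import itertools
from collections import Counter

import numpy as np


def counting_with_repeats(comb, output):
    # Hypothetical helper: counts repeated rows instead of only testing presence.
    # Convert rows to tuples of plain Python ints so they can be used as dict keys.
    counts = Counter(map(tuple, output.tolist()))
    # Counter returns 0 for combinations that never occur in output.
    return np.array([counts[tuple(c)] for c in comb.tolist()], dtype=np.int64)
```

This is still a Python-level loop, but each lookup is a hash-table access, so the work grows with the number of combinations plus the number of rows rather than with their product.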

You can generate a boolean array that marks, for each row of your array, whether it equals the sequence you want to check. As numpy's boolean arrays can be summed, you can then sum this result to get the total number of matching rows.

A basic approach could look like this (including sample data generation):

import numpy as np

# set seed value of random generator to fixed value for repeatable output
np.random.seed(1234)

# create a random array with 950x4 elements
arr = np.random.rand(950, 4)

# create a 50x4 array with sample sequence
# this is the sequence we want to count in our final array
sequence = [0, 1, 2, 3]
sample = np.array([sequence, ]*50)

# stack arrays to create sample data with 1000x4 elements
arr = np.vstack((arr, sample))

# shuffle array to get a random distribution of random sample data and known sequence
np.random.shuffle(arr)

# compare each row with the sequence element-wise and require all four elements to match
# this returns a boolean array with one entry per row
results = np.equal(arr, sequence).all(axis=1)

# sum the boolean array to get the total number of matching rows
occurrences = np.sum(results)

print(occurrences)
# --> 50

You need to repeat these lines for each sequence you are interested in, so it is useful to wrap them in a function like this:

def number_of_occurrences(data, sequence):
    # a row matches only if all of its elements equal the sequence
    results = np.equal(data, sequence).all(axis=1)
    return np.sum(results)
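
Calling such a function once per sequence still means one pass over the data for each of the 35,960 combinations. As an alternative, here is a rough sketch that counts every combination in one vectorized pass by encoding each row as a single integer and using np.bincount. It assumes all values are integers in range(32), as in the question; the name counting_vectorized and the base parameter are illustrative, not part of the original answers.

```python
import itertools

import numpy as np


def counting_vectorized(comb, output, base=32):
    # Assumption: every entry of comb and output is an integer in [0, base).
    comb = np.asarray(comb, dtype=np.int64)
    output = np.asarray(output, dtype=np.int64)
    n = comb.shape[1]
    # Positional weights, e.g. [32**3, 32**2, 32, 1] for quadruples.
    weights = base ** np.arange(n - 1, -1, -1, dtype=np.int64)
    # Each row becomes a unique integer key (a base-32 number).
    comb_keys = comb @ weights
    out_keys = output @ weights
    # Histogram of the keys found in output, one bin per possible key.
    counts = np.bincount(out_keys, minlength=base ** n)
    # Read off the count for each combination.
    return counts[comb_keys]
```

The histogram has base ** n bins (about one million for quadruples over 32 values), which is cheap here but would not scale to much longer tuples or larger value ranges.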
