Filtering a numpy array by frequencies of certain elements in it

I have a numpy array and a dictionary similar to below:

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

For each key-value pair (k, v) in d, k should appear exactly v times in the second column of arr1. Clearly, that doesn't happen here.

So, from arr1, I want to create another array in which every element of the second column appears exactly the number of times it's supposed to according to d. In other words, my desired outcome is:

np.array([['a1','x'],['a2','x'],['a5','z']])

I can get my desired outcome using a list comprehension:

ans = [[x1,x2] for x1,x2 in arr1 if np.count_nonzero(arr1==x2)==d[x2]]

but I was wondering if it's possible to do this using only numpy.
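As an aside, np.count_nonzero(arr1 == x2) in the comprehension above counts matches anywhere in the array, which works here only because none of the keys ever appear in the first column. Restricting the count to the second column is presumably safer; a minimal variant of the same comprehension:

import numpy as np

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

# count only within the second column, in case a key ever shows up in column 0
ans = [[x1, x2] for x1, x2 in arr1 if np.count_nonzero(arr1[:, 1] == x2) == d[x2]]
print(ans)  # keeps the rows for 'a1', 'a2' and 'a5'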

This does what you want:

import numpy as np

arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

# get the actual counts of the values in the second column of arr1
counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
# determine what values to keep, as their count matches the desired count
keep = [x for x in d if x in counts and d[x] == counts[x]]
# filter down the array
result = arr1[list(map(lambda x: x[1] in keep, arr1))]
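As a side note, the final filtering step could also be done with a vectorized boolean mask via np.isin instead of the Python-level map; a small, untimed sketch assuming the same keep list as above:

import numpy as np

arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
keep = ['x', 'z']  # as computed above

# boolean mask over the second column instead of mapping over the rows
result = arr1[np.isin(arr1[:, 1], keep)]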

There may well be a more efficient way to do this in numpy, but without knowing how big your data set is, or how often you need to run this, it's hard to say whether searching for one is worthwhile.

Edit: Note that you need to scale things up to decide what a good solution is. Your original solution is great for toy examples; it outperforms both answers. But the pure numpy solution provided by @NewbieAF beats the rest handily once you scale up to more realistic workloads:

from random import randint
from timeit import timeit
import numpy as np


def original(arr1, d):
    return [[x1, x2] for x1, x2 in arr1 if np.count_nonzero(arr1 == x2) == d[x2]]


def f1(arr1, d):
    # get the actual counts of the values in the second column of arr1
    counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
    # determine what values to keep, as their count matches the desired count
    keep = [x for x in d if x in counts and d[x] == counts[x]]
    # filter down the array
    return arr1[list(map(lambda x: x[1] in keep, arr1))]


def f2(arr1, d):
    # create arrays from d
    keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
    # count the unique elements in arr1[:,1]
    unqs, cts = np.unique(arr1[:,1], return_counts=True)

    # only keep track of elements that appear in arr1
    mask = np.isin(keys,unqs)
    keys, vals = keys[mask], vals[mask]

    # sort the unique values and corresponding counts according to keys
    idx1 = np.argsort(np.argsort(keys))
    idx2 = np.argsort(unqs)
    unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]

    # filter values by whether the counts match
    correct = unqs[vals==cts]

    return arr1[np.isin(arr1[:,1],correct)]


def main():
    arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
    d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

    print(timeit(lambda: original(arr1, d), number=10000))
    print(timeit(lambda: f1(arr1, d), number=10000))
    print(timeit(lambda: f2(arr1, d), number=10000))

    counts = [randint(1, 3) for _ in range(10000)]
    arr1 = np.array([['x', f'{n}'] for n in range(10000) for _ in range(counts[n])])
    d = {f'{n}': randint(1, 3) for n in range(10000)}

    print(timeit(lambda: original(arr1, d), number=10))
    print(timeit(lambda: f1(arr1, d), number=10))
    print(timeit(lambda: f2(arr1, d), number=10))

main()

Result:

0.14045359999999998    # original, toy input
0.2402685              # f1, toy input
0.5027185999999999     # f2, toy input
46.7569239             # original, large input
5.893172499999999      # f1, large input
0.08729539999999503    # f2, large input

The pure numpy solution (f2) is the slowest on the toy example, but orders of magnitude faster on a large input. Your solution holds up on small inputs, but once scaled up it loses even to my mixed solution (f1), which avoids the repeated np.count_nonzero calls.

Consider the size of your problem. If the problem is small, pick your own solution for readability. If it is medium-sized, you might pick mine for the bump in performance. If it is large (in size, or in how often you run it), opt for the all-numpy solution, sacrificing readability for speed.

After a bit of playing around with np.argsort(), I found a pure numpy solution. You just have to reorder the unique values (and counts) of arr1's second column so they line up with the keys of d, then compare the counts against an array version of d.values().

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

# create arrays from d
keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
# count the unique elements in arr1[:,1]
unqs, cts = np.unique(arr1[:,1], return_counts=True)

# only keep track of elements that appear in arr1
mask = np.isin(keys,unqs)
keys, vals = keys[mask], vals[mask]

# sort the unique values and corresponding counts according to keys
idx1 = np.argsort(np.argsort(keys))
idx2 = np.argsort(unqs)
unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]

# filter values by whether the counts match
correct = unqs[vals==cts]

# keep subarray where the counts match
ans = arr1[np.isin(arr1[:,1],correct)]

print(ans)
# [['a1' 'x']
#  ['a2' 'x']
#  ['a5' 'z']]
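For what it's worth, the double argsort above computes the rank of each key so that unqs and cts can be reordered to match keys. Since np.unique already returns unqs in sorted order, the same alignment can presumably be done more directly with np.searchsorted; a minimal, untimed sketch under that assumption, with variable names mirroring the code above:

import numpy as np

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
unqs, cts = np.unique(arr1[:,1], return_counts=True)  # unqs comes back sorted

# only keep track of elements that appear in arr1
mask = np.isin(keys, unqs)
keys, vals = keys[mask], vals[mask]

# look up each key's count directly; valid because unqs is sorted
cts_for_keys = cts[np.searchsorted(unqs, keys)]

# keep subarray where the counts match
ans = arr1[np.isin(arr1[:,1], keys[vals == cts_for_keys])]
print(ans)
# [['a1' 'x']
#  ['a2' 'x']
#  ['a5' 'z']]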
