
Filtering a numpy array by frequencies of certain elements in it

I have a numpy array and a dictionary similar to the ones below:

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

For each key-value pair (k, v) in d, k should appear exactly v times in the second column of arr1. Clearly that isn't the case here.

So, starting from arr1, I want to create another array in which every element of the second column appears exactly the number of times prescribed by d. In other words, my desired outcome is:

np.array([['a1','x'],['a2','x'],['a5','z']])

I can get my desired outcome using a list comprehension:

ans = [[x1,x2] for x1,x2 in arr1 if np.count_nonzero(arr1==x2)==d[x2]]

but I was wondering whether it is possible to do this using only numpy.

This does what you want:

import numpy as np

arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

# get the actual counts of values in arr1
counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
# determine what values to keep, as their count matches the desired count
keep = [x for x in d if x in counts and d[x] == counts[x]]
# filter down the array
result = arr1[list(map(lambda x: x[1] in keep, arr1))]

Quite possibly there's a more optimal way to do this in numpy, but I don't know how big the data set you're applying this to is, or how often you need to do it, so I can't say whether looking for one is worth it.
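For reference, here is a self-contained run of the approach above on the sample data, confirming that it keeps exactly the rows whose second-column value occurs the prescribed number of times (a plain list comprehension is used for the mask instead of `map`/`lambda`, which behaves the same):

```python
import numpy as np

arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

# actual counts of the values in the second column
counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
# values whose actual count matches the desired count in d
keep = [x for x in d if x in counts and d[x] == counts[x]]
# boolean mask selecting the matching rows
result = arr1[[row[1] in keep for row in arr1]]
print(result.tolist())  # [['a1', 'x'], ['a2', 'x'], ['a5', 'z']]
```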

Edit: note that you need to scale things up to decide what counts as a good solution. Your original solution is great for toy examples and outperforms both answers there. But the numpy solution provided by @NewbieAF beats the rest handily once you scale up to more realistic workloads:

from random import randint
from timeit import timeit
import numpy as np


def original(arr1, d):
    return [[x1, x2] for x1, x2 in arr1 if np.count_nonzero(arr1 == x2) == d[x2]]


def f1(arr1, d):
    # get the actual counts of values in arr1
    counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
    # determine what values to keep, as their count matches the desired count
    keep = [x for x in d if x in counts and d[x] == counts[x]]
    # filter down the array
    return arr1[list(map(lambda x: x[1] in keep, arr1))]


def f2(arr1, d):
    # create arrays from d
    keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
    # count the unique elements in arr1[:,1]
    unqs, cts = np.unique(arr1[:,1], return_counts=True)

    # only keep track of elements that appear in arr1
    mask = np.isin(keys,unqs)
    keys, vals = keys[mask], vals[mask]

    # sort the unique values and corresponding counts according to keys
    idx1 = np.argsort(np.argsort(keys))
    idx2 = np.argsort(unqs)
    unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]

    # filter values by whether the counts match
    correct = unqs[vals==cts]

    return arr1[np.isin(arr1[:,1],correct)]


def main():
    arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
    d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

    print(timeit(lambda: original(arr1, d), number=10000))
    print(timeit(lambda: f1(arr1, d), number=10000))
    print(timeit(lambda: f2(arr1, d), number=10000))

    counts = [randint(1, 3) for _ in range(10000)]
    arr1 = np.array([['x', f'{n}'] for n in range(10000) for _ in range(counts[n])])
    d = {f'{n}': randint(1, 3) for n in range(10000)}

    print(timeit(lambda: original(arr1, d), number=10))
    print(timeit(lambda: f1(arr1, d), number=10))
    print(timeit(lambda: f2(arr1, d), number=10))

main()

Result:

0.14045359999999998
0.2402685
0.5027185999999999
46.7569239
5.893172499999999
0.08729539999999503

The numpy solution is slow on the toy example, but orders of magnitude faster on the large input. Your solution holds up fairly well, but when scaled up it loses out to the non-numpy solution, which avoids the extra numpy calls.

Consider the size of your problem. If the problem is small, pick your own solution for readability. If it is medium-sized, you might pick mine for the bump in performance. If it is large (in data size or in how often you run it), you should opt for the all-numpy solution, sacrificing readability for speed.
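The faster plain-Python variant alluded to above can be sketched with `collections.Counter` (this sketch is my own illustration, not code quoted from the thread): counting the second column once up front replaces the repeated `np.count_nonzero` scans in the original comprehension.

```python
from collections import Counter

import numpy as np

arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}

# count second-column values once, then filter in a single pass
counts = Counter(arr1[:, 1])
ans = [row.tolist() for row in arr1 if counts[row[1]] == d[row[1]]]
print(ans)  # [['a1', 'x'], ['a2', 'x'], ['a5', 'z']]
```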

After a bit of playing around with np.argsort(), I found a pure numpy solution. You just have to sort the second column of arr1 according to how the same elements are positioned in an array version of d.values().

arr1 = np.array([['a1','x'],['a2','x'],['a3','y'],['a4','y'],['a5','z']])
d = {'x':2,'z':1,'y':1,'w':2}

# create arrays from d
keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
# count the unique elements in arr1[:,1]
unqs, cts = np.unique(arr1[:,1], return_counts=True)

# only keep track of elements that appear in arr1
mask = np.isin(keys,unqs)
keys, vals = keys[mask], vals[mask]

# sort the unique values and corresponding counts according to keys
idx1 = np.argsort(np.argsort(keys))
idx2 = np.argsort(unqs)
unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]

# filter values by whether the counts match
correct = unqs[vals==cts]

# keep subarray where the counts match
ans = arr1[np.isin(arr1[:,1],correct)]

print(ans)
# [['a1' 'x']
#  ['a2' 'x']
#  ['a5' 'z']]
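The only non-obvious step above is the double argsort: np.argsort(np.argsort(a)) gives the rank of each element of a, i.e. the position it would occupy after sorting. That is what lets the sorted unqs/cts arrays be reordered back into the order of keys. A tiny illustration (the array here is just for the demo):

```python
import numpy as np

keys = np.array(['x', 'z', 'y'])
# first argsort: indices that would sort keys -> [0, 2, 1]
# second argsort: rank of each element in sorted order
ranks = np.argsort(np.argsort(keys))
print(ranks)  # [0 2 1]: 'x' is smallest, 'y' is in the middle, 'z' is largest
```

So indexing a sorted array with these ranks, as in unqs[idx2][idx1], realigns it with the original (unsorted) keys.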
