简体   繁体   中英

Reordering and slicing a numpy array using another array

Given two arrays, arr1 and arr2 , I want to extract from arr1 , the first cap elements in the order it appears in arr2 . So from the example below:

arr1 = np.array([3, 7, 1, 10, 2])
arr2 = np.array([0, 3, 1, 5, 2, 7, 6, 4])
cap = 3

From arr1 , 3,7,1,2 appear in arr2 ordered as 3,1,2,7 . I only want the first 3, so my desired outcome is array([3, 1, 2]) .

I can get my desired outcome as follows:

ans = np.array([el for el in arr2 if el in arr1][:cap])

but this solution iterates over the entire arr2 when I only want the first 3 occurrences. Is there a more efficient solution?

Here's a straightforward iterative approach that stops after the cap has been reached:

lst=[]
for el in arr2:
    if el in arr1:
        lst.append(el)
        cap -=1
    if cap <1:
        break
ans = np.array(lst)

Here is an attempt that doesn't use any iteration, and switches the membership checking to np.isin() :

>>> arr2[np.isin(arr2, arr1[np.isin(arr1, arr2)])][:cap]
array([3, 1, 2])

Since this is arguably harder to read than your list comprehension, here are the steps broken down:

# find elements of arr1 in arr2
subset_arr1 = arr1[np.isin(arr1, arr2)] # array([3, 7, 1, 2])

# find the elements of arr2 in subset1, this gets the desired order
subset_arr2 = arr2[np.isin(arr2, subset_arr1)] # array([3, 1, 2, 7])

# take the first cap elements
ans = subset_arr2[:cap] # array([3, 1, 2])

Timing

My original testing indicated my np.isin() approach to be the fastest. However, Georgy since updated their answer (the for loop) with a bug fix. The new testing then indicated Georgy's answer to be fastest for all tested data sizes :

在此处输入图片说明

"np.isin" is my answer, "listcomp" is your original solution, "forloop" is the other existing answer . In the above, the size of N (x-axis) is the length of arr1 , arr2 is double that length, and the cap is fixed at 5 (I thought this seemed reasonable, but see below for tweaks).

It seems the ability to stop early by the for loop can be very beneficial! In my answer, the code has to check all elements of arr1 for membership in arr2 ( subset_arr1 above), and then all elements of arr2 for membership in that subset. Georgy's loop OTOH only has to check for membership of some of the elements of arr2 in arr1 . So even though my looping should be faster, there can be much more of it!

The only parameter I tried fiddling with was the cap , which shows how smaller cap sizes are better for the for loop (smaller cap -> less iteration). The above was cap == 5 ; with cap == 100 :

在此处输入图片说明

And cap == 1000 :

在此处输入图片说明

I reckon the other parameters could influence things as well (eg how likely it is for the is-in operation to return true AKA the range of numerical values, or the relative sizes of the two arrays). I have made the perfplot code below more parameterized if anyone wants to test - I'm tired of thinking about complexity 😜:

import numpy as np
import perfplot

cap = 5
minnumber = 1
maxnumber = 10
factor1 = 1 # how size of arr1 relates to n
factor2 = 2 # how size of arr2 relates to n
low_n = 3
high_n = 18

def make_data(n):
    arr1 = np.random.randint(minnumber, maxnumber, n*factor1)
    arr2 = np.random.randint(minnumber, maxnumber, n*factor2)
    return arr1, arr2

def listcomp(arr1, arr2):
    return np.array([el for el in arr2 if el in arr1][:cap])

def npisin(arr1, arr2):
    return arr2[np.isin(arr2, arr1[np.isin(arr1, arr2)])][:cap]

def forloop(arr1, arr2):
    capcopy = cap
    lst=[]
    for el in arr2:
        if el in arr1:
            lst.append(el)
            capcopy -=1
        if capcopy <1:
            break
    return np.array(lst)

perfplot.show(
    setup=lambda n : make_data(n),
    kernels = [
        lambda x: listcomp(*x),
        lambda x: npisin(*x),
        lambda x: forloop(*x),
        ],
    labels=['listcomp', 'np.isin', 'forloop'],
    n_range=[2 ** k for k in range(low_n, high_n)])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM