简体   繁体   中英

Pythonic way of finding indexes of unique elements in two arrays

I have two sorted, numpy arrays similar to these ones:

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])

Elements never repeat in the same array. I want to figure out a way of pythonicaly figuring out a list of indexes that contain the locations in the arrays at which the same element exists.

For instance, 1 exists in x and y at index 0 . Element 2 in x doesn't exist in y , so I don't care about that item. However, 8 does exist in both arrays - in index 2 in x but index 1 in y . Similarly, 15 exists in both, in index 4 in x , but index 2 in y . So the outcome of my function would be a list that in this case returns [[0, 0], [2, 1], [4, 2]] .

So far what I'm doing is:

def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]

        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])

    return indexes

But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way of doing this?

What you are doing is O(nlogn) which is decent enough.
If you want, you can do it in O(n) by iterating on both arrays with two pointers and since they are sorted, increase the pointer for the array with smaller object.

See below:

x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x,y):
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x,y))

which gives me:

[[0, 0], [2, 1], [4, 2]]

You can use numpy.searchsorted :

def get_indexes(x, y):
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        if y[idy]==val:
            indexes.append([idx,idy])
    return indexes

One solution is to first look from x 's side to see what values are included in y by getting their indices through np.isin and np.flatnonzero , and then use the same procedure from the other side; but instead of giving x entirely, we give only the (already found) intersected elements to gain time:

indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2])

Now you can zip them to get the result:

result = list(zip(indices_in_x, indices_in_y))               # [(0, 0), (2, 1), (4, 2)]

Although, this function will search for all the occurances of x[i] in the y array, if duplicates are not allowed in y it will find x[i] exactly once.

def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]

Without Python loops

Code

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

Usage

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)

# result[0]: array([[0, 0],
                    [2, 1],
                    [4, 2]], dtype=int64)

Timing Posted Solutions

Results show that darrlg code has the fastest run time.

在此处输入图像描述

Code Adjustment

  • Each posted solution as a function.
  • Slight mod so that each solution outputs an numpy array.
  • Curve named after poster

Code

import numpy as np
import perfplot

def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000     # One more than largest value in output arrays
    
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size = n//2, replace = False) # locations to make changes
    
    
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2, )))        # assign new random values at change locations
   
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    
    return (arr1, arr2)

def get_indexes_lllrnr101(x,y):
    ' lllrnr101 answer '
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    'joostblack'
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        try:
            if y[idy]==val:
                indexes.append([idx,idy])
        except IndexError:
            continue  # ignore index errors
            
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2]
    
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup = create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None, #np.allclose,  # set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM