How to get a list of all indices of repeated elements in a numpy array

I'm trying to get the indices of all repeated elements in a numpy array, but the solution I found for the moment is REALLY inefficient for a large (>20000 elements) input array (it takes roughly 9 seconds). The idea is simple:

  1. records_array is a numpy array of timestamps ( datetime ) from which we want to extract the indices of repeated timestamps

  2. time_array is a numpy array containing all the timestamps that are repeated in records_array

  3. records is a django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of couples (pairs) formed by all possible combinations of the tagId attributes of the Records corresponding to the repeated timestamps found in records_array.

Here is the working (but inefficient) code I have for the moment:

import itertools
import numpy as np

tag_couples = []
for t in time_array:
    users_inter = np.nonzero(records_array == t)[0]  # indices of all records whose timestamp equals t
    l = [str(records[i].tagId) for i in users_inter]  # tagIds recorded at time t
    if l.count(l[0]) != len(l):  # skip time t if every record carries the same tagId
        tag_couples += [x for x in itertools.combinations(list(set(l)), 2)]  # dedupe with list(set(l)), then append all possible couples

I'm quite sure this can be optimized with Numpy, but I can't find a way to compare records_array with each element of time_array without using a for loop (they can't be compared simply with ==, since they are both arrays).
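For what it's worth, np.isin performs exactly this kind of elementwise membership test, replacing the per-timestamp == loop. A minimal sketch with hypothetical sample data standing in for the asker's arrays:

import numpy as np

records_array = np.array(['2013-01-01T10:00', '2013-01-01T10:05',
                          '2013-01-01T10:00', '2013-01-01T10:10',
                          '2013-01-01T10:05'], dtype='datetime64[m]')
time_array = np.array(['2013-01-01T10:00', '2013-01-01T10:05'],
                      dtype='datetime64[m]')

mask = np.isin(records_array, time_array)  # True wherever the timestamp is in time_array
repeated_indices = np.nonzero(mask)[0]     # -> array([0, 1, 2, 4])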

A vectorized solution with numpy, based on the magic of unique().

import numpy as np

# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)

# sorts records array so all unique elements are together 
sorted_records_array = records_array[idx_sort]

# returns the unique values, the index of the first occurrence of a value, and the count for each element
vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

# splits the indices into separate arrays
res = np.split(idx_sort, idx_start[1:])

# filter with respect to size, keeping only the items occurring more than once
vals = vals[count > 1]
res = [group for group in res if group.size > 1]
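With the test array above, this yields:

print(vals)  # [1 2 3]
print(res)   # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]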

The following code was the original answer; it required a bit more memory, using numpy broadcasting and calling unique twice:

import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = np.unique(records_array, return_inverse=True,
                                 return_counts=True)

idx_vals_repeated = np.where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
_, inverse_rows = np.unique(rows, return_index=True)
res = np.split(cols, inverse_rows[1:])

with, as expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]

  • The answer is complicated, and highly dependent upon the size of the array and the number of unique elements.
  • The following:
    • Tests arrays with 2M elements, and up to 20k unique elements.
    • Tests arrays up to 80k elements, with a max of 20k unique elements.
      • For arrays under 40k elements, the tests use up to half as many unique elements as the array size (e.g. a 10k-element array would have up to 5k unique elements).

Arrays with 2M Elements

  • np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices and np.unique.
  • The solution using pandas is the fastest solution for large arrays; a minimal sketch of the pandas approach follows below.
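For reference, a minimal sketch of the pandas approach mentioned above (my illustration, not the benchmark code):

import numpy as np
import pandas as pd

a = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
idx = pd.DataFrame(a).groupby(0).indices  # maps each value to the positions where it occurs
# -> {1: array([0, 3, 4]), 2: array([1, 8]), 3: array([2, 5, 7]), 4: array([6])}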

Arrays with up to 80k Elements

  • This is more situational, depending on the size of the array and the number of unique elements.
  • defaultdict is a fast option for arrays up to about 2400 elements, especially with a large number of unique elements.
  • For arrays larger than 40k elements with 20k unique elements, pandas is the fastest option.

%timeit

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)

for x in range(4, 20000, 100):  # number of unique elements
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    
    res1 = %timeit -r2 -n1 -q -o dd(a)
    res2 = %timeit -r2 -n1 -q -o npw(a)
    res3 = %timeit -r2 -n1 -q -o uni(a)
    res4 = %timeit -r2 -n1 -q -o daf(a)
    
    data['default_dict'].append(res1.average)
    data['np_where'].append(res2.average)
    data['np_unique'].append(res3.average)
    data['pandas'].append(res4.average)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)', title='%timeit test: 2 run 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()

Tests with 2M elements

[Plots of average time (s) vs. number of unique samples for the 2M-element tests omitted.]

Tests with up to 80k elements

[Plots of average time (s) vs. number of unique samples for the 80k-element tests omitted.]

You can also do this:

import numpy as np

a = [1, 2, 3, 1, 1, 3, 4, 3, 2]
index_sets = [np.argwhere(i == a) for i in np.unique(a)]

This will give you a list of arrays, one per unique element, holding its indices:

[array([[0],[3],[4]], dtype=int64), 
array([[1],[8]], dtype=int64), 
array([[2],[5],[7]], dtype=int64), 
array([[6]], dtype=int64)]

Added: A further change to the list comprehension can also discard single unique values and address the speed concern in the case of many unique elements that occur only once:

new_index_sets = [np.argwhere(i[0] == a) for i in np.array(np.unique(a, return_counts=True)).T if i[1] >= 2]

this gives:

[array([[0],[3],[4]], dtype=int64), 
 array([[1],[8]], dtype=int64), 
 array([[2],[5],[7]], dtype=int64)]
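If the column-vector shape returned by np.argwhere is inconvenient, a ravel() per group flattens the arrays (my addition to the answer):

flat_index_sets = [s.ravel() for s in new_index_sets]
# -> [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]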

I've found that not using np.unique, and instead using np.diff, is significantly faster and handles non-sorted initial arrays much better.

To show this, I ran @Trenton McKinney's benchmark for a couple of the trial sizes (2 million and 20k) to show that the diff solution outperforms the others. It also does not require a sorted array or sorting the array, which is a significant benefit.

Here is the function:

def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr at which values repeat.
    """

    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]
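As a quick sanity check (my addition, not the answerer's), running find_repeats on the sorted version of the running example; note that the returned indices refer to positions in the array as passed in:

import numpy as np

arr = np.sort(np.array([1, 2, 3, 1, 1, 3, 4, 3, 2]))  # [1 1 1 2 2 3 3 3 4]
print(find_repeats(arr))
# -> [0 1 2 3 4 5 6 7]: every position except the last holds a repeated value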

2 Million Elements

[Plot of the 2M-element find_repeats benchmark omitted.]

20k Elements

[Plot of the 20k-element find_repeats benchmark omitted.]

Full Test Code

import random
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from collections import defaultdict
import time


def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr at which values repeat.
    """

    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]


def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)

for x in range(4, 20000, 1000):  # number of unique elements
    print(f"{x} trial done")
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    num_runs = 2
    t0 = time.time()
    for i in range(num_runs):
        dd(a)
    res1 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        uni(a)
    res3 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        daf(a)
    res4 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        find_repeats(a)
    res5 = time.time() - t0

    data['default_dict'].append(res1 / num_runs)
    data['np_unique'].append(res3 / num_runs)
    data['pandas'].append(res4 / num_runs)
    data['np_diff'].append(res5 / num_runs)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples',
        ylabel='average time (s)', title='time.time test: 2 runs, averaged')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()

You could do something along the lines of:

1. add original index ref so [[1,0],[2,1],[3,2],[1,3],[1,4]...
2. sort on [:,0]
3. use np.where(ra[1:,0] != ra[:-1,0])
4. use the list of indexes from above to construct your final list of lists (a rough sketch follows below)
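A rough sketch of those four steps, with hypothetical names (my illustration, not the answerer's exact code):

import numpy as np

a = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# 1. pair each value with its original index: [[1, 0], [2, 1], [3, 2], ...]
ra = np.column_stack((a, np.arange(a.size)))

# 2. sort the pairs on the value column (stable sort keeps index order within groups)
ra = ra[ra[:, 0].argsort(kind='stable')]

# 3. find the positions where the sorted value changes
breaks = np.where(ra[1:, 0] != ra[:-1, 0])[0] + 1

# 4. split the original-index column at those break points
groups = np.split(ra[:, 1], breaks)
# -> [array([0, 3, 4]), array([1, 8]), array([2, 5, 7]), array([6])]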

EDIT - OK, so after my quick reply I was away for a while, and I see I've been voted down, which is fair enough, as numpy.argsort() is a much better way than my suggestion. I did vote up the numpy.unique() answer, as this is an interesting feature. However, if you use timeit you will find that

idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
res = np.split(idx_sort, idx_start)

is marginally faster than

vals, idx_start = np.unique(sorted_records_array, return_index=True)
res = np.split(idx_sort, idx_start[1:])

Further edit, following a question by @Nicolas

I'm not sure you can. It would be possible to get two arrays of indices corresponding to the break points, but you can't break different 'lines' of the array up into different-sized pieces using np.split, so

a = np.array([[4,27,42,12, 4 .. 240, 12], [3,65,23...] etc])
idx = np.argsort(a, axis=1)
sorted_a = np.diagonal(a[:, idx[:]]).T
idx_start = np.where(sorted_a[:,:-1] != sorted_a[:,1:])

# idx_start => (array([0,0,0,..1,1,..]), array([1,4,6,7..99,0,4,5]))

but that might be good enough depending on what you want to do with the information.

So I was unable to get rid of the for loop entirely, but I was able to pare its use down with the set data type and the list.count() method:

data = [1,2,3,1,4,5,2,2]
indivs = set(data)

multi_index = lambda lst, val: [i for i, x in enumerate(lst) if x == val]

if data != list(indivs):
    dupes = [multi_index(data, i) for i in indivs if data.count(i) > 1]
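For the sample data above this yields (group order can vary, since indivs is a set):

print(dupes)  # -> [[0, 3], [1, 6, 7]]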

Here you loop over your indivs set, which contains the values (no duplicates), and then loop over the full list if you find an item with a duplicate. I'm looking into a numpy alternative if this isn't fast enough for you. Generator objects might also speed this up if need be.

Edit: gg349's answer holds the numpy solution I was working on!

@gg349's solution packaged up into a function:

import numpy as np

def better_np_unique(arr):
    sort_indexes = np.argsort(arr)
    arr = np.asarray(arr)[sort_indexes]
    vals, first_indexes, inverse, counts = np.unique(arr,
        return_index=True, return_inverse=True, return_counts=True)
    indexes = np.split(sort_indexes, first_indexes[1:])
    for x in indexes:
        x.sort()
    return vals, indexes, inverse, counts

It's essentially the same as np.unique but returns all indices, not just the first indices.
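A quick usage sketch on the running example, with the expected results shown as comments:

import numpy as np

arr = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, indexes, inverse, counts = better_np_unique(arr)
print(vals)     # [1 2 3 4]
print(indexes)  # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7]), array([6])]
print(counts)   # [3 2 3 1]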

import numpy as np
from numpy.lib import recfunctions as rfn

ndtype = [('records_array', int)]  # structured dtype for the data
records_array = np.ma.array([1, 2, 1, 3, 2, 3, 3, 4, 5]).view(ndtype)  # view as a structured masked array
idxs = list(rfn.find_duplicates(records_array, key=None, ignoremask=True, return_index=True)[1])  # list of indices of repeated elements
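For this sample array, the call should report the indices of the duplicated values 1, 2, and 3, grouped in sorted-value order; roughly:

print(idxs)  # e.g. [0, 2, 1, 4, 3, 5, 6]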
