How to get a list of all indices of repeated elements in a numpy array
I'm trying to get the indices of all repeated elements in a numpy array, but the solution I have at the moment is really inefficient for a large (>20000 elements) input array (it takes roughly 9 seconds). The idea is simple:
records_array is a numpy array of timestamps (datetime) from which we want to extract the indices of repeated timestamps
time_array is a numpy array containing all the timestamps that are repeated in records_array
records is a Django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of the tagId attributes of the Record objects corresponding to the repeated timestamps found in records_array.
Here is the working (but inefficient) code I have for the moment:
import itertools
import numpy as np

tag_couples = []
for t in time_array:
    # Indices of all records in records_array whose timestamp equals t
    users_inter = np.nonzero(records_array == t)[0]
    # Temporary list containing all tagIds recorded at time t
    l = [str(records[i].tagId) for i in users_inter]
    # Skip timestamps where every record carries the same tag
    if l.count(l[0]) != len(l):
        # Remove duplicate tags with set(l) and append all possible couple combinations
        tag_couples += list(itertools.combinations(set(l), 2))
I'm quite sure this can be optimized by using numpy, but I can't find a way to compare records_array with each element of time_array without using a for loop (they can't be compared with a plain ==, since they are both arrays).
A vectorized solution with numpy, relying on the magic of unique().
import numpy as np
# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)
# sorts records array so all unique elements are together
sorted_records_array = records_array[idx_sort]
# returns the unique values, the index of the first occurrence of a value, and the count for each element
vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
# splits the indices into separate arrays
res = np.split(idx_sort, idx_start[1:])
# filter them with respect to their size, keeping only items occurring more than once
vals = vals[count > 1]
# filter() is lazy in Python 3, so wrap it in list() to get the arrays themselves
res = list(filter(lambda x: x.size > 1, res))
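To connect this back to the question, here is a minimal sketch of how the grouped indices could drive the tag-couple construction. The arrays `timestamps` and `tag_ids` are hypothetical stand-ins for the question's `records_array` and the `tagId` values pulled out of the QuerySet; they are not part of the original answer.

```python
import itertools

import numpy as np

# Hypothetical stand-ins: one timestamp and one tag id per record.
timestamps = np.array([10, 20, 10, 30, 20, 10])
tag_ids = np.array(["a", "b", "c", "d", "b", "a"])

# Group record indices by timestamp, following the same steps as above.
idx_sort = np.argsort(timestamps, kind="stable")
sorted_ts = timestamps[idx_sort]
vals, idx_start, count = np.unique(sorted_ts, return_index=True, return_counts=True)
groups = np.split(idx_sort, idx_start[1:])

tag_couples = []
for idx, n in zip(groups, count):
    if n < 2:
        continue  # this timestamp is not repeated
    # Distinct tags seen at this timestamp; a single distinct tag yields no couples.
    distinct_tags = sorted(set(tag_ids[idx].tolist()))
    tag_couples.extend(itertools.combinations(distinct_tags, 2))

print(tag_couples)  # [('a', 'c')]
```

The remaining Python loop runs once per distinct timestamp and never rescans the whole array, which is where the original loop spent its time.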
The following code was the original answer. It requires a bit more memory, since it uses numpy broadcasting and calls unique twice:
from numpy import array, newaxis, split, unique, where

records_array = array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = unique(records_array, return_inverse=True,
                              return_counts=True)
idx_vals_repeated = where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]
rows, cols = where(inverse == idx_vals_repeated[:, newaxis])
_, inverse_rows = unique(rows, return_index=True)
res = split(cols, inverse_rows[1:])
with, as expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]
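The extra memory comes from the broadcast comparison. The short sketch below (the variable name `mask` is introduced here for illustration) shows the shape of the intermediate boolean array: one row per repeated value and one column per input element.

```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = np.unique(records_array, return_inverse=True,
                                 return_counts=True)
idx_vals_repeated = np.where(count > 1)[0]

# A (3, 1) column compared with a (9,) row broadcasts to a (3, 9) boolean mask.
mask = inverse == idx_vals_repeated[:, np.newaxis]
print(mask.shape)  # (3, 9)

# Each row marks the positions of one repeated value.
print(np.flatnonzero(mask[0]))  # [0 3 4]
```

With k repeated values and n elements the mask holds k * n booleans, so this version is likely to scale worse than the argsort-based one when both numbers are large.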
np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices and np.unique.
The solution using pandas is the fastest for large arrays.
defaultdict is a fast option for arrays of up to about 2400 elements, especially with a large number of unique elements.
%timeit benchmark (requires IPython or Jupyter for the %timeit magic):
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices

def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}

def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))

def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices

data = defaultdict(list)
for x in range(4, 20000, 100):  # number of unique elements
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    res1 = %timeit -r2 -n1 -q -o dd(a)
    res2 = %timeit -r2 -n1 -q -o npw(a)
    res3 = %timeit -r2 -n1 -q -o uni(a)
    res4 = %timeit -r2 -n1 -q -o daf(a)
    data['default_dict'].append(res1.average)
    data['np_where'].append(res2.average)
    data['np_unique'].append(res3.average)
    data['pandas'].append(res4.average)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)
df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)', title='%timeit test: 2 run 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
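The benchmark above depends on the IPython %timeit magic. As a rough sketch of how a similar comparison might be run in a plain Python script, the standard-library timeit module can be used instead. The array size, the value range and the `timings` dictionary below are assumptions chosen to keep the run short, so the numbers will not match the figures quoted above.

```python
import timeit
from collections import defaultdict

import numpy as np


def dd(l):
    # default_dict approach
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def uni(records_array):
    # np_unique approach
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start = np.unique(sorted_records_array, return_index=True)
    return dict(zip(vals, np.split(idx_sort, idx_start[1:])))


# Assumed test data: 20,000 integers drawn from 100 distinct values.
rng = np.random.default_rng(365)
a = rng.integers(0, 100, size=20_000)

timings = {}
for name, func in [("default_dict", dd), ("np_unique", uni)]:
    # Best of 2 repeats, 1 call each, mirroring "-r2 -n1" above.
    timings[name] = min(timeit.repeat(lambda: func(a), number=1, repeat=2))

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```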
You can also do this:
a = [1,2,3,1,1,3,4,3,2]
index_sets = [np.argwhere(i==a) for i in np.unique(a)]
This gives you a list of arrays holding the indices of each unique element:
[array([[0],[3],[4]], dtype=int64),
array([[1],[8]], dtype=int64),
array([[2],[5],[7]], dtype=int64),
array([[6]], dtype=int64)]
Added: a further change to the list comprehension can also discard values that occur only once, which addresses the speed concern when there are many unique single-occurrence elements:
new_index_sets = [np.argwhere(i[0]== a) for i in np.array(np.unique(a, return_counts=True)).T if i[1]>=2]
This gives:
[array([[0],[3],[4]], dtype=int64),
array([[1],[8]], dtype=int64),
array([[2],[5],[7]], dtype=int64)]
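np.argwhere returns one row per match, which is why each result above is a column of single-element rows. If flat index arrays are preferred, a possible variant (an assumption about the desired output shape, not part of the original answer) uses np.flatnonzero and filters on the counts directly:

```python
import numpy as np

a = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, counts = np.unique(a, return_counts=True)

# Keep only values seen at least twice, then collect their positions as 1-D arrays.
index_sets = [np.flatnonzero(a == v) for v in vals[counts >= 2]]
print(index_sets)  # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]
```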
I've found that using np.diff instead of np.unique is significantly faster.
To show this, I ran @Trenton McKinney's benchmark for a couple of the trial sizes (2 million and 20k); the diff solution beats the others. It also skips sorting the array, which is a significant part of the saving. Keep in mind that np.diff compares only neighbouring elements, so repeats are detected where equal values sit next to each other.
Here is the function:
def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr which are the values which
            repeat.
    """
    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]
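Since np.diff compares each element only with its right-hand neighbour, it may help to see what the function returns for adjacent and for scattered repeats. The sketch below repeats the function so it runs on its own; the argsort wrapper at the end is an assumption about how scattered repeats could be covered, not part of the original answer.

```python
import numpy as np


def find_repeats(arr):
    # Same logic as the function above.
    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]


# Repeats that sit next to each other are all found.
adjacent = np.array([1, 1, 2, 3, 3, 3, 4])
print(find_repeats(adjacent))  # [0 1 3 4 5]

# Scattered repeats: only the adjacent pair at positions 3 and 4 is reported.
scattered = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
print(find_repeats(scattered))  # [3 4]

# Possible workaround: sort first, then map the hits back to original positions.
order = np.argsort(scattered, kind="stable")
all_repeats = np.sort(order[find_repeats(scattered[order])])
print(all_repeats)  # [0 1 2 3 4 5 7 8]
```

The workaround reintroduces the sort, so it gives up part of the speed advantage.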
import random
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from collections import defaultdict
import time
def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr which are the values which
            repeat.
    """
    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]

def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices

def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}

def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))

def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices

data = defaultdict(list)
for x in range(4, 20000, 1000):  # number of unique elements
    print(f"{x} trial done")
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    num_runs = 2
    t0 = time.time()
    for i in range(num_runs):
        dd(a)
    res1 = time.time() - t0
    t0 = time.time()
    for i in range(num_runs):
        uni(a)
    res3 = time.time() - t0
    t0 = time.time()
    for i in range(num_runs):
        daf(a)
    res4 = time.time() - t0
    t0 = time.time()
    for i in range(num_runs):
        find_repeats(a)
    res5 = time.time() - t0
    data['default_dict'].append(res1 / num_runs)
    data['np_unique'].append(res3 / num_runs)
    data['pandas'].append(res4 / num_runs)
    data['np_diff'].append(res5 / num_runs)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)
df.plot(figsize=(12, 5), xlabel='unique samples',
        ylabel='average time (s)', title='timing test: average of 2 runs')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
You could do something along the lines of:
1. add original index ref so [[1,0],[2,1],[3,2],[1,3],[1,4]...
2. sort on [:,0]
3. use np.where(ra[1:,0] != ra[:-1,0])
4. use the list of indexes from above to construct your final list of lists
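The four steps above can be sketched as follows. The name `ra` comes from step 3; `breaks`, `groups` and the use of np.column_stack are assumptions about one reasonable way to fill in the details.

```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# 1. Pair each value with its original index: [[1, 0], [2, 1], [3, 2], ...]
ra = np.column_stack((records_array, np.arange(records_array.size)))

# 2. Sort the rows on the value column (stable, so equal values keep index order).
ra = ra[np.argsort(ra[:, 0], kind="stable")]

# 3. Positions where the sorted value changes mark the group boundaries.
breaks = np.where(ra[1:, 0] != ra[:-1, 0])[0] + 1

# 4. Split the index column at those boundaries and keep groups with repeats.
groups = np.split(ra[:, 1], breaks)
res = [g.tolist() for g in groups if g.size > 1]
print(res)  # [[0, 3, 4], [1, 8], [2, 5, 7]]
```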
EDIT - OK, so after my quick reply I was away for a while, and I see I've been voted down, which is fair enough, as numpy.argsort() is a much better way than my suggestion. I did vote up the numpy.unique() answer, as this is an interesting feature. However, if you use timeit you will find that
idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
res = np.split(idx_sort, idx_start)
is marginally faster than
vals, idx_start = np.unique(sorted_records_array, return_index=True)
res = np.split(idx_sort, idx_start[1:])
Further edit, following the question by @Nicolas
I'm not sure you can. It would be possible to get two arrays of indices corresponding to the break points, but you can't break different 'lines' of the array up into different-sized pieces using np.split, so
a = np.array([[4,27,42,12, 4 .. 240, 12], [3,65,23...] etc])
idx = np.argsort(a, axis=1)
sorted_a = np.diagonal(a[:, idx[:]]).T
idx_start = np.where(sorted_a[:,:-1] != sorted_a[:,1:])
# idx_start => (array([0,0,0,..1,1,..]), array([1,4,6,7..99,0,4,5]))
but that might be good enough, depending on what you want to do with the information.
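For a concrete picture of the 2-D case, here is a small sketch with made-up values. It swaps in np.take_along_axis for the diagonal trick to build the row-wise sorted array, and it falls back to a short Python loop over rows for the final split, since group sizes differ from row to row. Both choices are assumptions, not the original author's code.

```python
import numpy as np

# Illustrative values only; the original example elides its data.
a = np.array([[4, 27, 4, 12, 27],
              [3, 65, 23, 3, 3]])

idx = np.argsort(a, axis=1, kind="stable")
sorted_a = np.take_along_axis(a, idx, axis=1)

# Row and column positions where a sorted row changes value.
rows, cols = np.where(sorted_a[:, :-1] != sorted_a[:, 1:])

# Split each row's sorted indices at that row's own break points.
res = [np.split(idx[r], cols[rows == r] + 1) for r in range(a.shape[0])]

for r, groups in enumerate(res):
    print(r, [g.tolist() for g in groups])
# 0 [[0, 2], [3], [1, 4]]
# 1 [[0, 3, 4], [2], [1]]
```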
So I was unable to get rid of the for loop, but I was able to pare it down so that the loop is used only marginally, with the set data type and the list.count() method:
data = [1,2,3,1,4,5,2,2]
indivs = set(data)
multi_index = lambda lst, val: [i for i, x in enumerate(lst) if x == val]
if data != list(indivs):
    dupes = [multi_index(data, i) for i in indivs if data.count(i) > 1]
Here you loop over your indivs set, which contains the values (no duplicates), and then loop over the full list whenever you find an item with a duplicate. I am looking into a numpy alternative if this isn't fast enough for you. Generator objects might also speed this up if need be.
Edit: gg349's answer holds the numpy solution I was working on!
@gg349's solution packaged up into a function:
def better_np_unique(arr):
    sort_indexes = np.argsort(arr)
    arr = np.asarray(arr)[sort_indexes]
    vals, first_indexes, inverse, counts = np.unique(arr,
        return_index=True, return_inverse=True, return_counts=True)
    indexes = np.split(sort_indexes, first_indexes[1:])
    for x in indexes:
        x.sort()
    return vals, indexes, inverse, counts
It's essentially the same as np.unique but returns all indices, not just the first indices.
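A short usage sketch, with the function repeated so the snippet runs on its own. The dictionary `repeated` is an illustrative way to keep only values that occur more than once; it is not part of the original answer.

```python
import numpy as np


def better_np_unique(arr):
    sort_indexes = np.argsort(arr)
    arr = np.asarray(arr)[sort_indexes]
    vals, first_indexes, inverse, counts = np.unique(
        arr, return_index=True, return_inverse=True, return_counts=True)
    indexes = np.split(sort_indexes, first_indexes[1:])
    for x in indexes:
        x.sort()  # report each group's indices in ascending order
    return vals, indexes, inverse, counts


vals, indexes, inverse, counts = better_np_unique([1, 2, 3, 1, 1, 3, 4, 3, 2])

# Map each repeated value to the positions where it occurs.
repeated = {int(v): idx.tolist() for v, idx, c in zip(vals, indexes, counts) if c > 1}
print(repeated)  # {1: [0, 3, 4], 2: [1, 8], 3: [2, 5, 7]}
```

One thing to watch: inverse is computed from the sorted copy of the input, so vals[inverse] reconstructs the sorted array, not the original ordering.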
import numpy as np
from numpy.lib import recfunctions as rfn
ndtype = [('records_array', int)] # Check the data type
records_array = np.ma.array([1, 2, 1, 3, 2, 3, 3, 4, 5]).view(ndtype) # Structured array
idxs = list(rfn.find_duplicates(records_array, key=None, ignoremask=True, return_index=True)[1]) # List of indices of repeated elements