Find indexes of repeated elements in an array (Python, NumPy)

Assume I have a NumPy array of integers, such as:

[34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

I want to find the start and end indices of every stretch of the array where a value is repeated more than x times (say, 5 times) in a row. In the case above, those values are 22 and 6. The start index of the repeated 22 is 3 and the end index is 8; the same goes for the repeated 6. Is there a special tool in Python that helps with this? Otherwise, I would loop through the array index by index and compare each value with the previous one.

Regards.

Using np.diff and the method given here by @WarrenWeckesser for finding runs of zeros in an array:

import numpy as np

def zero_runs(a):  # from link
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

a = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

zero_runs(np.diff(a))
Out[87]: 
array([[ 3,  8],
       [15, 22]], dtype=int32)

This can then be filtered on the difference between the start and end of the run:

runs = zero_runs(np.diff(a))

runs[runs[:, 1]-runs[:, 0]>5]  # runs of 7 or more, to illustrate filter
Out[96]: array([[15, 22]], dtype=int32)
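The locate-then-filter steps above can also be folded into a single helper. The following is a minimal sketch, not the answer's own code: the name find_runs and the min_len parameter are assumptions made for illustration, and run boundaries are detected by comparing each element with its predecessor rather than through zero_runs. It returns inclusive (start, end) pairs.

```python
import numpy as np


def find_runs(a, min_len):
    """Return inclusive (start, end) pairs for runs of at least min_len equal values.

    Hypothetical helper for illustration; not part of the original answer.
    """
    a = np.asarray(a)
    if a.size == 0:
        return np.empty((0, 2), dtype=int)
    # True wherever an element differs from its predecessor, i.e. a new run starts
    change = np.concatenate(([True], a[1:] != a[:-1]))
    starts = np.flatnonzero(change)
    # Each run ends one position before the next run starts
    ends = np.concatenate((starts[1:], [a.size])) - 1
    keep = (ends - starts + 1) >= min_len
    return np.column_stack((starts[keep], ends[keep]))


a = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]
print(find_runs(a, 6).tolist())  # [[3, 8], [15, 22]]
print(find_runs(a, 7).tolist())  # [[15, 22]]
```

Here min_len counts elements, so min_len=6 corresponds to "repeated more than 5 times".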

There really isn't a great shortcut for this. You can do something like:

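mult = 5
for i, elem in enumerate(val_list):
    target = [elem] * mult
    # list.index() cannot search for a sublist, so compare a slice instead
    if val_list[i:i + mult] == target:
        found_at = i  # start index of the first run of at least mult copies
        break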

I leave the not-found case and the detection of longer sequences to you.
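One way to cover longer sequences is the plain loop the question already hints at: walk the list once and compare each value with the start of the current run. This is only a sketch under assumptions; the function name runs_by_loop and its min_len parameter are made up for illustration and appear nowhere in the thread.

```python
def runs_by_loop(values, min_len):
    """Return inclusive (start, end) pairs for runs of at least min_len equal values.

    Hypothetical helper for illustration; a single pass over the list.
    """
    result = []
    start = 0  # index where the current run began
    for i in range(1, len(values) + 1):
        # A run ends at the end of the list or when the value changes
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_len:
                result.append((start, i - 1))
            start = i
    return result


val_list = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]
print(runs_by_loop(val_list, 6))  # [(3, 8), (15, 22)]
```

An empty list simply yields an empty result, so no exception handling is needed in this variant.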

Here is a solution using Python's native itertools.

Code

import itertools as it


def find_ranges(lst, n=2):
    """Return ranges for `n` or more repeated values."""
    groups = ((k, tuple(g)) for k, g in it.groupby(enumerate(lst), lambda x: x[-1]))
    repeated = (idx_g for k, idx_g in groups if len(idx_g) >=n)
    return ((sub[0][0], sub[-1][0]) for sub in repeated)

lst = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]    
list(find_ranges(lst, 5))
# [(3, 8), (15, 22)]

Tests

import nose.tools as nt


def test_ranges(f):
    """Verify list results identifying ranges."""
    nt.eq_(list(f([])), [])
    nt.eq_(list(f([0, 1,1,1,1,1,1, 2], 5)), [(1, 6)])
    nt.eq_(list(f([1,1,1,1,1,1, 2,2, 1, 3, 1,1,1,1,1,1], 5)), [(0, 5), (10, 15)])
    nt.eq_(list(f([1,1, 2, 1,1,1,1, 2, 1,1,1], 3)), [(3, 6), (8, 10)])    
    nt.eq_(list(f([1,1,1,1, 2, 1,1,1, 2, 1,1,1,1], 3)), [(0, 3), (5, 7), (9, 12)])

test_ranges(find_ranges)

This example captures (index, element) pairs in lst and then groups them by element. Only groups with at least n pairs are retained. Finally, the first and last pairs of each retained group are taken, yielding its (start, end) indices.

See also this post for finding ranges of indices using itertools.groupby.
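To make the grouping step concrete, here is a small sketch of what itertools.groupby produces for (index, element) pairs. The input list and the variable names are made up for illustration.

```python
import itertools as it

lst = [7, 7, 7, 4, 7]  # small made-up input

# Group (index, element) pairs by element; only consecutive equal elements
# end up in the same group, so the trailing 7 forms a group of its own.
groups = [(key, [idx for idx, _ in grp])
          for key, grp in it.groupby(enumerate(lst), key=lambda pair: pair[1])]
print(groups)  # [(7, [0, 1, 2]), (4, [3]), (7, [4])]

# Keep groups with at least n members and report their first and last index.
n = 2
ranges = [(idxs[0], idxs[-1]) for _, idxs in groups if len(idxs) >= n]
print(ranges)  # [(0, 2)]
```

Because groupby only merges adjacent items, a value that reappears later starts a new group, which is exactly what run detection needs.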

If you're looking for value repeated n times in list L, you could do something like this:

def find_repeat(value, n, L):
    look_for = [value for _ in range(n)]
    for i in range(len(L)):
        if L[i] == value and L[i:i+n] == look_for:
            return i, i+n

Here is a relatively quick, errorless solution that also tells you how many copies were in the run. Some of this code was borrowed from KAL's solution.

# Return the start and (1-past-the-end) indices of the first instance of
# at least min_count copies of element value in container l
def find_repeat(value, min_count, l):
  for i in range(len(l)):
    count = 0
    # the bounds check avoids an IndexError when a run reaches the end of l
    while i + count < len(l) and l[i + count] == value:
      count += 1
    if count >= min_count:
      return i, i + count

I had a similar requirement. This is what I came up with, using only list comprehensions:

A=[34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

Find the unique values and return the indices of their first occurrences:

_, ind = np.unique(A,return_index=True)

np.unique sorts the array, so sort the indices to restore the original order:

ind = np.sort(ind)

ind contains the index of the first element of each repeating group, which shows up as a gap between consecutive indices. Their diff gives the number of elements in a group, as long as everything between two first occurrences is a single run of one value. Filtering with np.diff(ind)>5 gives a boolean array that is True at the positions of ind where such a group starts. The end index of each group is one less than the entry of ind that follows each True.

Create a dict with the repeating element as the key and a tuple of that group's start and end indices as the value:

rep_groups = dict((A[ind[i]], (ind[i], ind[i+1]-1)) for i,v in enumerate(np.diff(ind)>5) if v)
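Put together, the steps above might look like the sketch below. The function name unique_gap_runs and its threshold parameter are assumptions for illustration, and the extra check that every element in the candidate span equals the first one is an addition that is not in the original answer; it guards against gaps that are filled with values already seen earlier.

```python
import numpy as np


def unique_gap_runs(values, threshold=5):
    """Map each value to the inclusive (start, end) of a run longer than threshold.

    Hypothetical helper for illustration, built on np.unique as in the answer.
    """
    values = list(values)
    # First-occurrence index of every distinct value, restored to original order
    _, ind = np.unique(values, return_index=True)
    ind = np.sort(ind)
    result = {}
    for i in np.flatnonzero(np.diff(ind) > threshold):
        start, end = int(ind[i]), int(ind[i + 1]) - 1
        # Added check (an assumption beyond the answer): the span must be one run
        if all(x == values[start] for x in values[start:end + 1]):
            result[values[start]] = (start, end)
    return result


A = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]
print(unique_gap_runs(A))  # {22: (3, 8), 6: (15, 22)}
```

Since only first occurrences are considered, a run that is not followed by a new distinct value, or a run of a value that already appeared earlier, is not reported by this approach.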
