Find unique pairs in list of pairs

I have a (large) list of lists of integers, e.g.,

a = [
    [1, 2],
    [3, 6],
    [2, 1],
    [3, 5],
    [3, 6]
    ]

Most of the pairs will appear twice, where the order of the integers doesn't matter (i.e., [1, 2] is equivalent to [2, 1]). I'd now like to find the pairs that appear only once, and get a Boolean list indicating that. For the above example,

b = [False, False, False, True, False]

Since a is typically large, I'd like to avoid explicit loops. Mapping to frozensets may be advised, but I'm not sure if that's overkill.

from collections import Counter

ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]

We can use Counter to count each pair (turning each list into a frozenset so that order is ignored) and then check, for each list, whether it appears only once.
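As a self-contained check of the counting idea above, here is a minimal sketch wrapped in a function. The name `unique_mask` is made up for illustration, and it uses a sorted tuple as the key instead of a frozenset, which is an assumption on my part: for pairs the two behave the same, while a sorted tuple would also keep repeated elements apart if the inner lists were ever longer than two.

```python
from collections import Counter


def unique_mask(pairs):
    # Normalize each pair so that [2, 1] and [1, 2] produce the same key
    keys = [tuple(sorted(p)) for p in pairs]
    # Count how often each normalized pair occurs
    counts = Counter(keys)
    # A pair is unique exactly when its normalized key was counted once
    return [counts[k] == 1 for k in keys]


a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
print(unique_mask(a))  # [False, False, False, True, False]
```

This still loops in Python, but only twice over the list, so the work grows linearly with len(a).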

Here's a solution with NumPy that is 10 times faster than the suggested frozenset solution:

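import numpy

a = numpy.array(a)
a.sort(axis=1)  # put each pair in ascending order so [2, 1] matches [1, 2]
# View each row as one opaque item so that numpy.unique compares whole rows;
# ravel() keeps the inverse index 1-D regardless of the NumPy version
b = numpy.ascontiguousarray(a).view(
    numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
).ravel()
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
print(ct[inv] == 1)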

Speed comparison for different array sizes:

[Figure: runtime versus len(a) for the frozenset and numpy solutions; image not available]

The plot was created with

from collections import Counter
import numpy
import perfplot


def fs(a):
    ctr = Counter(frozenset(x) for x in a)
    b = [ctr[frozenset(x)] == 1 for x in a]
    return b


def with_numpy(a):
    a = numpy.array(a)
    a.sort(axis=1)
    # ravel() keeps the inverse index 1-D regardless of the NumPy version
    b = numpy.ascontiguousarray(a).view(
        numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
    ).ravel()
    _, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
    res = ct[inv] == 1
    return res


perfplot.save(
    "out.png",
    setup=lambda n: numpy.random.randint(0, 10, size=(n, 2)),
    kernels=[fs, with_numpy],
    labels=["frozenset", "numpy"],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
)
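The void view in the NumPy solution exists so that each row is compared as a single item. As a possible alternative, `numpy.unique` also accepts `axis=0` (in NumPy 1.13 or later) and then compares whole rows directly. The sketch below assumes that route; the function name `unique_mask_np` is hypothetical, and its speed relative to the two kernels above was not measured here.

```python
import numpy as np


def unique_mask_np(pairs):
    # Sort within each row so that [2, 1] and [1, 2] become identical rows
    arr = np.sort(np.asarray(pairs), axis=1)
    # axis=0 makes unique() treat each row as one item
    _, inv, counts = np.unique(
        arr, axis=0, return_inverse=True, return_counts=True
    )
    # reshape(-1) guards against NumPy versions that return a 2-D inverse
    return counts[inv.reshape(-1)] == 1


a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
print(unique_mask_np(a).tolist())  # [False, False, False, True, False]
```

`np.sort` returns a sorted copy, so the caller's data is left untouched.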

You could scan the list from start to end while maintaining a map from each encountered pair to the position where it first appeared. Whenever you process a pair, you check whether you've encountered it before. If so, both the first encounter's index in b and the current encounter's index must be set to False. Otherwise, you just add the current index to the map of encountered pairs and change nothing in b. b starts out as all True. To treat [1,2] and [2,1] as the same pair, I'd first sort the pair to obtain a stable representation. The code would look something like this:

def proc(a):
  b = [True] * len(a)  # Every pair is assumed unique until seen twice
  filter = {}  # Maps each sorted pair to the index of its first occurrence
  for idx, p in enumerate(a):
    # Sort the pair so that [2, 1] and [1, 2] share one key
    pp = (min(p), max(p))
    if pp in filter:
      # We've found the element once previously
      # Need to mark both it and the current value as "False"
      # If we encounter pp multiple times, we'll set the initial
      # value to False multiple times, but that's not an issue
      b[filter[pp]] = False
      b[idx] = False
    else:
      # This is the first time we encounter pp, so we just add it
      # to the filter for possible later encounters, but don't affect
      # b at all.
      filter[pp] = idx
  return b

The time complexity is O(len(a)), which is good, but the space complexity is also O(len(a)) (for filter), so this might not be so great. Depending on how flexible you are, you can use an approximate filter such as a Bloom filter, which trades exact answers for a fixed memory footprint.
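To make the Bloom filter suggestion concrete, here is a rough two-pass sketch under several assumptions of my own: the `BloomFilter` class, the `approx_unique_mask` function, the bit-array size and the SHA-256 based hashing are all illustrative choices, not part of the answer above. Because a Bloom filter cannot store the index of a first occurrence, the sketch uses two filters ("seen at least once" and "seen again") and a second pass over the list. A repeated pair is never reported as unique, but a truly unique pair may occasionally be reported as repeated when the filters are too small.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash positions in a bit array of m_bits bits."""

    def __init__(self, m_bits=1 << 16, k=3):
        self.m = m_bits
        self.k = k  # at most 8, since each position uses 4 digest bytes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest of the key's repr
        digest = hashlib.sha256(repr(key).encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def approx_unique_mask(pairs, m_bits=1 << 16, k=3):
    seen_once = BloomFilter(m_bits, k)
    seen_again = BloomFilter(m_bits, k)
    # Pass 1: record every pair, and separately record pairs met again
    for p in pairs:
        key = (min(p), max(p))
        if key in seen_once:
            seen_again.add(key)
        else:
            seen_once.add(key)
    # Pass 2: report a pair as unique only if it was never met again
    return [(min(p), max(p)) not in seen_again for p in pairs]


a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
print(approx_unique_mask(a))
```

Apart from the output list, the memory used is fixed by m_bits and does not grow with len(a); the price is the small error rate described above, which rises as more distinct pairs are added to filters of a given size.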

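a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]

def occurrences(el):
    # Count el in both orders; a pair such as [4, 4] equals its own reverse,
    # so it is counted once. Each call scans all of a: O(n^2) overall.
    rev = el[::-1]
    return a.count(el) + (a.count(rev) if rev != el else 0)

bool_res = [occurrences(el) == 1 for el in a]
result = [el for el, unique in zip(a, bool_res) if unique]
print(result)
print(bool_res)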

which gives:

[[3, 5]]
[False, False, False, True, False]

Use a dictionary for an O(n) solution.

a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]

seen = {}
boolList = []

# Iterate through a
for i in range(len(a)):

    # Assume that this element is not a duplicate
    # This True is added at the corresponding index i of boolList
    boolList.append(True)

    # Set elem to the current pair in the list
    elem = a[i]

    # Put the pair in ascending order so that [2, 1] and [1, 2] share a key
    if elem[0] <= elem[1]:
        key = (elem[0], elem[1])
    else:
        key = (elem[1], elem[0])

    # If this pair has not yet been seen, add it as a key to the dictionary
    # with a list containing its index in a as the value.
    if key not in seen:
        seen[key] = [i]
    # If this pair is a duplicate, add the new index to the list stored
    # under its key, which holds the indices of that pair in a.
    else:
        seen[key].append(i)

        # Mark this index as a duplicate
        boolList[i] = False

        # If this is the first duplicate for the pair, mark the first
        # occurrence of the pair as a duplicate as well.
        if len(seen[key]) == 2:
            boolList[seen[key][0]] = False

print(a)
print(boolList)
