List intersection algorithm implementation only using Python lists (not sets)

I've been trying to write a list intersection algorithm in Python that takes care of repetitions. I'm a newbie to Python and programming, so forgive me if this sounds inefficient, but I couldn't come up with anything else. Here, L1 and L2 are the two lists in question, and L is their intersection.

  1. Iterate through L1
  2. Iterate through L2
  3. If element is in L1 and in L2
  4. add it to L
  5. remove it from L1 and L2
  6. iterate through L
  7. add elements back to L1 and L2
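One possible reading of the steps above, as a rough sketch: matched elements are removed from a copy of L2 rather than from the originals, so nothing has to be added back afterwards. The function name intersect_with_repeats is made up for illustration, and keeping repeated elements as many times as they occur in both lists is an assumption about what "takes care of repetitions" means.

```python
def intersect_with_repeats(l1, l2):
    """Intersection that keeps repeats, using only lists.

    An element appears in the result as many times as it occurs in
    both inputs. Neither input list is modified.
    """
    remaining = list(l2)  # work on a copy so l2 stays untouched
    result = []
    for item in l1:
        # Linear scan of what is left of l2: quadratic overall.
        for i in range(len(remaining)):
            if remaining[i] == item:
                result.append(item)
                del remaining[i]  # each element of l2 can match only once
                break
    return result
```

For example, intersect_with_repeats([1, 2, 3, 3, 4], [2, 3, 3, 3, 5]) gives [2, 3, 3].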

I'm 100% sure this is not the algorithm Mathematica uses to evaluate list intersection, but I can't really come up with anything more efficient. I don't want to modify L1 and L2 in the process, hence adding the intersection back to both lists. Any ideas? I don't want to make use of any built-in functions or data types other than lists, so no importing sets or anything like that. This is an algorithmic and implementation exercise, not a programming one, as far as I'm concerned.

Anything that iterates through L1, iterating through all of L2 each time, will take quadratic time. The only way to improve that is to avoid iterating through all of L2. (There's a similar issue with removing duplicates from L at the end.)

If you use a set for L2 (and for L), of course each "in L2" step is constant time on average, so the overall algorithm is linear. And you can always build your own hash table implementation instead of using set. But that's a lot of work.
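To give a feel for what "build your own hash table" might involve, here is a minimal sketch of a chained hash set made only of lists. The names ListHashSet and intersect_hashed are hypothetical, the bucket count is fixed up front with no resizing, and it still leans on the builtin hash(), so it only works for hashable items. Linear time is an expectation that assumes the hash spreads items evenly across buckets.

```python
class ListHashSet:
    """A tiny hash set built only from lists (separate chaining)."""

    def __init__(self, num_buckets=64):
        # Each bucket is a plain list of the items that hash to it.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, item):
        # hash() is a builtin; Python's % always lands in range here.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        """Insert item; return True if it was not already present."""
        bucket = self._bucket(item)
        if item in bucket:
            return False
        bucket.append(item)
        return True

    def __contains__(self, item):
        return item in self._bucket(item)


def intersect_hashed(l1, l2):
    """Order-preserving, deduplicated intersection of two lists."""
    in_l2 = ListHashSet(max(8, 2 * len(l2)))
    for item in l2:
        in_l2.add(item)
    seen = ListHashSet(max(8, 2 * len(l1)))
    result = []
    for item in l1:
        # seen.add() is True only the first time an item shows up.
        if item in in_l2 and seen.add(item):
            result.append(item)
    return result
```

Sizing the bucket list at roughly twice the input length keeps the chains short, which is what makes each lookup cheap on average.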

With a binary search tree, or even just a sorted list and a binary_find function, you can do it in O(N log N). And that binary_find is much easier to write yourself. So:

S2 = sorted(L2)
L = [element for element in L1 if binary_find(element, S2)]
S = remove_adjacent(sorted(L))

Or, even more simply, sort L1 too, and then you don't need remove_adjacent:

S1, S2 = sorted(L1), sorted(L2)
L = []
for element in S1:
    if binary_find(element, S2) and (not L or L[-1] != element):
        L.append(element)

Either way, this is O(N log N), where N is the length of the longer list. By comparison, the original is O(N^2), and the other answers are O(N^3). Of course it's a bit more complicated, but it's still pretty easy to understand.

You need to write binary_find (and, if applicable, remove_adjacent) yourself, because I assume you don't want to use anything from the stdlib if you don't even want to use extra builtins. But that's really easy. For example:

def binary_find(element, seq):
    low, high = 0, len(seq)
    while low != high:
        mid = (low + high) // 2
        if seq[mid] == element:
            return True
        elif seq[mid] < element:
            low = mid+1
        else:
            high = mid
    return False

def remove_adjacent(seq):
    ret = []
    last = object()
    for element in seq:
        if element != last:
            ret.append(element)
        last = element
    return ret

If you don't even want to use sorted or list.sort, you can write your own sort pretty easily too.
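As one way to do that, here is a sketch of a plain merge sort that uses nothing but lists. The name merge_sort is just an illustrative choice; any O(N log N) sort would keep the overall bound.

```python
def merge_sort(seq):
    """Return a new sorted list; the input is left untouched."""
    if len(seq) <= 1:
        return list(seq)
    mid = len(seq) // 2
    left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            # Taking from the left on ties keeps the sort stable.
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

With this in place, merge_sort(L1) and merge_sort(L2) could stand in for the sorted(...) calls.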

Here is a faster solution:

def intersect_sorted(a1, a2):
  """Yields the intersection of sorted lists a1 and a2, without deduplication.

  Execution time is O(min(lo + hi, lo * log(hi))), where lo == min(len(a1),
  len(a2)) and hi == max(len(a1), len(a2)). It can be faster depending on
  the data.
  """
  import bisect, math
  s1, s2 = len(a1), len(a2)
  i1 = i2 = 0
  if s1 and s1 + s2 > min(s1, s2) * math.log(max(s1, s2)) * 1.4426950408889634:
    bi = bisect.bisect_left
    while i1 < s1 and i2 < s2:
      v1, v2 = a1[i1], a2[i2]
      if v1 == v2:
        yield v1
        i1 += 1
        i2 += 1
      elif v1 < v2:
        i1 = bi(a1, v2, i1)
      else:
        i2 = bi(a2, v1, i2)
  else:  # The linear solution is faster.
    while i1 < s1 and i2 < s2:
      v1, v2 = a1[i1], a2[i2]
      if v1 == v2:
        yield v1
        i1 += 1
        i2 += 1
      elif v1 < v2:
        i1 += 1
      else:
        i2 += 1

It runs in O(min(n + m, n * log(m))) time, where n is the length of the shorter list and m is the length of the longer one. It iterates over both lists at the same time, skipping as many elements at the front of each list as possible.
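For the list-only setting of the question, the linear walk can also be written without bisect or generators. The sketch below is a simplified variant, not the code above: it assumes both inputs are already sorted, returns a list, and drops duplicates as it goes. The name intersect_sorted_unique is made up for illustration.

```python
def intersect_sorted_unique(a1, a2):
    """Deduplicated intersection of two sorted lists in O(n + m)."""
    result = []
    i1 = i2 = 0
    while i1 < len(a1) and i2 < len(a2):
        v1, v2 = a1[i1], a2[i2]
        if v1 < v2:
            i1 += 1  # v1 is too small to appear in the rest of a2
        elif v2 < v1:
            i2 += 1  # v2 is too small to appear in the rest of a1
        else:
            # Equal values: record once, then advance both lists.
            if not result or result[-1] != v1:
                result.append(v1)
            i1 += 1
            i2 += 1
    return result
```

For example, intersect_sorted_unique([1, 2, 3, 3, 4], [2, 3, 3, 4, 4, 5]) gives [2, 3, 4].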

An analysis is available here: http://ptspts.blogspot.ch/2015/11/how-to-compute-intersection-of-two.html

How about:

  1. Iterate through L1
  2. Iterate through L2
  3. If (in L1 and L2) and not in L -> add to L

Not particularly efficient, but in code it would look something like this (with repetitions to make the point):

>>> L1 = [1,2,3,3,4]
>>> L2 = [2,3,4,4,5]
>>> L = list()
>>> for v1 in L1:
...     for v2 in L2:
...         if v1 == v2 and v1 not in L:
...             L.append(v1)
...
>>> L
[2, 3, 4]

You avoid deleting from L1 and L2 simply by checking whether the element is already in L and adding it only if it is not. Then it doesn't matter if there are repetitions in L1 and L2.

EDIT: I read the title wrong, and skimmed over the builtins part. I'm going to leave it here anyway, since it might help someone else.

You can achieve this using the set type.

>>> a = [1,2,3,4]
>>> b = [3,4,5,6]
>>> c = list(set(a) & set(b))
>>> c
[3, 4]

  1. Make a temporary list.
  2. Iterate through one of the two lists. It doesn't matter which one.
  3. For every element, check whether that element exists in the other list (if element in list2) and isn't already in your temporary list (same syntax).
  4. If both conditions are true, append it to your temporary list.

I feel bad for posting the solution, but it's honestly more readable than my text:

def intersection(l1, l2):
    temp = []

    for item in l1:
        if item in l2 and item not in temp:
            temp.append(item)

    return temp

A Pythonic and efficient way to compute the intersection of two lists, preserving order and eliminating duplicates, is the following:

L1 = [1,2,3,3,4,4,4,5,6]
L2 = [2,4,6]
aux = set()
L = [x for x in L1 if x in L2 and not (x in aux or aux.add(x)) ]

The solution uses the set "aux" to store elements already added to the resulting list.
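The "x in aux or aux.add(x)" trick works because set.add returns None, so the whole expression is falsy exactly the first time x is seen. As a sketch of the same logic spelled out as an explicit loop (the function name ordered_unique_intersection is made up for illustration):

```python
def ordered_unique_intersection(l1, l2):
    """Loop form of the one-liner: keeps l1's order, drops duplicates."""
    aux = set()  # elements already placed in the result
    result = []
    for x in l1:
        if x not in l2:  # linear scan, since l2 is a list
            continue
        if x in aux:     # already emitted once, skip the repeat
            continue
        aux.add(x)       # returns None, which the one-liner relies on
        result.append(x)
    return result
```

For the lists above, ordered_unique_intersection(L1, L2) gives [2, 4, 6].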

Note that you don't need to "import" sets, because they are a built-in data type in Python. But if you insist on not using sets, you can opt for this less efficient version that uses a list instead:

L1 = [1,2,3,3,4,4,4,5,6]
L2 = [2,4,6]
aux = []
L = [x for x in L1 if x in L2 and not (x in aux or aux.append(x)) ]
