Pythonic way to merge two overlapping lists, preserving order

Alright, so I have two lists, as such:

  • They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
  • There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7].
  • The lists are not necessarily ordered or unique, for example, [9, 1, 1, 8, 7], [8, 6, 7].

I want to merge the lists such that the existing order is preserved, the merge happens at the last possible valid position, and no data is lost. Additionally, the first list might be huge. My current working code is as follows:

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition

What I would like to know is: is there a more efficient way of doing this? It works, but I'm slightly leery of it, because it can run into large runtimes in my application, where I'm merging large lists of strings.

EDIT: I'd expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be [1,3,9,8,3,4,5,7,8]. For clarity, the overlapping portion is 3,4,5.

[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7].

You can try the following:

>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]

>>> matches = (i for i in xrange(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]

The idea is to compare the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, from the length of b down to 1 (xrange(len(b), 0, -1)), because we want to match as much as possible. We take the first such i by using next, and if there is none we fall back to zero (next(..., 0)). Once we have i, we append to a the elements of b from index i onward.
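The same idea can be wrapped in a function. The following is a minimal sketch for Python 3 (range instead of xrange); the name merge_longest and the cap on the search length are assumptions added for illustration, not part of the answer above. The cap only skips values of i that could never match, since an overlap cannot be longer than either list.

```python
def merge_longest(a, b):
    """Append b to a, dropping the longest prefix of b that is also a suffix of a."""
    # An overlap cannot be longer than the shorter list, so cap the search there.
    limit = min(len(a), len(b))
    # Candidate overlap lengths, longest first; the generator is lazy,
    # so slicing stops as soon as the first match is found.
    matches = (i for i in range(limit, 0, -1) if b[:i] == a[-i:])
    i = next(matches, 0)  # 0 means "no overlap": append all of b
    return a + b[i:]


print(merge_longest([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
print(merge_longest([9, 1, 1, 8, 7], [8, 6, 7]))
```

Note that this picks the longest overlap, which may differ from the question's code when several overlap lengths are valid.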

There are a couple of easy optimizations that are possible.

  1. You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]

  2. If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:

This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    first = addition[0]
    n = max(len(master) - len(addition), 1)  # (1)
    while 1:
        try:
            n = master.index(first, n)       # (2) n is a start index in master
        except ValueError:
            return master + addition

        # n is an index, not a length: the candidate overlap is master[n:]
        size = len(master) - n
        if master[n:] == addition[:size]:
            return master + addition[size:]
        n += 1

One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(1, min(len(addition), len(master)) + 1) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.
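As a sketch of that bounded loop (Python 3; the name merge_bounded is made up for illustration): the range starts at 1 because an overlap of length 0 is the fallback case, and it includes the shorter length itself so that a full overlap with the shorter list is still considered. That last point is an assumption that goes slightly beyond the question's original loop, which stops one short of len(master).

```python
def merge_bounded(master, addition):
    """Question's approach, but never testing overlaps longer than the shorter list."""
    longest_possible = min(len(master), len(addition))
    # Shortest overlap first, as in the question's code.
    for n in range(1, longest_possible + 1):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition


print(merge_bounded([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```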

Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:

def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition

The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them whether the elements before it also line up.

The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.

It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.
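The two adjustments just described (a length condition on the candidates, and reversing them for a greedy match) might look like the sketch below. It is written for Python 3; the name merge_greedy and the empty-master guard are assumptions added here. The candidates are collected into a list because a generator cannot be reversed directly.

```python
def merge_greedy(master, addition):
    """Greedy variant: prefer the longest overlap, without slicing for comparison."""
    if not master:
        return list(addition)
    last = master[-1]
    # Possible overlap lengths: positions in addition holding master's last
    # element, skipping any length that would reach past the start of master.
    candidates = [i + 1 for i, e in enumerate(addition)
                  if e == last and i + 1 <= len(master)]
    for overlap_len in reversed(candidates):  # longest candidate first
        if all(master[-overlap_len + i] == addition[i]
               for i in range(overlap_len)):
            return master + addition[overlap_len:]
    return master + addition


print(merge_greedy([1, 2, 2], [2, 2, 3]))
```

Iterating over candidates in their original order instead of reversed(candidates) gives back the non-greedy behaviour while keeping the length guard.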

I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return x_longest - longest, x_longest

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print master[:s] + addition

master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print master[:s] + addition
else:
    print master + addition

[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]
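Since the common part must sit at the end of master and at the start of addition, a different technique from the dynamic programming table above can be used: the failure function from Knuth-Morris-Pratt string matching. The sketch below (Python 3) is not from any of the answers; the names prefix_function, longest_overlap and merge_kmp are made up for illustration. It treats addition as the pattern, scans only the tail of master, and takes the match length reached at the end as the longest overlap, in time linear in len(addition).

```python
def prefix_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the KMP failure function)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def longest_overlap(master, addition):
    """Length of the longest suffix of master that is a prefix of addition."""
    if not addition:
        return 0
    fail = prefix_function(addition)
    k = 0  # number of leading items of addition currently matched
    # Only the last len(addition) items of master can take part in an overlap.
    start = max(len(master) - len(addition), 0)
    for idx in range(start, len(master)):
        item = master[idx]
        # On a mismatch (or a full match with items left), fall back to the
        # next shorter prefix of addition that still matches.
        while k and (k == len(addition) or item != addition[k]):
            k = fail[k - 1]
        if item == addition[k]:
            k += 1
    return k


def merge_kmp(master, addition):
    return master + addition[longest_overlap(master, addition):]


print(merge_kmp([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
print(merge_kmp([9, 1, 1, 8, 7], [8, 6, 7]))
```

This returns the longest overlap. If the shortest non-empty overlap is wanted instead, the other valid lengths can be found by following fail from the returned value (k, fail[k - 1], and so on) until it reaches 0.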

This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.

def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]

We can test with your test data by doing:

test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)

This tries every slice length that could produce an overlap, longest first, and stops at the first match. If nothing matches, the loop ends with i equal to 0. Either way, it returns all of a plus everything in b from index i onward (in the overlap case, that is the non-overlapping portion of b; in the non-overlap case, it is all of b).

Note that we can make a couple of optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any overlap. You could add a quick check at the beginning that might short-circuit that worst case:

def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...

In fact, you could take that solution one step further and probably make your algorithm much faster:

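def merge(a, b):
    start = 0  # where to resume the search in b after a failed candidate
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no (further) occurrence of a[-1] in b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]  # append only the non-overlapping part of b
        start = idx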

However, this might not find the longest overlap in cases like:

a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]

You could fix that by searching b from the right instead of the left, to match the longest slice instead of the shortest (note that Python lists have no rindex method, so this takes a manual reverse scan), but I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also memoize the results and return the shortest result, which might be a better idea.

def merge(a, b):
    results = []
    start = 0  # where to resume the search in b
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no more occurrences of a[-1] in b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
        start = idx
    return min(results, key=len)

This should work, since merging the longest overlap produces the shortest result in all cases.
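The right-to-left search mentioned above can be sketched without building every candidate result. This is an illustrative Python 3 sketch; the name merge_longest_first and the empty-list guard are assumptions. It walks the possible end positions in b from right to left and only slices when b holds a's last element at that position.

```python
def merge_longest_first(a, b):
    """Try overlap lengths from longest to shortest, slicing only on promising ones."""
    if not a:
        return list(b)
    last = a[-1]
    # An overlap of length `end` must finish on b[end - 1] == a[-1].
    for end in range(min(len(a), len(b)), 0, -1):
        if b[end - 1] == last and a[-end:] == b[:end]:
            return a + b[end:]
    return a + b


print(merge_longest_first([1, 2, 3, 4, 1, 2, 3, 4], [3, 4, 1, 2, 3, 4, 5, 6]))
```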

First of all, and for clarity, you can replace your while loop with a for loop:

def merge(master, addition):
    for n in xrange(1, len(master)):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition

Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:

def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition

So instead of comparing slices like this:

1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579

you are only doing these comparisons:

1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579

How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.

You could also generate a list of indices for addition so that its own slices always end with master's last element, further restricting the number of comparisons.
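Combining both restrictions could look like the following sketch (Python 3; the name merge_two_filters is made up for illustration). A slice comparison is attempted only when the candidate overlap both starts with addition's first element and ends with master's last element. Candidates are tried longest first, matching the order of indices in the function above.

```python
def merge_two_filters(master, addition):
    """Compare slices only where both ends of the candidate overlap line up."""
    if not master or not addition:
        return master + addition
    first, last = addition[0], master[-1]
    limit = min(len(master), len(addition))
    for n in range(limit, 0, -1):
        # An overlap of length n starts at master[-n] and ends at addition[n - 1].
        if master[-n] == first and addition[n - 1] == last:
            if master[-n:] == addition[:n]:
                return master + addition[n:]
    return master + addition


print(merge_two_filters([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```

For the 1234123141234 / 3579 example above, neither filter passes for any length, so no slices are compared at all.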

Based on https://stackoverflow.com/a/30056066/541208 :

def join_two_lists(a, b):
    index = 0
    for i in xrange(len(b), 0, -1):
        # if the first i items of b are the same as
        # the last i items of a, record the overlap length
        if b[:i] == a[-i:]:
            index = i
            break

    return a + b[index:]

All the above solutions are similar in that they use a for/while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are far too slow for merging two large-scale lists (e.g. lists with millions of elements).

I found that using pandas to concatenate large lists is much more time-efficient:

import pandas as pd, numpy as np

trainCompIdMaps = pd.DataFrame( { "compoundId": np.random.permutation( range(800) )[0:80], "partition": np.repeat( "train", 80).tolist()} )

testCompIdMaps = pd.DataFrame( {"compoundId": np.random.permutation( range(800) )[0:20], "partition": np.repeat( "test", 20).tolist()} )

# row-wise concatenation of the two pandas DataFrames
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)

mergedCompIds = np.array(compoundIdMaps["compoundId"])
