
Merge sequences of unique elements

I'm trying to merge a number of sequences, as in the following example:

x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

merged = ['one', 'two', 'three', 'four', 'five']

The given sequences are all subsequences of the same, duplicate-free sequence (which is not given). If the order cannot be determined – as with 'four' and 'five' in the example, which could also be inverted – either solution is ok.
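One way to make this requirement concrete is a small checker: a candidate result is acceptable if it is duplicate-free and every input list is a subsequence of it. This is only a sketch; the helper names `is_subsequence` and `is_valid_merge` are made up for illustration and are not part of the original question.

```python
def is_subsequence(seq, merged):
    """Return True if the items of seq occur in merged in the same relative order."""
    it = iter(merged)
    # `item in it` consumes the iterator up to the match, so order is enforced.
    return all(item in it for item in seq)


def is_valid_merge(merged, *seqs):
    """Return True if merged is duplicate-free and contains every input as a subsequence."""
    return len(set(merged)) == len(merged) and all(
        is_subsequence(s, merged) for s in seqs
    )


x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

# Both orders of 'four' and 'five' are accepted; swapping 'one' and 'two' is not.
print(is_valid_merge(['one', 'two', 'three', 'four', 'five'], x, y, z))  # True
print(is_valid_merge(['one', 'two', 'three', 'five', 'four'], x, y, z))  # True
print(is_valid_merge(['two', 'one', 'three', 'four', 'five'], x, y, z))  # False
```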

The problem resembles multiple sequence alignment, but I suspect there is an (algorithmically) easier solution, since it is more restricted (no duplicates, no crossing edges). E.g. when starting from the union of all elements, I would only need to order the elements, but I can't seem to find a decent way to deduce the underlying order from the input sequences.

The example is in Python and a desired solution would also be, but the problem is of general algorithmic nature.

Here is a very inefficient method that should do what you want:

w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

def get_score(m, k):
    v = m[k]
    return sum(get_score(m, kk) for kk in v) + 1

m = {}
for lst in [w,x,y,z]:
    for (i,src) in enumerate(lst):
        if src not in m: m[src] = []
        for (j,dst) in enumerate(lst[i+1:]):
            m[src].append(dst)

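scored_u = [(k, get_score(m, k)) for k in m]
# Tuple-unpacking lambdas such as `lambda (k,s): s` are Python 2 only;
# indexing the pair works in both Python 2 and 3.
scored_s = sorted(scored_u, key=lambda ks: ks[1], reverse=True)

for (k, s) in scored_s:
    print((k, s))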

Output:

('zero', 13)
('one', 12)
('two', 6)
('three', 3)
('four', 1)
('five', 1)

The approach first builds a mapping m whose keys are the terms of the lists and whose values are lists of the terms that were found after that key in some input list.

So in this case, m looks like:

{
  'three': ['five', 'four'], 
  'two':   ['four', 'three', 'five'], 
  'four':  [], 
  'zero':  ['one'], 
  'five':  [], 
  'one':   ['two', 'four', 'three', 'four']
}

From there, it computes a score for each key. The score is defined by the sum of the scores of the elements that have been seen to follow it, plus 1.

So

get_score(m, 'four') = 1
get_score(m, 'five') = 1
# and thus
get_score(m, 'three') = 3  # (1(four) + 1(five) + 1)

It does this for each element found in the input lists (in my case w, x, y, z), then sorts the elements by score, descending.

I say this is inefficient because get_score is not memoized; with memoization you would only have to determine the score of each key once. You'd likely do this by working backwards: compute the scores of the keys whose value is an empty list first, then the scores of the keys that lead to them. In the current implementation, the score of some keys is computed multiple times.
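A memoized variant could look like the sketch below. It uses functools.lru_cache so that each key is scored once; the function names `build_followers` and `score_all` are invented here, and the code assumes the input lists do not contradict each other (a cycle would make the recursion never terminate).

```python
from functools import lru_cache


def build_followers(seqs):
    """Map each element to the list of elements seen after it in any input list."""
    followers = {}
    for seq in seqs:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])
    return followers


def score_all(followers):
    """Return {element: score}, computing each element's score only once."""
    @lru_cache(maxsize=None)
    def score(key):
        # Same definition as get_score: sum of the followers' scores, plus 1.
        return sum(score(nxt) for nxt in followers[key]) + 1

    return {key: score(key) for key in followers}


w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

scores = score_all(build_followers([w, x, y, z]))
for key in sorted(scores, key=scores.get, reverse=True):
    print((key, scores[key]))
```

The scores are the same as in the unmemoized version; only the number of recursive calls changes.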

Note: All this guarantees is that an element never ranks below an element that is known to follow it. For example, adding

v = ['one-point-five', 'four']

into the mix will place one-point-five above four on the list, but since you're only referencing it once, in v, there's not enough context to do a better job.

Just for completeness, this is how I ended up solving the problem:

As pointed out by @DSM, this problem relates to topological sorting. There are third-party modules out there, e.g. toposort (plain Python, no dependencies).

The sequences need to be converted into a mapping format, similar to the ones also used/suggested in other answers. toposort_flatten() then does the rest:

from collections import defaultdict
from toposort import toposort_flatten

def merge_seqs(*seqs):
    '''Merge sequences that share a hidden order.'''
    order_map = defaultdict(set)
    for s in seqs:
        for i, elem in enumerate(s):
            order_map[elem].update(s[:i])
    return toposort_flatten(dict(order_map))

With the above example:

>>> w = ['zero', 'one']
>>> x = ['one', 'two', 'four']
>>> y = ['two', 'three', 'five']
>>> z = ['one', 'three', 'four']
>>> merge_seqs(w, x, y, z)
['zero', 'one', 'two', 'three', 'five', 'four']
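On Python 3.9 or later, the standard library's graphlib module provides a topological sorter, so the same idea can be sketched without a third-party dependency. This is an alternative to the toposort version, not the code I actually used; `merge_seqs_stdlib` is a made-up name. Only adjacent pairs are registered, since the sorter handles transitivity itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def merge_seqs_stdlib(*seqs):
    """Merge sequences that share a hidden order, using graphlib."""
    ts = TopologicalSorter()
    for seq in seqs:
        if seq:
            ts.add(seq[0])  # register the first element even if nothing precedes it
        for prev, cur in zip(seq, seq[1:]):
            ts.add(cur, prev)  # `prev` must come before `cur`
    # static_order() raises graphlib.CycleError if the inputs contradict each other.
    return list(ts.static_order())


w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

print(merge_seqs_stdlib(w, x, y, z))
```

The relative position of unconstrained elements such as 'four' and 'five' is left to the sorter, so the exact output may differ from the toposort result while still being a valid merge.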

Your problem is about an order relation from discrete mathematics: all the pairs that can be taken from your lists belong to one transitive relation, which means that if a > b and b > c then a > c. Because of that, in a set of 5 elements the smallest element should be the first member of 4 distinct pairs, provided the input lists contain that many pairs for it. So first we need to create these pairs, grouped by their first element; for that we can use the combinations, chain and groupby functions from the itertools module:

>>> from itertools import combinations,chain,groupby
>>> from operator import itemgetter

>>> l1= [list(g) for _,g in groupby(sorted(chain.from_iterable(combinations(i,2) for i in [x,y,z])),key=itemgetter(0))]
[[('one', 'four'), ('one', 'four'), ('one', 'three'), ('one', 'two')], [('three', 'five'), ('three', 'four')], [('two', 'five'), ('two', 'four'), ('two', 'three')]]

So if we find groups with 4, 3, 2 and 1 distinct partners, we have found the answer. If we don't find such a sequence, we can repeat the calculation in reverse, grouping by the second element of each pair, with the same logic: a group with 4 distinct partners then belongs to the largest element, and so on.

>>> l2= [list(g) for _,g in groupby(sorted(chain.from_iterable(combinations(i,2) for i in [x,y,z]),key=itemgetter(1)),key=itemgetter(1))]
[[('two', 'five'), ('three', 'five')], [('one', 'four'), ('two', 'four'), ('one', 'four'), ('three', 'four')], [('two', 'three'), ('one', 'three')], [('one', 'two')]]

So we can do the following. Note that we use set(list(zip(*i))[1]) to get the set of elements that a given element is related to, then len to count them. The list() call is needed on Python 3, where zip returns an iterator that cannot be indexed.

>>> [(i[0][0],len(set(list(zip(*i))[1]))) for i in l1]
[('one', 3), ('three', 2), ('two', 3)]
>>> [(i[0][1],len(set(list(zip(*i))[0]))) for i in l2]
[('five', 2), ('four', 3), ('three', 2), ('two', 1)]

The first part gives us the successor counts for one, two and three, so what remains is to place four and five, neither of which has any successor. Turning to the second part, we look at the predecessor counts: four has 3 predecessors, so it takes the 4th position, and therefore the 5th element must be five.

Edit: as a more elegant and faster way, you can do the job with collections.defaultdict:

>>> from collections import defaultdict
>>> d=defaultdict(set)
>>> for i,j in chain.from_iterable(combinations(i,2) for i in [x,y,z]) :
...          d[i].add(j)
... 
>>> d
defaultdict(<type 'set'>, {'three': set(['four', 'five']), 'two': set(['four', 'five', 'three']), 'one': set(['four', 'two', 'three'])})
>>> l1=[(k,len(v)) for k,v in d.items()]
>>> l1
[('three', 2), ('two', 3), ('one', 3)]
>>> d=defaultdict(set)
>>> for i,j in chain.from_iterable(combinations(i,2) for i in [x,y,z]) :
...          d[j].add(i) #create dict reversely 
... 
>>> l2=[(k,len(v)) for k,v in d.items()]
>>> l2
[('four', 3), ('five', 2), ('two', 1), ('three', 2)]
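The counting idea becomes reliable if the counts are taken over the transitive closure of the relation rather than over the directly observed pairs only: an element then always has strictly more predecessors than any element known to precede it, so sorting by that count gives a consistent order. The following is a sketch of that variant, not the code shown above; `merge_by_predecessor_count` is a hypothetical name, and the inputs are assumed to be free of contradictions.

```python
from collections import defaultdict
from itertools import chain, combinations


def merge_by_predecessor_count(*seqs):
    """Order elements by how many elements are known to precede them."""
    preds = defaultdict(set)
    for seq in seqs:
        for item in seq:
            preds[item]  # make sure every element has an entry
    for a, b in chain.from_iterable(combinations(s, 2) for s in seqs):
        preds[b].add(a)

    # Expand to the transitive closure: keep pulling in predecessors of
    # predecessors until nothing changes.
    changed = True
    while changed:
        changed = False
        for item in preds:
            extra = set().union(*(preds[p] for p in preds[item])) - preds[item]
            if extra:
                preds[item] |= extra
                changed = True

    # Ties (same count) are broken alphabetically, purely for determinism.
    return sorted(preds, key=lambda item: (len(preds[item]), item))


x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

print(merge_by_predecessor_count(x, y, z))
# ['one', 'two', 'three', 'five', 'four']
```

Here 'five' picks up 'one' as a predecessor through the closure (one before two, two before five), which ties it with 'four' at three predecessors each; either order of the two is acceptable.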
