简体   繁体   中英

Yet another merging list of lists, but most pythonic way

I tried finding the answers of few selected answers, but they doesn't seem working. I have to merge list of lists.

[[283, 311], [311, 283, 316], [150, 68], [86, 119], [259, 263], [263, 259, 267], [118, 87], [262, 264], [264, 262], [85, 115], [115, 85], [244, 89], [140, 76, 82, 99], [236, 111], [330, 168], [76, 63, 107, 124, 128, 135, 140], [131, 254], [254, 131], [21, 38], [38, 21], [220, 291], [291, 220], [296, 46], [64, 53, 57, 61, 63, 65, 66, 76, 96, 100, 103, 114, 123, 127, 128, 130, 148, 149], [274, 240], [157, 225, 234], [225, 247], [233, 44], [89, 244], [80, 101], [210, 214], [78, 155], [55, 139], [102, 74, 75, 132], [105, 252], [149, 55, 59, 63, 71, 73, 81, 100, 102, 116, 122, 138, 146], [97, 231], [231, 97], [155, 78], [239, 305], [305, 239], [145, 94, 248], [147, 150], [61, 64], [152, 219], [219, 152], [246, 250], [252, 105], [223, 235], [235, 223], [237, 60, 344], [344, 237], [182, 129], [331, 117], [12, 2, 8, 10, 13, 15], [250, 246]]

All I want is, for example, [283,311] exists in [311,283,316] . then both should be merged making only one list in the above list. I need to merge a list if it exists in some other.

Please note, I can do it using loop inside loop, but looking for a pythonic way to achieve this. Also, if you know how to merge lists sharing atleaast one common element, please share.

Edit :

Please don't forget, looking for a pythonic approach. I think it is not possible using any core Pythonic approach. Then what should be next possible solution using loops. I need efficiency as have to merge list of lists with more than a million items every half an hour. I did using for loop inside for loop then comparing two items, but it is taking too much of time.

To answer your latter question; it's a connected-components problem.

>>> import networkx as nx
>>> G = nx.Graph()
>>> for component in l:
...:     G.add_edges_from(pairs(component))
...:     
>>> nx.connected_components(G)
# the answer

I think this works ...

a = [[283, 311], [311, 283, 316], [150, 68], [86, 119], [259, 263], [263, 259, 267], [118, 87], [262, 264], [264, 262], [85, 115], [115, 85], [244, 89], [140, 76, 82, 99], [236, 111], [330, 168], [76, 63, 107, 124, 128, 135, 140], [131, 254], [254, 131], [21, 38], [38, 21], [220, 291], [291, 220], [296, 46], [64, 53, 57, 61, 63, 65, 66, 76, 96, 100, 103, 114, 123, 127, 128, 130, 148, 149], [274, 240], [157, 225, 234], [225, 247], [233, 44], [89, 244], [80, 101], [210, 214], [78, 155], [55, 139], [102, 74, 75, 132], [105, 252], [149, 55, 59, 63, 71, 73, 81, 100, 102, 116, 122, 138, 146], [97, 231], [231, 97], [155, 78], [239, 305], [305, 239], [145, 94, 248], [147, 150], [61, 64], [152, 219], [219, 152], [246, 250], [252, 105], [223, 235], [235, 223], [237, 60, 344], [344, 237], [182, 129], [331, 117], [12, 2, 8, 10, 13, 15], [250, 246]]
aa = set(map(frozenset,a))  #eliminate duplicates
result =  [x for x in aa if not any(x < y for y in aa)]

Note that this gives you a list of frozenset objects, but you can easily turn that back into a list of lists:

result = [list(fset) for fset in result]

Unfortunately, it has quadratic complexity ... I'm not sure if there's anything to really be done about that ... (keeping it as lists would result in much worse complexity FWIW)

Here's a second approach which is a little better algorithmically:

from itertools import takewhile
def foo2(aa):
    aa = sorted(set(map(frozenset,a)),key=len)
    def conditional(x,aa):
        xlen = len(x)
        return any(x < y for y in takewhile(lambda z: xlen <= len(z),aa))
    return [x for x in aa if not conditional(x,aa)]

Of course, It would need to be timed on you real data to see if it actually performs better (for the test data, the lists aren't big enough and the overhead of generating the conditional function and the takewhile instance and the lambda make it not worthwhile.


for the curious, here's my benchmark:

a = [[283, 311], [311, 283, 316], [150, 68], [86, 119], [259, 263], [263, 259, 267], [118, 87], [262, 264], [264, 262], [85, 115], [115, 85], [244, 89], [140, 76, 82, 99], [236, 111], [330, 168], [76, 63, 107, 124, 128, 135, 140], [131, 254], [254, 131], [21, 38], [38, 21], [220, 291], [291, 220], [296, 46], [64, 53, 57, 61, 63, 65, 66, 76, 96, 100, 103, 114, 123, 127, 128, 130, 148, 149], [274, 240], [157, 225, 234], [225, 247], [233, 44], [89, 244], [80, 101], [210, 214], [78, 155], [55, 139], [102, 74, 75, 132], [105, 252], [149, 55, 59, 63, 71, 73, 81, 100, 102, 116, 122, 138, 146], [97, 231], [231, 97], [155, 78], [239, 305], [305, 239], [145, 94, 248], [147, 150], [61, 64], [152, 219], [219, 152], [246, 250], [252, 105], [223, 235], [235, 223], [237, 60, 344], [344, 237], [182, 129], [331, 117], [12, 2, 8, 10, 13, 15], [250, 246]]

def foo1(aa):
    aa = set(map(frozenset,a))  #eliminate duplicates
    return [x for x in aa if not any(x < y for y in aa)]

from itertools import takewhile
def conditional(x,aa,takewhile=takewhile):
    xlen = len(x)
    return any(x < y for y in takewhile(lambda z: xlen <= len(z),aa))

def foo2(aa):
    aa = sorted(set(map(frozenset,a)),key=len)
    return [x for x in aa if not conditional(x,aa)]

print set(foo1(a)) == set(foo2(a))

import timeit
N = 10000
print timeit.timeit('foo1(a)','from __main__ import foo1,a',number=N)
print timeit.timeit('foo2(a)','from __main__ import foo2,a',number=N)

Here's an alternative approach that I think should be faster than @mgilson's if you have a large number of small sets:

from collections import defaultdict

sets = set(map(frozenset, lists))

def remove_subsets(sets):
    # map each element to the sets in which it occurs
    sets_containing = defaultdict(set)
    for s in sets:
        for x in s:
            sets_containing[x].add(s)

    for s in sets:
        supersets = set.intersection(*(sets_containing[x] for x in s))
        if len(supersets) == 1:
            yield s

The difference is mainly in the final loop, which runs not through all n(n-1)/2 distinct pairs of sets, but only though the n sets in the outer loop, then through a set of candidate supersets that contain some element of the set under consideration. It can be optimized further by stopping the reduce early when it produces an empty set.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM