
Python clustering list of sets

For a collection of sets, I need to group them into multiple "clusters" such that every node in a cluster is a subset of the largest node in that cluster. E.g.

input = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

should give:

[{1, 2, 3}, {1, 2}, {1}], 
[{1, 2, 4}, {1, 2}, {1, 4}, {1}]

I've tried building a subset tree with reference to this, but it soon becomes very slow when the input is large, since it iterates over all children on every insertion.

I'm not familiar with k-means clustering, but does it apply to the problem here?

What is the most efficient way of doing it?

First, sort the list in descending order of length. This way you start with the longest sets, which cannot be proper subsets of any other set.

Then store each representative set as a key of a dict (converted to a tuple, since a set is not hashable), with a list of its member sets as the value.

For each set, check whether it is a subset of any key and, if so, append it to that key's list. A set can match several keys, so it may end up in more than one list.

If it was not added under any key, it becomes a new representative.

At the end, take the values() of the resulting dict:

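l = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

# Maps each representative (stored as a tuple, because sets are not hashable)
# to the list of sets that belong to its cluster.
grouped_sets = {}
for cur_set in sorted(l, key=len, reverse=True):
    is_subset = False
    # Compare against every representative; a set may join several clusters.
    for represent, sets in grouped_sets.items():
        if cur_set.issubset(represent):
            sets.append(cur_set)
            is_subset = True

    # Not contained in any representative: start a new cluster.
    if not is_subset:
        grouped_sets[tuple(cur_set)] = [cur_set]

print(list(grouped_sets.values()))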

Which gives:

[[{1, 2, 3}, {1, 2}, {1}], 
 [{1, 2, 4}, {1, 2}, {1, 4}, {1}]]
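The same steps can be wrapped in a function. The following is only a sketch: the name cluster_by_superset is made up for illustration, and it uses a frozenset as the dict key instead of a tuple, which is an assumption on my part rather than part of the answer above. The sample input adds a set ({5, 6}) that shares nothing with the others, to show that it ends up as its own cluster.

```python
def cluster_by_superset(sets):
    """Group sets under every larger set that contains them (hypothetical helper)."""
    clusters = {}  # frozenset representative -> list of member sets
    # Longest first, so a representative is always seen before its subsets.
    for cur in sorted(sets, key=len, reverse=True):
        matched = False
        for rep, members in clusters.items():
            if cur <= rep:  # subset test; works between set and frozenset
                members.append(cur)
                matched = True
        if not matched:
            # No existing representative contains cur: it starts a new cluster.
            clusters[frozenset(cur)] = [cur]
    return list(clusters.values())


result = cluster_by_superset([{1, 2}, {5, 6}, {1, 2, 3}, {1}])
print(result)  # [[{1, 2, 3}, {1, 2}, {1}], [{5, 6}]]
```

This still compares every set against every representative, so the running time grows with the number of sets times the number of clusters.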

Sorting the sets by decreasing length should reduce the work to one superset test per cluster per set, because each set only has to be compared against the first (largest) set of each existing cluster. The gain is data dependent: it won't help if no set is a subset of another (every set then becomes its own cluster), but it should help more as the clusters get bigger:

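setList = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

groups = []  # each group is a list whose first element is its largest set
for aSet in sorted(setList, key=len, reverse=True):
    # Every existing group whose largest set contains aSet
    clusters = [g for g in groups if g[0].issuperset(aSet)]
    if not clusters:
        # No container found: aSet starts a new group
        groups.append([])
        clusters = groups[-1:]
    for g in clusters:
        g.append(aSet)

print(groups)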

[[{1, 2, 3}, {1, 2}, {1}], [{1, 2, 4}, {1, 2}, {1, 4}, {1}]]
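If there are many clusters, scanning every representative for every set may become the bottleneck. One possible variation, which is not part of either answer above, is to keep an inverted index from each element to the clusters whose representative contains it. The clusters that contain a set are then the intersection of the index entries for its elements. The sketch below assumes hashable elements; cluster_with_index and its variable names are hypothetical, and whether it is actually faster depends on the data.

```python
from collections import defaultdict


def cluster_with_index(sets):
    """Same grouping as above, using an element -> cluster-id index (a sketch)."""
    groups = []               # groups[i][0] is the representative of cluster i
    index = defaultdict(set)  # element -> ids of clusters whose representative has it
    for s in sorted(sets, key=len, reverse=True):
        if s:
            # Cluster ids for each element, smallest first to keep the
            # intersection cheap; an element never seen yields an empty set.
            postings = sorted((index.get(e, set()) for e in s), key=len)
            candidates = set(postings[0]).intersection(*postings[1:])
        else:
            # The empty set is a subset of every representative.
            candidates = set(range(len(groups)))
        if candidates:
            for gid in sorted(candidates):
                groups[gid].append(s)
        else:
            # No representative contains s: it becomes a new representative.
            gid = len(groups)
            groups.append([s])
            for e in s:
                index[e].add(gid)
    return groups


print(cluster_with_index([{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]))
# [[{1, 2, 3}, {1, 2}, {1}], [{1, 2, 4}, {1, 2}, {1, 4}, {1}]]
```

The index costs extra memory (one entry per element of each representative), so it is only likely to pay off when the number of clusters is large compared with the size of the individual sets.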

