简体   繁体   English

合并共享公共元素的列表

[英]Merge lists that share common elements

My input is a list of lists.我的输入是一个列表列表。 Some of them share common elements, eg.其中一些共享公共元素,例如。

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

I need to merge all lists, that share a common element, and repeat this procedure as long as there are no more lists with the same item.我需要合并所有共享一个公共元素的列表,只要没有更多列表具有相同的项目,就重复此过程。 I thought about using boolean operations and a while loop, but couldn't come up with a good solution.我想过使用布尔运算和 while 循环,但想不出一个好的解决方案。

The final result should be:最终结果应该是:

L = [['a','b','c','d','e','f','g','o','p'],['k']] 

You can see your list as a notation for a Graph, ie ['a','b','c'] is a graph with 3 nodes connected to each other.您可以将列表视为图形的符号,即['a','b','c']是具有 3 个相互连接的节点的图形。 The problem you are trying to solve is finding connected components in this graph .您要解决的问题是在此图中找到连通分量

You can use NetworkX for this, which has the advantage that it's pretty much guaranteed to be correct:您可以为此使用NetworkX ,它的优点是几乎可以保证正确:

l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

import networkx 
from networkx.algorithms.components.connected import connected_components


def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also imlies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a Graph and returns it's edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    

G = to_graph(l)
print connected_components(G)
# prints [['a', 'c', 'b', 'e', 'd', 'g', 'f', 'o', 'p'], ['k']]

To solve this efficiently yourself you have to convert the list into something graph-ish anyways, so you might as well use networkX from the start.为了自己有效地解决这个问题,无论如何你必须将列表转换为图形化的东西,所以你最好从一开始就使用 networkX。

Algorithm:算法:

  1. take first set A from list从列表中取出第一组 A
  2. for each other set B in the list do if B has common element(s) with A join B into A;对于列表中的每个其他集合 B,如果 B 具有公共元素,并且 A 将 B 连接到 A; remove B from list从列表中删除 B
  3. repeat 2. until no more overlap with A重复 2. 直到不再与 A 重叠
  4. put A into outpup将 A 放入输出
  5. repeat 1. with rest of list重复 1. 与列表的其余部分

So you might want to use sets instead of list.所以你可能想使用集合而不是列表。 The following program should do it.下面的程序应该可以做到。

l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]

out = []
while len(l)>0:
    first, *rest = l
    first = set(first)

    lf = -1
    while len(first)>lf:
        lf = len(first)

        rest2 = []
        for r in rest:
            if len(first.intersection(set(r)))>0:
                first |= set(r)
            else:
                rest2.append(r)     
        rest = rest2

    out.append(first)
    l = rest

print(out)

I needed to perform the clustering technique described by the OP millions of times for rather large lists, and therefore wanted to determine which of the methods suggested above is both most accurate and most performant.我需要为相当大的列表执行 OP 描述的聚类技术数百万次,因此想确定上面建议的哪种方法最准确和性能最好。

I ran 10 trials for input lists sized from 2^1 through 2^10 for each method above, using the same input list for each method, and measured the average runtime for each algorithm proposed above in milliseconds.对于上述每种方法,我对大小从 2^1 到 2^10 的输入列表进行了 10 次试验,对每种方法使用相同的输入列表,并以毫秒为单位测量了上述每种算法的平均运行时间。 Here are the results:结果如下:

在此处输入图片说明

These results helped me see that of the methods that consistently return correct results, @jochen's is the fastest.这些结果帮助我看到在始终返回正确结果的方法中,@jochen 是最快的。 Among those methods that don't consistently return correct results, mak's solution often does not include all of the input elements (ie list of list members are missing), and the solutions of braaksma, cmangla, and asterisk are not guaranteed to be maximally merged.在那些不能始终返回正确结果的方法中,mak的解决方案往往不包括所有的输入元素(即列表成员列表缺失),并且不保证braaksma、cmangla和asterisk的解决方案最大程度地合并.

It's interesting that the two fastest, correct algorithms have the top two amount of upvotes to date, in properly ranked order.有趣的是,这两个最快、正确的算法按正确的排序顺序获得了迄今为止的前两名赞成票。

Here's the code used to run the tests:这是用于运行测试的代码:

from networkx.algorithms.components.connected import connected_components
from itertools import chain
from random import randint, random
from collections import defaultdict, deque
from copy import deepcopy
from multiprocessing import Pool
import networkx
import datetime
import os

##
# @mimomu
##

def mimomu(l):
  l = deepcopy(l)
  s = set(chain.from_iterable(l))
  for i in s:
    components = [x for x in l if i in x]
    for j in components:
      l.remove(j)
    l += [list(set(chain.from_iterable(components)))]
  return l

##
# @Howard
##

def howard(l):
  out = []
  while len(l)>0:
      first, *rest = l
      first = set(first)

      lf = -1
      while len(first)>lf:
          lf = len(first)

          rest2 = []
          for r in rest:
              if len(first.intersection(set(r)))>0:
                  first |= set(r)
              else:
                  rest2.append(r)
          rest = rest2

      out.append(first)
      l = rest
  return out

##
# Nx @Jochen Ritzel
##

def jochen(l):
  l = deepcopy(l)

  def to_graph(l):
      G = networkx.Graph()
      for part in l:
          # each sublist is a bunch of nodes
          G.add_nodes_from(part)
          # it also imlies a number of edges:
          G.add_edges_from(to_edges(part))
      return G

  def to_edges(l):
      """
          treat `l` as a Graph and returns it's edges
          to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
      """
      it = iter(l)
      last = next(it)

      for current in it:
          yield last, current
          last = current

  G = to_graph(l)
  return list(connected_components(G))

##
# Merge all @MAK
##

def mak(l):
  l = deepcopy(l)
  taken=[False]*len(l)
  l=map(set,l)

  def dfs(node,index):
      taken[index]=True
      ret=node
      for i,item in enumerate(l):
          if not taken[i] and not ret.isdisjoint(item):
              ret.update(dfs(item,i))
      return ret

  def merge_all():
      ret=[]
      for i,node in enumerate(l):
          if not taken[i]:
              ret.append(list(dfs(node,i)))
      return ret

  result = list(merge_all())
  return result

##
# @cmangla
##

def cmangla(l):
  l = deepcopy(l)
  len_l = len(l)
  i = 0
  while i < (len_l - 1):
    for j in range(i + 1, len_l):
      # i,j iterate over all pairs of l's elements including new
      # elements from merged pairs. We use len_l because len(l)
      # may change as we iterate
      i_set = set(l[i])
      j_set = set(l[j])

      if len(i_set.intersection(j_set)) > 0:
        # Remove these two from list
        l.pop(j)
        l.pop(i)

        # Merge them and append to the orig. list
        ij_union = list(i_set.union(j_set))
        l.append(ij_union)

        # len(l) has changed
        len_l -= 1

        # adjust 'i' because elements shifted
        i -= 1

        # abort inner loop, continue with next l[i]
        break

      i += 1
  return l

##
# @pillmuncher
##

def pillmuncher(l):
  l = deepcopy(l)

  def connected_components(lists):
    neighbors = defaultdict(set)
    seen = set()
    for each in lists:
        for item in each:
            neighbors[item].update(each)
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        nodes = set([node])
        next_node = nodes.pop
        while nodes:
            node = next_node()
            see(node)
            nodes |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node))

  return list(connected_components(l))

##
# @NicholasBraaksma
##

def braaksma(l):
  l = deepcopy(l)
  lists = sorted([sorted(x) for x in l]) #Sorts lists in place so you dont miss things. Trust me, needs to be done.

  resultslist = [] #Create the empty result list.

  if len(lists) >= 1: # If your list is empty then you dont need to do anything.
      resultlist = [lists[0]] #Add the first item to your resultset
      if len(lists) > 1: #If there is only one list in your list then you dont need to do anything.
          for l in lists[1:]: #Loop through lists starting at list 1
              listset = set(l) #Turn you list into a set
              merged = False #Trigger
              for index in range(len(resultlist)): #Use indexes of the list for speed.
                  rset = set(resultlist[index]) #Get list from you resultset as a set
                  if len(listset & rset) != 0: #If listset and rset have a common value then the len will be greater than 1
                      resultlist[index] = list(listset | rset) #Update the resultlist with the updated union of listset and rset
                      merged = True #Turn trigger to True
                      break #Because you found a match there is no need to continue the for loop.
              if not merged: #If there was no match then add the list to the resultset, so it doesnt get left out.
                  resultlist.append(l)
  return resultlist

##
# @Rumple Stiltskin
##

def stiltskin(l):
  l = deepcopy(l)
  hashdict = defaultdict(int)

  def hashit(x, y):
      for i in y: x[i] += 1
      return x

  def merge(x, y):
      sums = sum([hashdict[i] for i in y])
      if sums > len(y):
          x[0] = x[0].union(y)
      else:
          x[1] = x[1].union(y)
      return x

  hashdict = reduce(hashit, l, hashdict)
  sets = reduce(merge, l, [set(),set()])
  return list(sets)

##
# @Asterisk
##

def asterisk(l):
  l = deepcopy(l)
  results = {}
  for sm in ['min', 'max']:
    sort_method = min if sm == 'min' else max
    l = sorted(l, key=lambda x:sort_method(x))
    queue = deque(l)

    grouped = []
    while len(queue) >= 2:
      l1 = queue.popleft()
      l2 = queue.popleft()
      s1 = set(l1)
      s2 = set(l2)

      if s1 & s2:
        queue.appendleft(s1 | s2)
      else:
        grouped.append(s1)
        queue.appendleft(s2)
    if queue:
      grouped.append(queue.pop())
    results[sm] = grouped
  if len(results['min']) < len(results['max']):
    return results['min']
  return results['max']

##
# Validate no more clusters can be merged
##

def validate(output, L):
  # validate all sublists are maximally merged
  d = defaultdict(list)
  for idx, i in enumerate(output):
    for j in i:
      d[j].append(i)
  if any([len(i) > 1 for i in d.values()]):
    return 'not maximally merged'
  # validate all items in L are accounted for
  all_items = set(chain.from_iterable(L))
  accounted_items = set(chain.from_iterable(output))
  if all_items != accounted_items:
    return 'missing items'
  # validate results are good
  return 'true'

##
# Timers
##

def time(func, L):
  start = datetime.datetime.now()
  result = func(L)
  delta = datetime.datetime.now() - start
  return result, delta

##
# Function runner
##

def run_func(args):
  func, L, input_size = args
  results, elapsed = time(func, L)
  validation_result = validate(results, L)
  return func.__name__, input_size, elapsed, validation_result

##
# Main
##

all_results = defaultdict(lambda: defaultdict(list))
funcs = [mimomu, howard, jochen, mak, cmangla, braaksma, asterisk]
args = []

for trial in range(10):
  for s in range(10):
    input_size = 2**s

    # get some random inputs to use for all trials at this size
    L = []
    for i in range(input_size):
      sublist = []
      for j in range(randint(5, 10)):
        sublist.append(randint(0, 2**24))
      L.append(sublist)
    for i in funcs:
      args.append([i, L, input_size])

pool = Pool()
for result in pool.imap(run_func, args):
  func_name, input_size, elapsed, validation_result = result
  all_results[func_name][input_size].append({
    'time': elapsed,
    'validation': validation_result,
  })
  # show the running time for the function at this input size
  print(input_size, func_name, elapsed, validation_result)
pool.close()
pool.join()

# write the average of time trials at each size for each function
with open('times.tsv', 'w') as out:
  for func in all_results:
    validations = [i['validation'] for j in all_results[func] for i in all_results[func][j]]
    linetype = 'incorrect results' if any([i != 'true' for i in validations]) else 'correct results'

    for input_size in all_results[func]:
      all_times = [i['time'].microseconds for i in all_results[func][input_size]]
      avg_time = sum(all_times) / len(all_times)

      out.write(func + '\t' + str(input_size) + '\t' + \
        str(avg_time) + '\t' + linetype + '\n')

And for plotting:并用于绘图:

library(ggplot2)
df <- read.table('times.tsv', sep='\t')

p <- ggplot(df, aes(x=V2, y=V3, color=as.factor(V1))) +
  geom_line() +
  xlab('number of input lists') +
  ylab('runtime (ms)') +
  labs(color='') +
  scale_x_continuous(trans='log10') +
  facet_wrap(~V4, ncol=1)

ggsave('runtimes.png')

I think this can be solved by modelling the problem as a graph .我认为这可以通过将问题建模为图形来解决。 Each sublist is a node and shares an edge with another node only if the two sublists have some element in common.每个子列表都是一个节点,并且仅当两个子列表具有某些共同元素时才与另一个节点共享一条边。 Thus, a merged sublist is basically aconnected component in the graph.因此,合并的子列表基本上是图中的连接组件 Merging all of them is simply a matter of finding all connected components and listing them.合并所有这些只是找到所有连接的组件并列出它们的问题。

This can be done by a simple traversal over the graph.这可以通过简单的遍历图来完成。 Both BFS and DFS can be used, but I'm using DFS here since it is somewhat shorter for me. BFSDFS都可以使用,但我在这里使用 DFS,因为它对我来说有点短。

l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
taken=[False]*len(l)
l=[set(elem) for elem in l]

def dfs(node,index):
    taken[index]=True
    ret=node
    for i,item in enumerate(l):
        if not taken[i] and not ret.isdisjoint(item):
            ret.update(dfs(item,i))
    return ret

def merge_all():
    ret=[]
    for i,node in enumerate(l):
        if not taken[i]:
            ret.append(list(dfs(node,i)))
    return ret

print(merge_all())

I came across the same issue of trying to merge down lists with common values.我遇到了尝试合并具有共同值的列表的相同问题。 This example may be what you are looking for.这个例子可能就是你要找的。 It only loops over lists once and updates resultset as it goes.它只在列表上循环一次并在运行时更新结果集。

lists = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
lists = sorted([sorted(x) for x in lists]) #Sorts lists in place so you dont miss things. Trust me, needs to be done.

resultslist = [] #Create the empty result list.

if len(lists) >= 1: # If your list is empty then you dont need to do anything.
    resultlist = [lists[0]] #Add the first item to your resultset
    if len(lists) > 1: #If there is only one list in your list then you dont need to do anything.
        for l in lists[1:]: #Loop through lists starting at list 1
            listset = set(l) #Turn you list into a set
            merged = False #Trigger
            for index in range(len(resultlist)): #Use indexes of the list for speed.
                rset = set(resultlist[index]) #Get list from you resultset as a set
                if len(listset & rset) != 0: #If listset and rset have a common value then the len will be greater than 1
                    resultlist[index] = list(listset | rset) #Update the resultlist with the updated union of listset and rset
                    merged = True #Turn trigger to True
                    break #Because you found a match there is no need to continue the for loop.
            if not merged: #If there was no match then add the list to the resultset, so it doesnt get left out.
                resultlist.append(l)
print resultlist

# #

resultset = [['a', 'b', 'c', 'd', 'e', 'g', 'f', 'o', 'p'], ['k']]

As Jochen Ritzel pointed out you are looking for connected components in a graph.正如Jochen Ritzel 指出的,您正在寻找图中的连通分量。 Here is how you could implement it without using a graph library:以下是不使用图形库即可实现它的方法:

from collections import defaultdict

def connected_components(lists):
    neighbors = defaultdict(set)
    seen = set()
    for each in lists:
        for item in each:
            neighbors[item].update(each)
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        nodes = set([node])
        next_node = nodes.pop
        while nodes:
            node = next_node()
            see(node)
            nodes |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node))

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print list(connected_components(L))

You can use networkx library because is a graph theory and connected components problem:您可以使用 networkx 库,因为这是一个图论连通分量问题:

import networkx as nx

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

G = nx.Graph()

#Add nodes to Graph    
G.add_nodes_from(sum(L, []))

#Create edges from list of nodes
q = [[(s[i],s[i+1]) for i in range(len(s)-1)] for s in L]

for i in q:

    #Add edges to Graph
    G.add_edges_from(i)

#Find all connnected components in graph and list nodes for each component
[list(i) for i in nx.connected_components(G)]

Output:输出:

[['p', 'c', 'f', 'g', 'o', 'a', 'd', 'b', 'e'], ['k']]

I have found itertools a fast option for merging lists and it solved this problem for me:我发现 itertools 是合并列表的快速选项,它为我解决了这个问题:

import itertools

LL = set(itertools.chain.from_iterable(L)) 
# LL is {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'k', 'o', 'p'}

for each in LL:
  components = [x for x in L if each in x]
  for i in components:
    L.remove(i)
  L += [list(set(itertools.chain.from_iterable(components)))]

# then L = [['k'], ['a', 'c', 'b', 'e', 'd', 'g', 'f', 'o', 'p']]

For large sets sorting LL by frequency from the most common elements to the least can speed things up a bit对于大集合,按频率从最常见的元素到最少的元素排序 LL 可以加快速度

I miss a non quirurgic version.我想念一个非古怪的版本。 I post it on 2018 (7 years later)我在 2018 年发布(7 年后)

An easy and understable approach:一种简单而不稳定的方法:

1) make cartesian product ( cross join ) merging both if elements in common 1)使笛卡尔积(交叉连接)合并两个如果元素相同
2) remove dups 2)删除重复

#your list
l=[['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

#import itertools
from itertools import product, groupby

#inner lists to sets (to list of sets)
l=[set(x) for x in l]

#cartesian product merging elements if some element in common
for a,b in product(l,l):
    if a.intersection( b ):
       a.update(b)
       b.update(a)

#back to list of lists
l = sorted( [sorted(list(x)) for x in l])

#remove dups
list(l for l,_ in groupby(l))

#result
[['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]

This is a fairly fast solution with no dependencies.这是一个相当快的解决方案,没有依赖性。 It works as follows:它的工作原理如下:

  1. Assign a unique reference number to each of your subsists (in this case, the initial index of the sublist)为每个子列表分配一个唯一的参考号(在这种情况下,子列表的初始索引)

  2. Create a dictionary of the reference elements for each sublist, and for each item in each sublist.为每个子列表和每个子列表中的每个项目创建一个包含引用元素的字典。

  3. Repeat the following procedure until it causes no changes:重复以下过程,直到它不会引起任何变化:

    3a. 3a. Go through each item in each sublist.浏览每个子列表中的每个项目。 If that item's current reference number is different from the reference number of its sublist, then the element must be part of two lists.如果该项目的当前引用编号与其子列表的引用编号不同,则该元素必须是两个列表的一部分。 Merge the two lists (removing the current sublist from the reference) and set the reference number of all items in the current sublist to be the reference number of the new sublist.合并两个列表(从引用中移除当前子列表),将当前子列表中所有项的引用号设置为新子列表的引用号。

When this procedure causes no changes, it is because all elements are part of exactly one list.当此过程不会引起任何更改时,这是因为所有元素都恰好是一个列表的一部分。 Since working set is decreasing in size on every iteration, the algorithm necessarily terminates.由于每次迭代工作集的大小都在减小,因此算法必然会终止。

   def merge_overlapping_sublists(lst):
    output, refs = {}, {}
    for index, sublist in enumerate(lst):
        output[index] = set(sublist)
        for elem in sublist:
            refs[elem] = index
    changes = True
    while changes:
        changes = False
        for ref_num, sublist in list(output.items()):
            for elem in sublist:
                current_ref_num = refs[elem]
                if current_ref_num != ref_num:
                    changes = True
                    output[current_ref_num] |= sublist
                    for elem2 in sublist:
                        refs[elem2] = current_ref_num
                    output.pop(ref_num)
                    break
    return list(output.values())

Here are a set of tests for this code:以下是此代码的一组测试:

def compare(a, b):
    a = list(b)
    try:
        for elem in a:
            b.remove(elem)
    except ValueError:
        return False
    return not b

import random
lst = [["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c", "d", "e"}])
lst = [["a", "b"], ["b", "c"], ["f", "d"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c",}, {"d", "e", "f"}])
lst = [["a", "b"], ["k", "c"], ["f", "g"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b"}, {"k", "c"}, {"f", "g"}, {"d", "e"}])
lst = [["a", "b", "c"], ["b", "d", "e"], ["k"], ["o", "p"], ["e", "f"], ["p", "a"], ["d", "g"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"k"}, {"a", "c", "b", "e", "d", "g", "f", "o", "p"}])    
lst = [["a", "b"], ["b", "c"], ["a"], ["a"], ["b"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c"}])

Note that the return value is a list of sets.请注意,返回值是一个集合列表。

My attempt.我的尝试。 Has functional look to it.具有功能性的外观。

#!/usr/bin/python
from collections import defaultdict
l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
hashdict = defaultdict(int)

def hashit(x, y):
    for i in y: x[i] += 1
    return x

def merge(x, y):
    sums = sum([hashdict[i] for i in y])
    if sums > len(y):
        x[0] = x[0].union(y)
    else:
        x[1] = x[1].union(y)
    return x


hashdict = reduce(hashit, l, hashdict)
sets = reduce(merge, l, [set(),set()])
print [list(sets[0]), list(sets[1])]

This is perhaps a simpler/faster algorithm and seems to work well -这可能是一种更简单/更快的算法,并且似乎运行良好 -

l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]

len_l = len(l)
i = 0
while i < (len_l - 1):
    for j in range(i + 1, len_l):

        # i,j iterate over all pairs of l's elements including new 
        # elements from merged pairs. We use len_l because len(l)
        # may change as we iterate
        i_set = set(l[i])
        j_set = set(l[j])

        if len(i_set.intersection(j_set)) > 0:
            # Remove these two from list
            l.pop(j)
            l.pop(i)

            # Merge them and append to the orig. list
            ij_union = list(i_set.union(j_set))
            l.append(ij_union)

            # len(l) has changed
            len_l -= 1

            # adjust 'i' because elements shifted
            i -= 1

            # abort inner loop, continue with next l[i]
            break

    i += 1

print l
# prints [['k'], ['a', 'c', 'b', 'e', 'd', 'g', 'f', 'o', 'p']]

Without knowing quite what you want, I decided to just guess you meant: I want to find every element just once.在完全不知道您想要什么的情况下,我决定猜测您的意思:我只想找到每个元素一次。

#!/usr/bin/python


def clink(l, acc):
  for sub in l:
    if sub.__class__ == list:
      clink(sub, acc)
    else:
      acc[sub]=1

def clunk(l):
  acc = {}
  clink(l, acc)
  print acc.keys()

l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]

clunk(l)

Output looks like:输出看起来像:

['a', 'c', 'b', 'e', 'd', 'g', 'f', 'k', 'o', 'p']

Simply speaking, you can use quick-find.简单地说,您可以使用快速查找。

The key is to use two temporary list.关键是使用两个临时列表。 The first is called elements , which stores all elements existing in all groups.第一个称为元素,它存储所有组中存在的所有元素。 The second is named labels .第二个是命名标签 I got the inspiration from sklearn's kmeans algorithm.我的灵感来自 sklearn 的 kmeans 算法。 'labels' stores the labels or the centroids of the elements. 'labels' 存储元素的标签或质心。 Here I simply let the first element in the cluster be the centroid.这里我只是让簇中的第一个元素作为质心。 Initially, the values are from 0 to length-1, ascendingly.最初,值从 0 到 length-1,按升序排列。

For each group, I get their 'indices' in 'elements'.对于每个组,我在“元素”中获取他们的“索引”。 I then get the labels for group according to indices.然后我根据索引获取组标签 And I calculate the min of the labels, that will be the new label for them.我计算了标签的最小值,这将是它们的新标签。 I replace all elements with labels in labels for group with the new label.我用新标签替换了组标签中的所有元素。

Or to say, for each iteration, I try to combine two or more existing groups.或者说,对于每次迭代,我尝试组合两个或多个现有组。 If the group has labels of 0 and 2 I find out the new label 0, that is the min of the two.如果该组的标签为 0 和 2,我会找出新标签 0,即两者的最小值。 I than replace them with 0.我比用0替换它们。

def cluser_combine(groups):
    n_groups=len(groups)

    #first, we put all elements appeared in 'gruops' into 'elements'.
    elements=list(set.union(*[set(g) for g in groups]))
    #and sort elements.
    elements.sort()
    n_elements=len(elements)

    #I create a list called clusters, this is the key of this algorithm.
    #I was inspired by sklearn kmeans implementation.
    #they have an attribute called labels_
    #the url is here:
    #https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
    #i called this algorithm cluster combine, because of this inspiration.
    labels=list(range(n_elements))


    #for each group, I get their 'indices' in 'elements'
    #I then get the labels for indices.
    #and i calculate the min of the labels, that will be the new label for them.
    #I replace all elements with labels in labels_for_group with the new label.

    #or to say, for each iteration,
    #i try to combine two or more existing groups.
    #if the group has labels of 0 and 2
    #i find out the new label 0, that is the min of the two.
    #i than replace them with 0.
    for i in range(n_groups):

        #if there is only zero/one element in the group, skip
        if len(groups[i])<=1:
            continue

        indices=list(map(elements.index, groups[i]))

        labels_for_group=list(set([labels[i] for i in indices]))
        #if their is only one label, all the elements in group are already have the same label, skip.
        if len(labels_for_group)==1:

            continue

        labels_for_group.sort()
        label=labels_for_group[0]

        #combine
        for k in range(n_elements):
            if labels[k] in labels_for_group[1:]:
                labels[k]=label


    new_groups=[]
    for label in set(labels):
        new_group = [elements[i] for i, v in enumerate(labels) if v == label]
        new_groups.append(new_group)

    return new_groups

I printed out the detailed result for your question:我打印了您问题的详细结果:

cluser_combine([['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']])

elements:元素:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'k', 'o', 'p'] ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'k', 'o', 'p']
labels:标签:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
--------------------group 0------------------------- --------------------第 0 组----------------------------
the group is:该组是:
['a', 'b', 'c'] ['a', 'b', 'c']
indices for the group in elements元素中组的索引
[0, 1, 2] [0, 1, 2]
labels before combination组合前的标签
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 3, 4, 5, 6, 7, 8, 9] [0, 0, 0, 3, 4, 5, 6, 7, 8, 9]
--------------------group 1------------------------- --------------------第一组-------------------------
the group is:该组是:
['b', 'd', 'e'] ['b', 'd', 'e']
indices for the group in elements元素中组的索引
[1, 3, 4] [1, 3, 4]
labels before combination组合前的标签
[0, 0, 0, 3, 4, 5, 6, 7, 8, 9] [0, 0, 0, 3, 4, 5, 6, 7, 8, 9]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 0, 0, 5, 6, 7, 8, 9] [0, 0, 0, 0, 0, 5, 6, 7, 8, 9]
--------------------group 2------------------------- --------------------第2组-------------------------
the group is:该组是:
['k'] ['k']
--------------------group 3------------------------- --------------------第三组----------------------------
the group is:该组是:
['o', 'p'] ['o', 'p']
indices for the group in elements元素中组的索引
[8, 9] [8, 9]
labels before combination组合前的标签
[0, 0, 0, 0, 0, 5, 6, 7, 8, 9] [0, 0, 0, 0, 0, 5, 6, 7, 8, 9]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 0, 0, 5, 6, 7, 8, 8] [0, 0, 0, 0, 0, 5, 6, 7, 8, 8]
--------------------group 4------------------------- --------------------第4组-------------------------
the group is:该组是:
['e', 'f'] ['e', 'f']
indices for the group in elements元素中组的索引
[4, 5] [4, 5]
labels before combination组合前的标签
[0, 0, 0, 0, 0, 5, 6, 7, 8, 8] [0, 0, 0, 0, 0, 5, 6, 7, 8, 8]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 0, 0, 0, 6, 7, 8, 8] [0, 0, 0, 0, 0, 0, 6, 7, 8, 8]
--------------------group 5------------------------- ---------------------第5组-------------------------
the group is:该组是:
['p', 'a'] ['p', 'a']
indices for the group in elements元素中组的索引
[9, 0] [9, 0]
labels before combination组合前的标签
[0, 0, 0, 0, 0, 0, 6, 7, 8, 8] [0, 0, 0, 0, 0, 0, 6, 7, 8, 8]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 0, 0, 0, 6, 7, 0, 0] [0, 0, 0, 0, 0, 0, 6, 7, 0, 0]
--------------------group 6------------------------- ---------------------第6组-------------------------
the group is:该组是:
['d', 'g'] ['d', 'g']
indices for the group in elements元素中组的索引
[3, 6] [3, 6]
labels before combination组合前的标签
[0, 0, 0, 0, 0, 0, 6, 7, 0, 0] [0, 0, 0, 0, 0, 0, 6, 7, 0, 0]
combining...结合...
labels after combination组合后的标签
[0, 0, 0, 0, 0, 0, 0, 7, 0, 0] [0, 0, 0, 0, 0, 0, 0, 7, 0, 0]
([0, 0, 0, 0, 0, 0, 0, 7, 0, 0], ([0, 0, 0, 0, 0, 0, 0, 7, 0, 0],
[['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]) [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']])

Please refer to my github jupyter notebook for details详情请参考我的github jupyter notebook

Here's my answer.这是我的答案。

orig = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g'], ['k'],['k'],['k']]

def merge_lists(orig):
    def step(orig): 
        mid = []
        mid.append(orig[0])
        for i in range(len(mid)):            
            for j in range(1,len(orig)):                
                for k in orig[j]:
                    if k in mid[i]:                
                        mid[i].extend(orig[j])                
                        break
                    elif k == orig[j][-1] and orig[j] not in mid:
                        mid.append(orig[j])                        
        mid = [sorted(list(set(x))) for x in mid]
        return mid

    result = step(orig)
    while result != step(result):                    
        result = step(result)                  
    return result

merge_lists(orig)
[['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM