Group list of lists by shared elements

Assume I have the following list of sublists:

l = [['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]

My goal is to rearrange that list into "buckets" such that each sublist in a bucket shares an element with at least one other sublist in the same bucket, and shares no element with any sublist in a different bucket. It is a little hard to explain this in words, but in this case, the desired result would be:

result = [
    [
        ['a', 'b'],
        ['a', 'c'],
        ['b', 'c'],
        ['c', 'd']
    ],
    [
        ['e', 'f'],
        ['f', 'g']
    ],
    [
        ['x', 'y']   
    ],
]
          

The idea here is that ['a', 'b'] goes into Bucket 1. ['a', 'b'] shares elements with ['a', 'c'] and ['b', 'c'], so those go into Bucket 1 as well. Now ['c', 'd'] also shares an element c with the sublists currently in Bucket 1, so it gets added to Bucket 1 as well. After that, there are no more sublists with elements that are shared with those in Bucket 1, so we open a new Bucket 2, starting with ['e', 'f']. ['e', 'f'] shares an element with ['f', 'g'], so that goes into Bucket 2 as well. Then we are done with Bucket 2. ['x', 'y'] gets its own Bucket 3.

I know how to do all of this recursively, but l is very large, and I am wondering whether there is a quicker way to group the elements together!

This code seems to work:

l = [
 ['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]
 
l2 = []

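# merge lists into disjoint sets
for x in l:
    merged = set(x)
    rest = []
    for x2 in l2:
        if x2 & merged:
            # absorb every overlapping set, not only the first match;
            # otherwise a later sublist that bridges two sets leaves them split
            merged |= x2
        else:
            rest.append(x2)
    l2 = rest + [merged]

# output lists, one per set
d = {i: [] for i in range(len(l2))}

# match each list to the set that contains it
for x in l:
    for k in d:
        if l2[k] & set(x):
            d[k].append(x)
            break

# collect dictionary values
fl = list(d.values())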

print(fl)

Output

[[['a', 'b'], 
  ['a', 'c'], 
  ['b', 'c'], 
  ['c', 'd']], 
 [['e', 'f'], 
  ['f', 'g']], 
 [['x', 'y']]]
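The same set-merging idea can also be written with a disjoint-set (union-find) structure, which avoids rescanning every existing set for each new sublist. The following is only a minimal sketch, not part of the original answer: the names group_by_shared_elements, find and union are hypothetical, and it assumes every sublist is non-empty and holds hashable items.

```python
def group_by_shared_elements(sublists):
    """Group sublists into buckets of (transitively) shared elements."""
    parent = {}

    def find(x):
        # Walk up to the root, halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Link every element of a sublist to that sublist's first element.
    for sub in sublists:
        find(sub[0])  # registers single-element sublists too
        for item in sub[1:]:
            union(sub[0], item)

    # Bucket sublists by the root of their first element. Dicts keep
    # insertion order, so buckets appear in order of first occurrence.
    buckets = {}
    for sub in sublists:
        buckets.setdefault(find(sub[0]), []).append(sub)
    return list(buckets.values())


l = [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd'],
     ['e', 'f'], ['f', 'g'], ['x', 'y']]
print(group_by_shared_elements(l))
```

Because the grouping is read off only after all unions are done, the result does not depend on the order in which bridging sublists arrive.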

Here is an alternative using the suggested reduction to a graph problem. I hope the code is clear enough, but I'll still add a few explanations.

Convert to an adjacency list

Just because it's easier to work with:

from collections import defaultdict

edges = [
    ['a', 'b'], 
    ['a', 'c'], 
    ['b', 'c'],
    ['c', 'd'],  
    ['e', 'f'], 
    ['f', 'g'], 
    ['x', 'y'],
]

def graph_from_edges(edges):
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

graph = graph_from_edges(edges)

graph now contains:

{
    'a': {'c', 'b'}, 
    'b': {'c', 'a'}, 
    'c': {'d', 'b', 'a'}, 
    'd': {'c'}, 
    'e': {'f'}, 
    'f': {'e', 'g'}, 
    'g': {'f'}, 
    'x': {'y'}, 
    'y': {'x'}
}
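The unpacking `for u, v in edges` assumes every sublist has exactly two elements. If the real data contains longer (or single-element) sublists, one possible adaptation is to chain consecutive elements, which is enough to keep them in the same connected component. This is a sketch under that assumption; graph_from_sublists is a hypothetical name, and the resulting adjacency sets are smaller than an all-pairs construction even though the components are the same.

```python
from collections import defaultdict


def graph_from_sublists(sublists):
    graph = defaultdict(set)
    for sub in sublists:
        # Touch the first element so single-element sublists become nodes.
        graph[sub[0]]
        # Chain consecutive elements: a-b, b-c, ... keeps them connected.
        for u, v in zip(sub, sub[1:]):
            graph[u].add(v)
            graph[v].add(u)
    return graph


example = graph_from_sublists([['a', 'b', 'c'], ['c', 'd'], ['z']])
print(dict(example))
```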

Find the connected component of a given node

This is a simpler sub-problem to solve: given a node, we explore the graph around it until there are no unvisited nodes left to reach:

def connected_component_from(graph, starting_node):
    # A one-element set; set(starting_node) would split a multi-character
    # node name into its individual characters.
    nodes = {starting_node}
    visited = set()
    while nodes:
        node = nodes.pop()
        yield node
        visited.add(node)
        nodes |= graph[node] - visited

print(list(connected_component_from(graph, 'a')))

This prints the list of nodes in the connected component of node 'a' (the order may vary, since sets are unordered):

['a', 'b', 'c', 'd']

Finding all connected components

Now we just need to repeat the previous operation until we have visited all nodes in the graph. To discover new unexplored components, we simply pick a random unvisited node to start over:

import random

def connected_components(graph):
    all_nodes = set(graph.keys())
    visited = set()
    while all_nodes - visited:
        starting_node = random_node(all_nodes - visited)
        connected_component = set(connected_component_from(graph, starting_node))
        yield connected_component
        visited |= connected_component

def random_node(nodes):
    # random.sample returns a list and no longer accepts sets on recent
    # Python versions; random.choice on a list returns a single node.
    return random.choice(list(nodes))


graph_cc = list(connected_components(graph))
print(graph_cc)

Which prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

Shortcut

You could also use an existing library to compute these connected components for you, for example networkx:

import networkx as nx

G = nx.Graph()

G.add_edges_from(edges)
cc = list(nx.connected_components(G))
print(cc)

Which also prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

In practice that would be the best solution, but it is less interesting if the goal is to learn new things. Note that you can view the networkx implementation of the function (which uses a BFS).
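For readers curious what a breadth-first version looks like without the library, here is a minimal standard-library sketch. It is not the networkx source; bfs_component and example_graph are hypothetical names, and the graph is assumed to be a dict mapping each node to the set of its neighbours, as built earlier.

```python
from collections import deque


def bfs_component(graph, start):
    """Return the set of nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            # Mark nodes when they are queued so each is queued only once.
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


example_graph = {
    'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'},
    'x': {'y'}, 'y': {'x'},
}
print(bfs_component(example_graph, 'a'))
```

Returning a set rather than yielding nodes sidesteps any question about visiting order.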

Going back to the original problem

We managed to find the nodes of each connected component, but that's not what you asked for, so we need to get the original lists back. To do this a bit faster on large graphs, one possibility is to first build a map from node names to the index of their connected component in the previous list:

node_cc_index = {u: i for i, cc in enumerate(graph_cc) for u in cc}
print(node_cc_index)

Which gives (here from a run in which the components came out in a different order than in the printout above):

{'g': 0, 'e': 0, 'f': 0, 'a': 1, 'c': 1, 'd': 1, 'b': 1, 'y': 2, 'x': 2}

We can use that to fill the list of edges, split as you first requested:

edges_groups = [[] for _ in graph_cc]
for u, v in edges:
    edges_groups[node_cc_index[u]].append([u, v])

print(edges_groups)

Which finally gives:

[
    [['e', 'f'], ['f', 'g']], 
    [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd']], 
    [['x', 'y']]
]

Each sublist conserves the original order, but the order between lists is not preserved in any way (it's a direct result of the random choice we made). To avoid this, if it's a problem, we could just replace the random pick by picking the "first" unvisited node.
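One way to read "first unvisited node" is "first in order of appearance in the input". The sketch below folds the steps above into a single function on that assumption; group_edges_in_order is a hypothetical name, and it relies on Python dicts preserving insertion order (guaranteed since Python 3.7).

```python
from collections import defaultdict


def group_edges_in_order(edges):
    # Build adjacency sets; keys are inserted in order of first appearance.
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    node_cc_index = {}
    n_components = 0
    for start in graph:
        if start in node_cc_index:
            continue  # already reached from an earlier start node
        node_cc_index[start] = n_components
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in node_cc_index:
                    node_cc_index[neighbour] = n_components
                    stack.append(neighbour)
        n_components += 1

    # Distribute the original edges over their component's bucket.
    groups = [[] for _ in range(n_components)]
    for u, v in edges:
        groups[node_cc_index[u]].append([u, v])
    return groups


shuffled = [['x', 'y'], ['a', 'b'], ['e', 'f'], ['b', 'c']]
print(group_edges_in_order(shuffled))
```

With this variant the buckets come out in the order in which their first edge appears in the input, so repeated runs give the same result.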

Thanks for the suggestions, everyone, I guess I just needed the right vocabulary! Since a couple of people under the linked answers asked for code to implement all of this, I thought I'd post an answer for future reference. Apparently, the concept of strongly connected components is not defined for undirected graphs, so the solution is to look for connected components.

For my answer, I adjusted the code found here: https://www.geeksforgeeks.org/connected-components-in-an-undirected-graph/

It just requires reformulating l as integers, rather than strings:

class Graph:
    # init function to declare class variables
    def __init__(self, V):
        self.V = V
        self.adj = [[] for i in range(V)]
 
    def DFSUtil(self, temp, v, visited):
 
        # Mark the current vertex as visited
        visited[v] = True
 
        # Store the vertex to list
        temp.append(v)
 
        # Repeat for all vertices adjacent
        # to this vertex v
        for i in self.adj[v]:
            if visited[i] == False:
 
                # Update the list
                temp = self.DFSUtil(temp, i, visited)
        return temp
 
    # method to add an undirected edge
    def addEdge(self, v, w):
        self.adj[v].append(w)
        self.adj[w].append(v)
 
    # Method to retrieve connected components
    # in an undirected graph
    def connectedComponents(self):
        visited = []
        cc = []
        for i in range(self.V):
            visited.append(False)
        for v in range(self.V):
            if visited[v] == False:
                temp = []
                cc.append(self.DFSUtil(temp, v, visited))
        return cc

Now we can run:

l = [[0, 1], 
 [0, 2], 
 [1, 2],
 [2, 3],  
 [4, 5], 
 [5, 6], 
 [7, 8]]


g = Graph(
    max([item for sublist in l for item in sublist])+1
)

for sl in l:
    g.addEdge(sl[0], sl[1])
cc = g.connectedComponents()
print("Following are connected components")
print(cc)

And we get:

Following are connected components
[[0, 1, 2, 3], [4, 5, 6], [7, 8]]
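Since l is described as very large, note that DFSUtil recurses once per vertex along a path, and CPython's default recursion limit is 1000 frames, so a long chain of linked sublists could raise RecursionError. A possible workaround is an explicit stack. This is a sketch, not the GeeksforGeeks code; connected_components_iterative is a hypothetical name and vertices are assumed to be integers in range(num_vertices).

```python
def connected_components_iterative(num_vertices, edge_list):
    # Adjacency lists indexed by vertex number.
    adj = [[] for _ in range(num_vertices)]
    for v, w in edge_list:
        adj[v].append(w)
        adj[w].append(v)

    visited = [False] * num_vertices
    components = []
    for start in range(num_vertices):
        if visited[start]:
            continue
        visited[start] = True
        stack = [start]
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if not visited[w]:
                    visited[w] = True
                    stack.append(w)
        # Sort so the output does not depend on traversal order.
        components.append(sorted(component))
    return components


l = [[0, 1], [0, 2], [1, 2], [2, 3], [4, 5], [5, 6], [7, 8]]
print(connected_components_iterative(9, l))
```

The stack grows on the heap rather than the call stack, so the depth of the graph no longer matters.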

We can then go back and group the original list:

result = []
for sublist in cc:
    bucket = [x for x in l if any(y in x for y in sublist)]
    result.append(bucket)

Output:

[[[0, 1], [0, 2], [1, 2], [2, 3]], [[4, 5], [5, 6]], [[7, 8]]]
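The step of reformulating l as integers can be automated. The sketch below assigns consecutive integers to labels in order of first appearance and keeps a reverse table for decoding; encode_labels, encoded and id_to_label are hypothetical names, not part of the original answer.

```python
def encode_labels(sublists):
    label_to_id = {}
    encoded = []
    for sub in sublists:
        row = []
        for label in sub:
            # setdefault hands out the next free integer on first sight.
            row.append(label_to_id.setdefault(label, len(label_to_id)))
        encoded.append(row)
    # Keys are in insertion order, so position i holds the label with id i.
    id_to_label = list(label_to_id)
    return encoded, id_to_label


strings = [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd'],
           ['e', 'f'], ['f', 'g'], ['x', 'y']]
encoded, id_to_label = encode_labels(strings)
print(encoded)

# Decode any integer result back to the original labels.
decoded = [[id_to_label[i] for i in row] for row in encoded]
print(decoded)
```

For the example data this reproduces the integer version of l used above, and the number of vertices to pass to Graph is simply len(id_to_label).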
