Grouping a list of lists by shared elements

Question

Suppose I have the following list of sublists:

l = [['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]

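My goal is to rearrange this list into "buckets" such that every sublist in a bucket shares an element with at least one other sublist in the same bucket, and shares no element with any sublist in a different bucket. This is a bit hard to express in words, but in this case the desired result is: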

result = [
    [
        ['a', 'b'],
        ['a', 'c'],
        ['b', 'c'],
        ['c', 'd']
    ],
    [
        ['e', 'f'],
        ['f', 'g']
    ],
    [
        ['x', 'y']   
    ],
]

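The idea here is that ['a', 'b'] goes into Bucket 1. ['a', 'b'] shares elements with ['a', 'c'] and ['b', 'c'], so those go into Bucket 1 as well. Now ['c', 'd'] also shares the element c with sublists currently in Bucket 1, so it is added to Bucket 1 too. After that, no remaining sublist shares an element with anything in Bucket 1, so we open a new Bucket 2, starting with ['e', 'f']. ['e', 'f'] shares an element with ['f', 'g'], so that goes into Bucket 2 as well. Then we are done with Bucket 2. ['x', 'y'] gets its own Bucket 3.

I know how to do all of this recursively, but l is very large, and I am wondering whether there is a faster way to group the elements together!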
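
The two conditions described above can be checked mechanically, which is handy for testing any candidate solution. Here is a minimal sketch; `is_valid_bucketing` is a hypothetical helper name, and it only tests the conditions as literally stated (it does not prove that a bucket is fully connected):

```python
from itertools import combinations

def is_valid_bucketing(buckets):
    """Check the two bucket conditions from the question (hypothetical helper)."""
    # Collect the set of elements that occurs in each bucket.
    element_sets = [set().union(*bucket) for bucket in buckets]

    # Condition 2: no element may occur in two different buckets.
    for s1, s2 in combinations(element_sets, 2):
        if s1 & s2:
            return False

    # Condition 1: in a bucket with several sublists, each sublist must
    # share an element with at least one other sublist of that bucket.
    for bucket in buckets:
        if len(bucket) > 1:
            for i, sub in enumerate(bucket):
                others = bucket[:i] + bucket[i + 1:]
                if not any(set(sub) & set(other) for other in others):
                    return False
    return True
```

Calling it on the desired `result` above should return True, while a split such as `[[['a', 'b']], [['a', 'c']]]` should return False because 'a' occurs in both buckets.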

Answer 1

This code seems to work:

l = [
 ['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]
 
l2 = []

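# merge lists into disjoint sets of elements
for x in l:
  merged = set(x)
  rest = []
  for x2 in l2:
     if x2 & merged:
         merged |= x2      # x overlaps this set: absorb it
     else:
         rest.append(x2)
  l2 = rest + [merged]     # absorbing every overlap keeps the sets disjoint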

# output lists
d = {i:[] for i in range(len(l2))}

# match each list to set
for x in l:
  for k in d:
    if len(set(x) & set(l2[k])):
       d[k].append(x) 

# merge dictionary values
fl = [v for v in d.values()]

print(fl)

Output

[[['a', 'b'], 
  ['a', 'c'], 
  ['b', 'c'], 
  ['c', 'd']], 
 [['e', 'f'], 
  ['f', 'g']], 
 [['x', 'y']]]
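
Since l is said to be very large, comparing every sublist against every set may become slow. One possible alternative is a union-find (disjoint-set) structure. That is a different technique from the one in the code above, so treat this as a sketch only. The function name `group_by_shared_elements` is my own, and it assumes that every sublist is non-empty and that its elements are hashable:

```python
def group_by_shared_elements(lists):
    """Group sublists that are linked through shared elements (union-find sketch)."""
    parent = {}

    def find(x):
        # Locate the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All elements of one sublist belong to the same group.
    for sub in lists:
        for item in sub[1:]:
            union(sub[0], item)

    # Collect sublists by the representative of their first element.
    buckets = {}
    for sub in lists:
        buckets.setdefault(find(sub[0]), []).append(sub)
    return list(buckets.values())
```

With this approach a sublist that connects two previously separate groups, for example ['b', 'c'] arriving after ['a', 'b'] and ['c', 'd'], merges them into a single bucket, and buckets come out in the order in which their first sublist appears.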

Answer 2

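Here is an alternative approach using the suggested reduction to a graph problem. I hope the code is clear enough, but I will add some explanation anyway.

Converting to an adjacency list

Simply because it is easier to work with: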

from collections import defaultdict

edges = [
    ['a', 'b'], 
    ['a', 'c'], 
    ['b', 'c'],
    ['c', 'd'],  
    ['e', 'f'], 
    ['f', 'g'], 
    ['x', 'y'],
]

def graph_from_edges(edges):
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

graph = graph_from_edges(edges)

The graph now contains:

{
    'a': {'c', 'b'}, 
    'b': {'c', 'a'}, 
    'c': {'d', 'b', 'a'}, 
    'd': {'c'}, 
    'e': {'f'}, 
    'f': {'e', 'g'}, 
    'g': {'f'}, 
    'x': {'y'}, 
    'y': {'x'}
}
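
The conversion above unpacks every sublist as a pair (u, v). If the real data contains sublists of other lengths, which the question does not say, one way to extend it is to link consecutive elements of each sublist, which is enough to put them all in the same component. A sketch under that assumption, with the hypothetical name `graph_from_sublists`:

```python
from collections import defaultdict

def graph_from_sublists(sublists):
    """Build an adjacency mapping from sublists of any length (sketch)."""
    graph = defaultdict(set)
    for sub in sublists:
        for node in sub:
            graph[node]  # touching the key makes isolated nodes appear too
        # Chain neighbours: a-b, b-c, ... keeps the whole sublist connected.
        for u, v in zip(sub, sub[1:]):
            graph[u].add(v)
            graph[v].add(u)
    return graph
```

For sublists of exactly two elements this produces the same adjacency mapping as graph_from_edges.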

Finding the connected component of a given node

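This is a simpler subproblem: we are given a node and explore the graph around it until there are no unvisited nodes left to reach:

def connected_component_from(graph, starting_node):
    nodes = {starting_node}  # set(starting_node) would split a multi-character name into letters
    visited = set()
    while nodes:
        node = nodes.pop()
        yield node
        visited.add(node)
        nodes |= graph[node] - visited

print(list(connected_component_from(graph, 'a')))

This prints the list of nodes in the connected component of node 'a' (the order may vary, since sets are unordered):

['a', 'b', 'c', 'd']

Finding all connected components

Now we just need to repeat the previous operation until we have visited every node in the graph. To discover a new, unexplored component, we simply pick a random unvisited node to start from again:

import random

def connected_components(graph):
    all_nodes = set(graph.keys())
    visited = set() 
    while all_nodes - visited:
        starting_node = random_node(all_nodes - visited)
        connected_component = set(connected_component_from(graph, starting_node))
        yield connected_component
        visited |= connected_component

def random_node(nodes):
    # random.sample no longer accepts a set (Python 3.11+) and returns a list;
    # random.choice on a tuple returns a single node.
    return random.choice(tuple(nodes))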


graph_cc = list(connected_components(graph))
print(graph_cc)

Which prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

Shortcut

You can also use an existing library to compute these connected components for you, for example networkx:

import networkx as nx

G = nx.Graph()

G.add_edges_from(edges)
cc = list(nx.connected_components(G))
print(cc)

It also prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

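In practice this would be the best solution, but it is less interesting if the goal is to learn something new. Note that you can look at the networkx implementation of that function (it uses a BFS).

Back to the original problem

We managed to find the nodes that belong to the same connected component, but that is not what you asked for, so we need to get back to the original lists. To do this quickly on a large graph, one possibility is to first build a mapping from each node name to the index of its connected component in the previous list: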

node_cc_index = {u: i for i, cc in enumerate(graph_cc) for u in cc}
print(node_cc_index)

This gives (from a run in which the component containing 'e' happened to come first):

{'g': 0, 'e': 0, 'f': 0, 'a': 1, 'c': 1, 'd': 1, 'b': 1, 'y': 2, 'x': 2}

We can use it to fill the list of edges, split up as you originally requested:

edges_groups = [[] for _ in graph_cc]
for u, v in edges:
    edges_groups[node_cc_index[u]].append([u, v])

print(edges_groups)

Which finally gives:

[
    [['e', 'f'], ['f', 'g']], 
    [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd']], 
    [['x', 'y']]
]

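Each sublist preserves the original order, but the order between the lists is not preserved in any way (a direct consequence of our random choice). If that is a problem, we can avoid it by replacing the random pick with picking the "first" unvisited node.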
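
As a rough sketch of that "first unvisited node" idea: Python dictionaries keep insertion order, so iterating over the graph's keys visits nodes in the order they first appear in the edge list. The function below, with the hypothetical name `ordered_edge_groups`, folds the steps above into one self-contained function and assumes every sublist is a pair:

```python
from collections import defaultdict

def ordered_edge_groups(edges):
    """Group edges by connected component, in order of first appearance (sketch)."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    component_of = {}   # node -> index of its component
    next_index = 0
    for start in graph:  # keys are in order of first appearance
        if start in component_of:
            continue
        # Explicit stack instead of random picks or recursion.
        component_of[start] = next_index
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in component_of:
                    component_of[neighbour] = next_index
                    stack.append(neighbour)
        next_index += 1

    groups = [[] for _ in range(next_index)]
    for u, v in edges:
        groups[component_of[u]].append([u, v])
    return groups
```

On the example edges this should return the groups in the order the question asks for, with the ['a', 'b'] group first and ['x', 'y'] last.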

Answer 3

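Thanks everyone for the suggestions, I guess I just needed the right vocabulary! Since several people under the linked answer asked for code implementing all of this, I figured I would post an answer for future reference. Apparently the concept of strongly connected components is not defined for undirected graphs, so the solution is to look for connected components.

For my answer, I adapted the code found here: https://www.geeksforgeeks.org/connected-components-in-an-undirected-graph/

It only requires restructuring l to contain integers instead of strings: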

class Graph:
    # init function to declare class variables
    def __init__(self, V):
        self.V = V
        self.adj = [[] for i in range(V)]
 
    def DFSUtil(self, temp, v, visited):
 
        # Mark the current vertex as visited
        visited[v] = True
 
        # Store the vertex to list
        temp.append(v)
 
        # Repeat for all vertices adjacent
        # to this vertex v
        for i in self.adj[v]:
            if visited[i] == False:
 
                # Update the list
                temp = self.DFSUtil(temp, i, visited)
        return temp
 
    # method to add an undirected edge
    def addEdge(self, v, w):
        self.adj[v].append(w)
        self.adj[w].append(v)
 
    # Method to retrieve connected components
    # in an undirected graph
    def connectedComponents(self):
        visited = []
        cc = []
        for i in range(self.V):
            visited.append(False)
        for v in range(self.V):
            if visited[v] == False:
                temp = []
                cc.append(self.DFSUtil(temp, v, visited))
        return cc
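
One caveat for a very large l: DFSUtil is recursive, and CPython's default recursion limit is 1000, so a long chain of linked vertices could raise RecursionError. A possible iterative variant is sketched below. It is not part of the adapted code: `connected_components_iterative` is a hypothetical standalone function that takes the same kind of adjacency list as self.adj:

```python
def connected_components_iterative(adj):
    """Connected components of an undirected graph without recursion (sketch).

    adj is a list of adjacency lists indexed by integer vertex.
    """
    visited = [False] * len(adj)
    components = []
    for start in range(len(adj)):
        if visited[start]:
            continue
        visited[start] = True
        stack = [start]
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if not visited[w]:
                    visited[w] = True   # mark when pushed to avoid duplicates
                    stack.append(w)
        components.append(component)
    return components
```

It could be called as connected_components_iterative(g.adj) once the edges have been added. The vertices inside each component may come out in a different order than with the recursive version.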

Now we can run

l = [[0, 1], 
 [0, 2], 
 [1, 2],
 [2, 3],  
 [4, 5], 
 [5, 6], 
 [7, 8]]


g = Graph(
    max([item for sublist in l for item in sublist])+1
)

for sl in l:
    g.addEdge(sl[0], sl[1])
cc = g.connectedComponents()
print("Following are connected components")
print(cc)

We get:

Following are connected components
[[0, 1, 2, 3], [4, 5, 6], [7, 8]]

Then we can go back and group the original list:

result = []
for sublist in cc:
    bucket = [x for x in l if any(y in x for y in sublist)]
    result.append(bucket)

Output:

[[[0, 1], [0, 2], [1, 2], [2, 3]], [[4, 5], [5, 6]], [[7, 8]]]
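
The restructuring of l from strings to integers is not shown above. One way it might be done is to number each distinct element in order of first appearance and keep the list of names to translate back. This is a sketch only, and `encode_labels` is a hypothetical helper name:

```python
def encode_labels(lists):
    """Replace each distinct element by a small integer id (sketch).

    Returns the encoded sublists and a list where labels[i] is the
    original element that was given id i.
    """
    index_of = {}
    encoded = []
    for sub in lists:
        # setdefault assigns the next free id the first time an element is seen.
        encoded.append([index_of.setdefault(item, len(index_of)) for item in sub])
    labels = list(index_of)  # dicts keep insertion order, so labels[i] matches id i
    return encoded, labels
```

For the list from the question this should yield the integer list used above, and a grouped result can be turned back into strings with labels[i] for each integer i.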
