Merging overlapping (str) objects
The problem is as follows:
I want to go from this set
{'A/B', 'B/C', 'C/D', 'D/E', ..., 'U/V', 'V/W', ..., 'X/Y', ..., 'Z', ...}
to this set
{'A/B/C/D/E', ..., 'U/V/W', ..., 'X/Y', ..., 'Z', ...}
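where the objects A, B, C... are just strings. The solution should be independent of the order in which the objects appear (i.e. if you shuffle the objects in the set, the output should always be the same).
In other words, I want to merge overlapping objects.
Inputs of the following forms will not occur: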
{"A/B", "B/C", "B/D"}
{"A/B", "B/C", "C/A"}
There can be objects that contain no '/'.
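To make these constraints concrete, here is a minimal sketch of a check for them. The helper name `is_valid_input` is hypothetical, and rejecting a node with two different predecessors is an assumption that the examples above do not state explicitly.

```python
def is_valid_input(items):
    """Return True if no node branches and no cycle exists (sketch)."""
    succ, pred = {}, {}
    for s in items:
        parts = s.split('/')
        for a, b in zip(parts, parts[1:]):
            # A node with two different successors (or, by assumption,
            # two different predecessors) counts as a branch.
            if succ.setdefault(a, b) != b or pred.setdefault(b, a) != a:
                return False
    # Without branching, a cycle is a walk that returns to its start.
    for start in succ:
        node, steps = start, 0
        while node in succ and steps <= len(succ):
            node = succ[node]
            steps += 1
            if node == start:
                return False
    return True


print(is_valid_input({"A/B", "B/C", "B/D"}))       # False (B branches)
print(is_valid_input({"A/B", "B/C", "C/A"}))       # False (cycle)
print(is_valid_input({"A/B", "B/C", "X/Y", "Z"}))  # True
```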
Here is the partial solution I came up with:
import itertools

example={'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'}
def ext_3plus(unit):
    for couple in list(itertools.combinations(list(unit),2)):
        if '/' in couple[0] and '/' in couple[1]:
            if couple[0].split('/')[0]==couple[1].split('/')[1]:
                unit.remove(couple[0])
                unit.remove(couple[1])
                unit.add(couple[1].split('/')[0]+'/'+couple[0])
            if couple[0].split('/')[1]==couple[1].split('/')[0]:
                unit.remove(couple[0])
                unit.remove(couple[1])
                unit.add(couple[0]+'/'+couple[1].split('/')[1])
        else: #the input can contain object not having '/'
            continue
There are two problems. First, it only does one iteration, so the result for {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'} is:
{'A/B/C', 'C/D/E', 'U/V/W', 'X/Y'}
Second, if I include an object that does not contain '/', so that the input is {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y','Z'}, the result differs from the previous one:
{'A/B', 'B/C/D', 'D/E', 'U/V/W', 'X/Y', 'Z'}
So the first iteration should be followed by a recursive call, and so on. How should this be done?
If I understand correctly, this can be seen as a graph problem and solved like this:
import networkx as nx

example = {'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', "Z"}

# convert each string to an edge
# each pattern to the side of / is a node
edges = [tuple(s.split("/")) for s in example if "/" in s]
nodes = [s for s in example if "/" not in s]

# create directed graph from edges
g = nx.from_edgelist(edges, create_using=nx.DiGraph)
g.add_nodes_from(nodes)

# find each path using topological sort
runs, current = [], []
for e in nx.topological_sort(g):
    # start a new path each time a node with in-degree 0 is found
    # in-degree 0 means it is the start of a new path
    if g.in_degree(e) == 0:
        if current:
            runs.append(current)
        current = []
    current.append(e)
if current:
    runs.append(current)

# format the result
result = ["/".join(run) for run in runs]
print(result)
Output
['Z', 'U/V/W', 'X/Y', 'A/B/C/D/E']
If I'm not mistaken, the overall complexity of this approach is O(n). More information about topological sorting can be found here.
Update
In networkx 2.6.4, use lexicographical_topological_sort.
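A topological order is not guaranteed, in general, to list the nodes of each chain contiguously, so the grouping above depends on the order the sort happens to produce. As a hedged alternative, here is a minimal pure-Python sketch (no networkx) that follows each chain from its start node through a successor map. The function name `merge_chains` is hypothetical, and the sketch assumes the input has no branching and no cycles, as the question states.

```python
def merge_chains(items):
    """Merge overlapping 'X/Y' strings into maximal chains (sketch)."""
    succ = {}          # node -> its single successor
    has_pred = set()   # nodes that have an incoming edge
    nodes = set()
    for s in items:
        parts = s.split('/')
        nodes.update(parts)
        # Consecutive parts form edges; a string without '/' adds none.
        for a, b in zip(parts, parts[1:]):
            succ[a] = b
            has_pred.add(b)

    result = set()
    # Every chain starts at a node with no predecessor.
    for start in nodes - has_pred:
        chain = [start]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        result.add('/'.join(chain))
    return result


example = {'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', 'Z'}
print(sorted(merge_chains(example)))
# ['A/B/C/D/E', 'U/V/W', 'X/Y', 'Z']
```

Because the chains are rebuilt from a dictionary and a set, the result does not depend on the order in which the input strings are visited.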
You can use a recursive generator function:
vals = ['A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y']
data = [i.split('/') for i in vals]

def paths(d, c = [], s = []):
    if not (k:=[b for a, b in data if a == d]):
        yield c+[d]
        if (t:=[a for a, b in data if a not in s+[d]]):
            yield from paths(t[0], c = [], s=s+[d])
    else:
        yield from [j for i in k for j in paths(i, c=c+[d], s=s+[d])]

vals = list(paths(data[0][0]))
Output:
[['A', 'B', 'C', 'D', 'E'], ['U', 'V', 'W'], ['X', 'Y']]
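However, it should be noted that the solution above only works for inputs made up of standard edge definitions (exactly two items separated by /). If the entries of vals can vary in the number of items separated by /, you can use the solution below: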
class Node:
    def __init__(self, n, c = []):
        self.n, self.c = n, c
    def __contains__(self, e):
        return e[0] == self.n or e[-1] == self.n or any(e in i for i in self.c)
    def add_edge(self, e):
        if self.n != e[0] and len(e) > 1 and (m:=[i for i in self.c if i.n == e[-1]]):
            self.c = [i for i in self.c if i != m[0]]+[Node(e[0], [m[0]])]
        elif self.n == e[0]:
            if len(e) > 1 and not any(i.n == e[1] for i in self.c):
                self.c = [*self.c, Node(e[1])]
        elif (m:=[i for i in self.c if e in i]):
            m[0].add_edge(e)
        else:
            self.c = [*self.c, Node(e[0], [] if len(e) == 1 else [Node(e[1])])]

vals = ['A/B/C', 'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', 'K']
n = Node(None)
for i in vals:
    k = i.split('/')
    for j in range(len(k)):
        n.add_edge(k[j:j+2])

def get_paths(n, c = []):
    if not n.c:
        yield c+[n.n]
    else:
        yield from [j for k in n.c for j in get_paths(k, c+[n.n])]

final_result = [i[1:] for i in get_paths(n)]
print(final_result)
Output:
[['A', 'B', 'C', 'D', 'E'], ['U', 'V', 'W'], ['X', 'Y'], ['K']]
With the trie-style approach using class Node, the order of the input (vals) does not matter (no sorting is needed), and input paths of any depth can be added.
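To illustrate why paths of any depth are not a special case, here is a small sketch showing that a path with several '/' separators breaks down into consecutive pairs, which are ordinary edges. The helper name `to_edges` is hypothetical and is not part of the code above.

```python
def to_edges(path):
    """Split 'A/B/C' into its consecutive pairs (sketch)."""
    parts = path.split('/')
    # zip pairs each part with the one that follows it.
    return list(zip(parts, parts[1:]))


print(to_edges('A/B/C'))  # [('A', 'B'), ('B', 'C')]
print(to_edges('K'))      # []
```

A lone item such as 'K' yields no edges, so it has to be recorded as a node on its own.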
It may not be the most efficient, but you can repeat the loop until nothing is modified.
import itertools

def ext_3plus(unit):
    while True:
        oldlen = len(unit)
        for couple in itertools.combinations(list(unit), 2):
            if '/' in couple[0] and '/' in couple[1]:
                # Compare the first part of one item with the last part of
                # the other, so already-merged items like 'A/B/C' also work.
                if couple[0].split('/')[0] == couple[1].split('/')[-1]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[1] + '/' + couple[0].split('/', 1)[1])
                    break  # the snapshot of pairs is stale, so rescan
                if couple[0].split('/')[-1] == couple[1].split('/')[0]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[0] + '/' + couple[1].split('/', 1)[1])
                    break  # the snapshot of pairs is stale, so rescan
        if len(unit) == oldlen:
            # Nothing was merged, so we're done
            break