Merging overlapping (str) objects

The problem is as follows:

I want to go from this set

{'A/B', 'B/C', 'C/D', 'D/E', ..., 'U/V', 'V/W', ..., 'X/Y', ..., 'Z', ...}

to this set

{'A/B/C/D/E', ..., 'U/V/W', ..., 'X/Y', ..., 'Z', ...}

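where the objects A, B, C, ... are just strings. The solution should be independent of the order in which the objects appear (i.e. if you shuffle the objects in the set, the result should always be the same).

In other words, I want to merge overlapping objects.

Inputs of the following forms cannot occur: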

{"A/B", "B/C", "B/D"}
{"A/B", "B/C", "C/A"}

There can be objects that contain no '/'.

Here is the partial solution I came up with:

    import itertools

    example={'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'}
    
    def ext_3plus(unit):
        for couple in list(itertools.combinations(list(unit),2)):
            if '/' in couple[0] and '/' in couple[1]:
                if couple[0].split('/')[0]==couple[1].split('/')[1]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[1].split('/')[0]+'/'+couple[0])
                if couple[0].split('/')[1]==couple[1].split('/')[0]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[0]+'/'+couple[1].split('/')[1])
            else: #the input can contain object not having '/'
                continue

There are two problems. First, it only performs one iteration, so for {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'} the result is:

{'A/B/C', 'C/D/E', 'U/V/W', 'X/Y'}

Second, if I include an object that does not contain '/', so that the input is {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y','Z'}, the result differs from the previous one:

{'A/B', 'B/C/D', 'D/E', 'U/V/W', 'X/Y', 'Z'}

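So the first iteration should be followed by a recursive call, and so on. How should this be done?

If I understand correctly, this can be seen as a graph problem and solved like this: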

import networkx as nx

example = {'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', "Z"}

# convert each string to an edge
# the pattern on each side of / is a node
edges = [tuple(s.split("/")) for s in example if "/" in s]

nodes = [s for s in example if "/" not in s]

# create directed graph from edges
g = nx.from_edgelist(edges, create_using=nx.DiGraph)
g.add_nodes_from(nodes)

# find each path using topological sort
runs, current = [], []
for e in nx.topological_sort(g):
    # start a new path each time a node with in-degree 0
    # in-degree 0 means it is the start of a new path
    if g.in_degree(e) == 0:
        if current:
            runs.append(current)
            current = []
    current.append(e)

if current:
    runs.append(current)

# format the result
result = ["/".join(run) for run in runs]
print(result)

Output

['Z', 'U/V/W', 'X/Y', 'A/B/C/D/E']

If I'm not mistaken, the overall complexity of this approach is O(n). More information about topological sorting can be found here.

Update

In networkx 2.6.4, use lexicographical_topological_sort.
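The loop above relies on the nodes of each path coming out of the sort next to each other. As a minimal sketch of the same graph idea that does not depend on that ordering (and does not need networkx), each chain can be followed from its start node. The helper name `merge_chains` is hypothetical, and the sketch assumes what the question states: no branching and no cycles in the input.

```python
def merge_chains(items):
    """Merge overlapping 'X/Y' strings into maximal chains (assumes no branches or cycles)."""
    successor = {}           # node -> the node that follows it
    has_predecessor = set()  # nodes that appear on the right of some '/'
    nodes = set()
    for item in items:
        parts = item.split("/")
        nodes.update(parts)
        for left, right in zip(parts, parts[1:]):
            successor[left] = right
            has_predecessor.add(right)

    merged = set()
    # every node without a predecessor starts exactly one chain
    for start in nodes - has_predecessor:
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        merged.add("/".join(chain))
    return merged


example = {'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', 'Z'}
print(merge_chains(example))
```

Because the result is built from dictionary lookups rather than from iteration order, shuffling the input should not change the returned set.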

You can use a recursive generator function:

vals = ['A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y']
data = [i.split('/') for i in vals]
def paths(d, c = [], s = []):
   if not (k:=[b for a, b in data if a == d]):
      yield c+[d]
      if (t:=[a for a, b in data if a not in s+[d]]):
         yield from paths(t[0], c = [], s=s+[d])
   else:
       yield from [j for i in k for j in paths(i, c=c+[d], s=s+[d])]

vals = list(paths(data[0][0]))

Output:

[['A', 'B', 'C', 'D', 'E'], ['U', 'V', 'W'], ['X', 'Y']]

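However, it should be noted that the solution above only works on inputs made of standard two-item edge definitions. If the entries of vals can vary in the number of items separated by /, then you can use the solution below: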

class Node:
    def __init__(self, n, c = []):
       self.n, self.c = n, c
    def __contains__(self, e):
       return e[0] == self.n or e[-1] == self.n or any(e in i for i in self.c)
    def add_edge(self, e):
       if self.n != e[0] and len(e) > 1 and (m:=[i for i in self.c if i.n == e[-1]]):
          self.c = [i for i in self.c if i != m[0]]+[Node(e[0], [m[0]])]
       elif self.n == e[0]:
          if len(e) > 1 and not any(i.n == e[1] for i in self.c):
             self.c = [*self.c, Node(e[1])]
       elif (m:=[i for i in self.c if e in i]):
          m[0].add_edge(e)
       else:
          self.c = [*self.c, Node(e[0], [] if len(e) == 1 else [Node(e[1])])]
                    
vals = ['A/B/C', 'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', 'K']
n = Node(None)
for i in vals:
    k = i.split('/')
    for j in range(len(k)):
        n.add_edge(k[j:j+2])

def get_paths(n, c = []):
   if not n.c:
      yield c+[n.n]
   else:
      yield from [j for k in n.c for j in get_paths(k, c+[n.n])]

final_result = [i[1:] for i in get_paths(n)]
print(final_result)

Output:

[['A', 'B', 'C', 'D', 'E'], ['U', 'V', 'W'], ['X', 'Y'], ['K']]

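With the trie-style approach using class Node, the order of the input (vals) does not matter (no sorting is needed), and input paths of any depth can be added.

It may not be the most efficient, but you can repeat the loop until nothing is modified.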

import itertools

def ext_3plus(unit):
    while True:
        oldlen = len(unit)
        for couple in itertools.combinations(list(unit), 2):
            if '/' in couple[0] and '/' in couple[1]:
                first, second = couple[0].split('/'), couple[1].split('/')
                if first[0] == second[-1]:
                    # couple[1] ends where couple[0] starts
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add('/'.join(second + first[1:]))
                    break  # the snapshot of unit is stale, so rescan
                if first[-1] == second[0]:
                    # couple[0] ends where couple[1] starts
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add('/'.join(first + second[1:]))
                    break  # the snapshot of unit is stale, so rescan
        if len(unit) == oldlen:
            # Nothing was merged, so we're done
            break
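One way to check the order-independence requirement from the question is to shuffle the input several times and compare the results. The sketch below is only an illustration: `merge_until_stable` is a hypothetical list-based variant of the repeat-until-stable idea, written so that the input order is explicit, and it assumes the input has no branches or cycles.

```python
import random

def merge_until_stable(items):
    """Repeatedly join two strings where one ends with the part the other starts with."""
    pool = list(items)
    merged_something = True
    while merged_something:
        merged_something = False
        for i, left in enumerate(pool):
            for j, right in enumerate(pool):
                if i != j and left.split("/")[-1] == right.split("/")[0]:
                    joined = left.split("/") + right.split("/")[1:]
                    # drop the two merged strings and add their union
                    pool = [p for k, p in enumerate(pool) if k not in (i, j)]
                    pool.append("/".join(joined))
                    merged_something = True
                    break
            if merged_something:
                break  # pool changed, so start the scan again
    return set(pool)


example = ['A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', 'Z']
expected = merge_until_stable(example)
for _ in range(20):
    random.shuffle(example)
    assert merge_until_stable(example) == expected
print(expected)
```

Each merge shrinks the pool by one string, so the loop ends after at most len(items) - 1 merges.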

