
Create a unique identifier for undirected circular sequences

Say I have an undirected circular sequence that looks like this:

  1 —— 2 —— 3
 /           \
1             1
|             |
3             2
 \           /
  3 —— 2 —— 3

Say I have 3 sequences as below, represented by lists of numbers:

seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right

Since the sequence is directionless, all 3 sequences are essentially identical and represent the circular sequence above. In reality, I have thousands of these undirected circular sequences, so comparing every pair of them is impractical. Therefore, I want to create a unique identifier that can represent each unique undirected circular sequence. For example, the identifier should be the same for the 3 sequences above.

My idea is to treat this type of sequence as a circular graph. Then I can assign edge weights as the differences between the two connected nodes, and find the path that traverses all nodes while maximizing the sum of all edge weights. Below is my Python implementation:

def identifier(seq):
    delta_sum = float('-inf')
    res_seq = []
    for i in range(len(seq)):
        new_seq = seq[i:] + seq[:i]
        ds = sum([new_seq[j+1] - new_seq[j] for j in range(len(seq)-1)])
        if ds > delta_sum:
            delta_sum = ds
            res_seq = new_seq
        if -ds > delta_sum:
            delta_sum = -ds
            res_seq = new_seq[::-1]
    return ','.join(map(str, res_seq))

print(identifier(seq1))
print(identifier(seq2))
print(identifier(seq3))

Output:

1,1,2,3,1,2,3,2,3,3
1,1,2,3,1,2,3,2,3,3
1,2,3,2,3,3,1,1,2,3

Clearly my algorithm isn't working. It creates the same identifier for the first two sequences, but a different one for the 3rd sequence. Can anyone suggest a relatively fast algorithm (preferably Python code) that can create a unique identifier for this kind of sequence?

Below are some related questions, but they are not exactly what I want to achieve:

How to check whether two lists are circularly identical in Python

Fast way to compare cyclical data

You could use tuples as hashable identifiers and pick the smallest one among all rotations of the sequence, read in both directions:

def identifier(s):
    return min((*s[i:],*s[:i])[::d] for d in (1,-1) for i in range(len(s)))

Output:

seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right

print(identifier(seq1))
print(identifier(seq2))
print(identifier(seq3))
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
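
Because the identifier is a hashable tuple, it can serve directly as a dictionary key, which avoids comparing every pair of sequences. Below is a minimal sketch of that idea. The helper name `group_by_identifier` and the fourth sequence are made up for illustration and are not part of the original question or answer.

```python
from collections import defaultdict

def identifier(s):
    # Smallest tuple over all rotations, each read forwards and backwards.
    return min((*s[i:], *s[:i])[::d] for d in (1, -1) for i in range(len(s)))

def group_by_identifier(seqs):
    """Map each canonical identifier to the indexes of the sequences sharing it.

    Hypothetical helper: one identifier computation per sequence,
    then a dictionary lookup instead of pairwise comparison.
    """
    groups = defaultdict(list)
    for idx, s in enumerate(seqs):
        groups[identifier(s)].append(idx)
    return dict(groups)

seqs = [
    [1, 1, 3, 3, 2, 3, 2, 1, 3, 2],  # seq1 from the question
    [3, 2, 3, 3, 1, 1, 2, 3, 1, 2],  # seq2 from the question
    [3, 1, 2, 3, 2, 3, 3, 1, 1, 2],  # seq3 from the question
    [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],  # made-up sequence, not equivalent to the others
]

groups = group_by_identifier(seqs)
for key, members in groups.items():
    print(key, members)
```

The three sequences from the question end up under one key, and the made-up fourth sequence gets a key of its own.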

Given that the smallest tuple always starts with the smallest value, you can optimize this a bit by first finding the minimum value and only comparing the tuples that start at one of its occurrences:

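def identifier(seq):
    start = min(seq)
    # Rotate both the sequence and its reversal, so that every candidate,
    # in either direction, begins at an occurrence of the minimum value.
    # (Reversing a rotation afterwards would put the minimum at the end.)
    return min((*s[i:], *s[:i])
               for s in (seq, seq[::-1])
               for i, v in enumerate(s) if v == start)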
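
One detail matters when restricting the start positions: the candidates read in the reverse direction must also begin at a minimum value. Reversing a rotation that starts at a minimum moves that minimum to the end, so a safe approach is to reverse the sequence first and then rotate it. The sketch below checks this against the brute-force version on random inputs. The names `identifier_naive`, `identifier_fast` and `random_equivalent` are chosen here for illustration.

```python
import random

def identifier_naive(s):
    # Brute force: every rotation, read in both directions.
    return min((*s[i:], *s[:i])[::d] for d in (1, -1) for i in range(len(s)))

def identifier_fast(seq):
    # Only rotations that begin at a minimum value, taken from the
    # sequence and from its reversal (reverse first, then rotate).
    start = min(seq)
    return min((*s[i:], *s[:i])
               for s in (seq, seq[::-1])
               for i, v in enumerate(s) if v == start)

def random_equivalent(seq, rng):
    """Return a random rotation of seq, reversed about half of the time."""
    k = rng.randrange(len(seq))
    out = seq[k:] + seq[:k]
    return out[::-1] if rng.random() < 0.5 else out

rng = random.Random(0)  # fixed seed so the check is reproducible
for _ in range(1000):
    seq = [rng.randint(1, 3) for _ in range(rng.randint(1, 12))]
    other = random_equivalent(seq, rng)
    # The restricted search must agree with brute force ...
    assert identifier_fast(seq) == identifier_naive(seq)
    # ... and must not depend on the starting point or direction.
    assert identifier_fast(other) == identifier_fast(seq)
```

The saving depends on how often the minimum value occurs. If most elements equal the minimum, nearly every position is still a candidate start and the cost approaches that of the brute-force version.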
