Optimal string containing all strings from a given set as substrings

I'm working with some code that needs a large (constant) bit array. Since it contains large constant spans (all 0's or all 1's), I've broken it into a two-level table that allows duplicate spans (constant or otherwise) to be elided, as in:

bitn = table2[table1[n/256] + n%256/8] & (1 << n%8)

At this point, entries of table1 are all multiples of 32 (that is, 32 bytes, or 256 bits), but I wonder if significant savings could be achieved by allowing the spans in table2 to overlap. So my question is (stated in abstract form):

Given N strings { S_n : n=1..N }, each of length K, is there an efficient way to find the shortest string S such that each S_n is a substring of S?

(Note that since I probably want to keep my bit arrays 8-bit aligned, my particular application of the problem will probably deal with strings of 8-bit bytes rather than strings of bits, but the problem makes sense for any notion of character: bit, byte, or whatever.)
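For concreteness, here is a minimal sketch of the two-level lookup described in the question. The function name `get_bit` and the element types (`uint16_t` offsets, `uint8_t` data) are assumptions for illustration; the question does not specify them. Nothing in the lookup requires the offsets in table1 to be multiples of 32, which is what makes overlapping spans possible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the question's two-level lookup.
 * table1[i] holds the byte offset into table2 of the 32-byte (256-bit)
 * span that covers bits 256*i .. 256*i + 255. Identical spans can share
 * one offset, and offsets need not be multiples of 32. */
static int get_bit(const uint16_t *table1, const uint8_t *table2, size_t n)
{
    size_t span = table1[n / 256];         /* start of the span in table2 */
    size_t byte = span + (n % 256) / 8;    /* byte holding bit n */
    return (table2[byte] >> (n % 8)) & 1;  /* normalized to 0 or 1 */
}
```

With a 64-byte table2 holding one all-zero span followed by one all-one span, `table1 = {0, 32, 0}` describes 768 bits using only 64 bytes of span data.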

First, this problem can be formulated as a TSP. We have a set of nodes (each string is a node) and we need to find a path that visits all nodes. The distance between strings x and y is defined as len(xy) - len(x), where xy is the shortest string that contains both x and y and starts with x (e.g. x=000111, y=011100, xy=00011100, distance(x,y) = 8-6 = 2).

Note that this also obeys the triangle inequality (distance(x,z) <= distance(x,y) + distance(y,z)). The distances are integers from 1 to K. Also, distances are asymmetric.
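As a minimal sketch, this distance can be computed from the longest suffix of x that is also a prefix of y. The code assumes the distance is K minus that overlap, which matches the worked example (distance 2 for x=000111, y=011100). The function names `overlap` and `distance` are hypothetical, and only proper overlaps (shorter than K) are considered so that the result stays in the range 1..K.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest proper suffix of x that is also a prefix of y.
 * Both strings have length k; they need not be NUL-terminated. */
static size_t overlap(const char *x, const char *y, size_t k)
{
    for (size_t len = (k ? k - 1 : 0); len > 0; len--) {
        if (memcmp(x + (k - len), y, len) == 0)
            return len;
    }
    return 0;
}

/* distance(x, y) = len(xy) - len(x) = k - overlap(x, y). */
static size_t distance(const char *x, const char *y, size_t k)
{
    return k - overlap(x, y, k);
}
```

Swapping the arguments in the example gives a different value (the suffix "00" of 011100 matches only the first two characters of 000111), which illustrates the asymmetry.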

This version of TSP is called (1,B)-ATSP. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.3439 for an analysis of this problem and an approximate solution.
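To make the TSP view concrete, here is a sketch of a simple nearest-neighbour heuristic. This is not the approximation algorithm from the linked paper and carries no optimality guarantee: it starts from the first string and keeps appending whichever unused string overlaps most with the last one appended. The function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Longest proper suffix of x that is also a prefix of y (both length k). */
static size_t suffix_prefix_overlap(const char *x, const char *y, size_t k)
{
    for (size_t len = (k ? k - 1 : 0); len > 0; len--)
        if (memcmp(x + (k - len), y, len) == 0)
            return len;
    return 0;
}

/* Nearest-neighbour heuristic: returns a malloc'd, NUL-terminated string
 * containing every strs[i] (each of length k) as a substring, or NULL if
 * n == 0 or allocation fails. The result is valid but not necessarily
 * the shortest possible. The caller frees it. */
static char *greedy_superstring(const char *const *strs, size_t n, size_t k)
{
    if (n == 0)
        return NULL;
    char *out = malloc(n * k + 1);          /* worst case: no overlaps */
    unsigned char *used = calloc(n, 1);
    if (!out || !used) {
        free(out);
        free(used);
        return NULL;
    }
    memcpy(out, strs[0], k);
    size_t out_len = k, last = 0;
    used[0] = 1;
    for (size_t step = 1; step < n; step++) {
        size_t best = n, best_ov = 0;
        for (size_t j = 0; j < n; j++) {    /* pick the closest unused node */
            if (used[j])
                continue;
            size_t ov = suffix_prefix_overlap(strs[last], strs[j], k);
            if (best == n || ov > best_ov) {
                best = j;
                best_ov = ov;
            }
        }
        /* Append only the part of strs[best] not already covered. */
        memcpy(out + out_len, strs[best] + best_ov, k - best_ov);
        out_len += k - best_ov;
        used[best] = 1;
        last = best;
    }
    out[out_len] = '\0';
    free(used);
    return out;
}
```

Because every string has length K, the last K characters of the output are always the most recently appended string, so overlaps only need to be computed between original strings.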

Regarding your large, constant bit array with large constant sections, here is an alternative way to design the table for you to consider (I don't know your exact needs, so I can't say whether it would help).

Consider something like a radix tree. For ease of explanation, let me define a get function:

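#define TYP_CONST 0    /* node covers a run of identical bits */
#define TYP_ARRAY 1    /* node stores its bits explicitly */

struct node {
    unsigned min;          /* first bit index covered (inclusive) */
    unsigned max;          /* last bit index covered (inclusive) */
    int typ;               /* TYP_CONST or TYP_ARRAY */
    union {
        char *bits;        /* TYP_ARRAY: packed bits, lowest bit first */
        int constant;      /* TYP_CONST: the bit value, 0 or 1 */
    } d;
    struct node *left;     /* subtree for indices below min */
    struct node *right;    /* subtree for indices above max */
};

struct bit_array {
    unsigned length;       /* total number of bits */
    struct node *root;
};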

int get(struct bit_array *b, unsigned ix)
{
    struct node *n = b->root;
    if (ix >= b->length)
        return -1;
    while (n) {
        if (ix > n->max) {
            n = n->right;
            continue;
        } else if (ix < n->min) {
            n = n->left;
            continue;
        }
        if (n->typ == TYP_CONST)
            return n->d.constant;
        ix -= n->min;
        return !!(n->d.bits[ix/8] & (1 << ix%8));
    }
    return -1;
}

In human terms, you search through the tree for your bit. Every node is responsible for a range of bits, and you binary search through the ranges to find the one you want.

Once you find your range, there are two options: constant, or array. If constant, just return the constant (saves you a lot of memory). If array, then you do the array lookup in the bit array.

You are going to have O(log n) lookup time instead of O(1), although it should still be incredibly fast.

The difficulty here is that setting up the appropriate data structures is annoying and error-prone. But you said the arrays were constant, so that may not be a problem.
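Because the array is constant, the tree can be written out as static initializers. The following is a minimal sketch under assumed data: a hypothetical 1024-bit array whose bits 0..511 are 0, bits 512..527 are stored explicitly, and bits 528..1023 are 1. The names `lo`, `hi`, `root`, `mid_bits`, and `tree_get` are illustrative, and the node layout is a const-qualified variant of the struct above.

```c
#include <stddef.h>

enum { TYP_CONST, TYP_ARRAY };

struct node {
    unsigned min, max;              /* inclusive bit range covered */
    int typ;                        /* TYP_CONST or TYP_ARRAY */
    union {
        const unsigned char *bits;  /* TYP_ARRAY: packed bits, lowest bit first */
        int constant;               /* TYP_CONST: 0 or 1 */
    } d;
    const struct node *left, *right;
};

/* Assumed example data: 16 explicit bits for indices 512..527. */
static const unsigned char mid_bits[2] = { 0xA5, 0x0F };

static const struct node lo   = {   0,  511, TYP_CONST, { .constant = 0 }, NULL, NULL };
static const struct node hi   = { 528, 1023, TYP_CONST, { .constant = 1 }, NULL, NULL };
static const struct node root = { 512,  527, TYP_ARRAY, { .bits = mid_bits }, &lo, &hi };

/* Same search as get(): descend by range, then read a constant or a bit.
 * Returns -1 if no node covers ix. */
static int tree_get(const struct node *n, unsigned ix)
{
    while (n) {
        if (ix > n->max) {
            n = n->right;
        } else if (ix < n->min) {
            n = n->left;
        } else if (n->typ == TYP_CONST) {
            return n->d.constant;
        } else {
            ix -= n->min;
            return (n->d.bits[ix / 8] >> (ix % 8)) & 1;
        }
    }
    return -1;
}
```

In this layout the 1024-bit array needs only two bytes of explicit bit data plus three nodes. Whether that beats the two-level table depends on how many distinct non-constant spans the real data has.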
