Optimal string containing all strings from a given set as substrings

I'm working with some code that needs a large (constant) bit array. Since it contains large constant spans (all 0's or all 1's) I've broken it into a two-level table that allows duplicate spans (constant or otherwise) to be elided, as in:

bitn = table2[table1[n/256] + n%256/8] & (1 << n%8)

At this point, entries of table1 are all multiples of 32 (256 bits), but I wonder if significant savings could be achieved by allowing the spans in table2 to overlap. So my question is (stated in the abstract form):

Given N strings { S_n : n=1..N } each of length K, is there an efficient way to find the shortest length string S such that each S_n is a substring of S?

(Note that since I probably want to keep my bit arrays 8-bit aligned, my particular application of the problem will probably deal with strings of 8-bit bytes rather than strings of bits, but the problem makes sense for any notion of character: bit, byte, or whatever.)
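To make the overlapping idea concrete, here is a minimal sketch of the two-level lookup with byte-aligned spans that share storage. The span size of 32 bits, the table contents, and the name `get_bit` are assumptions chosen to keep the example small; the question itself uses 256-bit spans.

```c
#include <stdint.h>

/* Hypothetical span size; the question uses 256 bits per span. */
enum { SPAN_BITS = 32 };

/* Second-level table: 8 bytes of storage holding three distinct
 * 4-byte spans that overlap:
 *   offset 0: 00 00 00 00
 *   offset 2: 00 00 FF FF
 *   offset 4: FF FF FF FF
 * Stored without overlap they would need 12 bytes. */
static const uint8_t table2[8] = {
    0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF
};

/* First-level table: byte offset into table2 for each span.
 * Offsets no longer have to be multiples of the span size. */
static const uint8_t table1[4] = { 0, 4, 2, 0 };

/* Returns bit n (0 or 1). The caller must keep n below
 * SPAN_BITS * (number of entries in table1). */
static int get_bit(unsigned n)
{
    unsigned byte = table1[n / SPAN_BITS] + n % SPAN_BITS / 8;
    return (table2[byte] >> (n % 8)) & 1;
}
```

Here the 128-bit array is described by four first-level entries, and the third span reuses the tail of the all-zero span and the head of the all-one span.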

First, this problem can be formulated as a TSP. Each string is a node, and we need to find a path that visits all nodes. The distance from string x to string y is defined as len(xy) - len(x), where xy is the shortest string that contains both x and y and starts with x (e.g. x=000111, y=011100, xy=00011100, distance(x,y) = 8-6 = 2).

Note that this also obeys the triangle inequality (distance(x,z) <= distance(x,y) + distance(y,z)). For distinct strings of length K, the distances are integers from 1 to K. The distances are also asymmetric.
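The distance above can be computed from the longest suffix of x that is also a prefix of y. A minimal sketch follows; the function names `overlap` and `distance` are illustrative, and NUL-terminated strings are assumed for brevity (byte spans would carry explicit lengths instead).

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest suffix of x that is also a prefix of y. */
static size_t overlap(const char *x, const char *y)
{
    size_t lx = strlen(x), ly = strlen(y);
    size_t k = lx < ly ? lx : ly;

    /* Try the longest candidate first and shrink until one matches. */
    for (; k > 0; k--)
        if (memcmp(x + lx - k, y, k) == 0)
            return k;
    return 0;
}

/* Number of characters that must be appended to x so that the
 * result ends with y, i.e. len(xy) - len(x). */
static size_t distance(const char *x, const char *y)
{
    return strlen(y) - overlap(x, y);
}
```

With the strings from the example, `distance("000111", "011100")` is 2 while `distance("011100", "000111")` is 4, which shows the asymmetry.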

This version of TSP is called (1,B)-ATSP. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.3439 for an analysis of this problem and an approximate solution.
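For a quick baseline, a simple greedy heuristic is often tried first: repeatedly merge the two strings with the largest overlap until one string remains. This is not the algorithm from the linked paper and it does not guarantee the shortest result; it is only a sketch, and the names `overlap` and `greedy_superstring` are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Longest suffix of x that is also a prefix of y. */
static size_t overlap(const char *x, const char *y)
{
    size_t lx = strlen(x), ly = strlen(y);
    size_t k = lx < ly ? lx : ly;
    for (; k > 0; k--)
        if (memcmp(x + lx - k, y, k) == 0)
            return k;
    return 0;
}

/* Greedy merge: returns a malloc'd string that contains every input
 * string as a substring, or NULL if n == 0 or allocation fails.
 * The caller frees the result. Runs in roughly O(n^3 * K^2) time. */
char *greedy_superstring(const char *const *in, size_t n)
{
    char **s, *result;
    size_t i, j;

    if (n == 0 || (s = malloc(n * sizeof *s)) == NULL)
        return NULL;
    for (i = 0; i < n; i++) {           /* work on private copies */
        s[i] = malloc(strlen(in[i]) + 1);
        if (s[i] == NULL)
            goto fail;
        strcpy(s[i], in[i]);
    }
    while (n > 1) {
        size_t bi = 0, bj = 1, best = 0;
        char *m;

        /* Find the ordered pair (bi, bj) with the largest overlap. */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                size_t ov;
                if (i == j)
                    continue;
                ov = overlap(s[i], s[j]);
                if (ov > best) { best = ov; bi = i; bj = j; }
            }
        /* Merge: s[bi] followed by the non-overlapping tail of s[bj]. */
        m = malloc(strlen(s[bi]) + strlen(s[bj]) - best + 1);
        if (m == NULL) { i = n; goto fail; }
        strcpy(m, s[bi]);
        strcat(m, s[bj] + best);
        free(s[bi]);
        free(s[bj]);
        s[bi] = m;
        if (bj != n - 1)                /* fill the hole with the last entry */
            s[bj] = s[n - 1];
        n--;
    }
    result = s[0];
    free(s);
    return result;

fail:
    while (i--)                         /* free the entries allocated so far */
        free(s[i]);
    free(s);
    return NULL;
}
```

For the two strings in the example above, this yields `00011100`. A refinement worth considering is to drop any string that is already contained in a merged result before continuing.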

Regarding your large, constant bit array with large constant sections, here is an alternative way to design the table for you to consider (I don't know your exact needs, so I can't say whether it would help).

Consider something like a radix tree. For ease of explanation, let me define a get function:

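#define TYP_CONST 0   /* node covers a range whose bits are all equal */
#define TYP_ARRAY 1   /* node covers a range stored as an explicit bit array */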

struct node {
    unsigned min;
    unsigned max;
    int typ;
    union {
        char *bits;
        int constant;
    } d;
    struct node *left;
    struct node *right;
};

struct bit_array {
    unsigned length;
    struct node *root;
};

int get(struct bit_array *b, unsigned ix)
{
    struct node *n = b->root;
    if (ix >= b->length)
        return -1;
    while (n) {
        if (ix > n->max) {
            n = n->right;
            continue;
        } else if (ix < n->min) {
            n = n->left;
            continue;
        }
        if (n->typ == TYP_CONST)
            return n->d.constant;
        ix -= n->min;
        return !!(n->d.bits[ix/8] & (1 << ix%8));
    }
    return -1;
}

In human terms, you want to search through the tree for your bit. Every node is responsible for a range of bits, and you binary search through the ranges to find the range you want.

Once you find your range, there are two options: constant, or array. If constant, just return the constant (saves you a lot of memory). If array, then you do the array lookup in the bit array.

You are going to have O(log n) lookup time in the number of nodes (assuming the tree is reasonably balanced) instead of O(1), although it should still be incredibly fast.

The difficulty here is that setting up the appropriate data structures is annoying and error-prone. But you said the arrays were constant, so that may not be a problem.
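Because the data is constant, one way to sidestep the pointer setup is to flatten the tree into a sorted array of ranges and binary search it. This is a variation on the structure above rather than the same code, and the range boundaries, the `mixed` data, and the name `range_get` are all made up for illustration.

```c
#include <stddef.h>

struct range {
    unsigned min, max;          /* inclusive bit-index range */
    const unsigned char *bits;  /* NULL means the whole range is constant */
    int constant;               /* value returned when bits == NULL */
};

/* Hypothetical explicit data for bits 1000..1015. */
static const unsigned char mixed[] = { 0xA5, 0x0F };

/* Ranges sorted by min, non-overlapping, covering bits 0..4095. */
static const struct range ranges[] = {
    {    0,  999, NULL,  0 },
    { 1000, 1015, mixed, 0 },
    { 1016, 4095, NULL,  1 },
};

/* Returns the bit at index ix, or -1 if ix is not covered. */
static int range_get(unsigned ix)
{
    size_t lo = 0, hi = sizeof ranges / sizeof ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct range *r = &ranges[mid];

        if (ix < r->min) {
            hi = mid;
        } else if (ix > r->max) {
            lo = mid + 1;
        } else {
            if (r->bits == NULL)
                return r->constant;
            ix -= r->min;       /* index relative to the range start */
            return (r->bits[ix / 8] >> (ix % 8)) & 1;
        }
    }
    return -1;
}
```

The whole table can be a `static const` initializer, so nothing has to be built at run time, and the lookup stays logarithmic in the number of ranges.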
