简体   繁体   中英

Finding all the shortest unique substring which are of same length?

Given a string sequence which contains only four letters, ['a','g','c','t'] for example: agggcttttaaaatttaatttgggccc .

Find all the shortest unique sub-string of the string sequence which are of equal length (the length should be minimum of all the unique sub-strings) ?

For example : aaggcgccttt answer: ['aa', 'ag', 'gg','cg', 'cc','ct'] explanation:shortest unique sub-string of length 2

I have tried using suffix-arrays coupled with longest common prefix but i am unable to draw the solution perfectly.

I'm not sure what you mean by "minimum unique sub-string", but looking at your example I assume you mean "shortest runs of a single letter". If this is the case, you just need to iterate through the string once (character by character) and count all the shortest runs you find. You should keep track of the length of the minimum run found so far (infinity at start) and the length of the current run.

\n

If you need to find the exact runs, you can add all the minimum runs you find to eg a list as you iterate through the string (and modify that list accordingly if a shorter run is found).

EDIT: I thought more about the problem and came up with the following solution.

We find all the unique sub-strings of length i (in ascending order). So, first we consider all sub-strings of length 1, then all sub-strings of length 2, and so on. If we find any, we stop, since the sub-string length can only increase from this point.

You will have to use a list to keep track of the sub-strings you've seen so far, and a list to store the actual sub-strings. You will also have to maintain them accordingly as you find new sub-strings.

Here's the Java code I came up with, in case you need it:

        String str = "aaggcgccttt";
        String curr = "";
        ArrayList<String> uniqueStrings = new ArrayList<String>();
        ArrayList<String> alreadySeen = new ArrayList<String>();

        for (int i = 1; i < str.length(); i++) {
            for (int j = 0; j < str.length() - i + 1; j++) {
                curr = str.substring(j, j + i); 

                if (!alreadySeen.contains(curr)){ //Sub-string hasn't been seen yet
                    uniqueStrings.add(curr);
                    alreadySeen.add(curr);
                }
                else //Repeated sub-string found
                    uniqueStrings.remove(curr);
            }

            if (!uniqueStrings.isEmpty()) //We have found non-repeating sub-string(s)
                break;

            alreadySeen.clear();
        }

        //Output
        if (uniqueStrings.isEmpty())
            System.out.println(str);
        else {
            for (String s : uniqueStrings)
                System.out.println(s);
        }

The uniqueStrings list contains all the unique sub-strings of minimum length (used for output). The alreadySeen list keeps track of all the sub-strings that have already been seen (used to exclude repeating sub-strings).

I'll write some code in Python, because that's what I find the easiest. I actually wrote both the overlapping and the non-overlapping variants. As a bonus, it also checks that the input is valid. You seems to be interested only in the overlapping variant:

import itertools


def find_all(
        text,
        pattern,
        overlap=False):
    """
    Find all occurrencies of the pattern in the text.

    Args:
        text (str|bytes|bytearray): The input text.
        pattern (str|bytes|bytearray): The pattern to find.
        overlap (bool): Detect overlapping patterns.

    Yields:
        position (int): The position of the next finding.
    """
    len_text = len(text)
    offset = 1 if overlap else (len(pattern) or 1)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += offset
        else:
            break


def is_valid(text, tokens):
    """
    Check if the text only contains the specified tokens.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.

    Returns:
        result (bool): The result of the check.
    """
    return set(text).issubset(set(tokens))


def shortest_unique_substr(
        text,
        tokens='acgt',
        overlapping=True,
        check_valid_input=True):
    """
    Find the shortest unique substring.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.
        overlap (bool)
        check_valid_input (bool): Check if the input is valid.

    Returns:
        result (set): The set of the shortest unique substrings.
    """
    def add_if_single_match(
            text,
            pattern,
            result,
            overlapping):
        match_gen = find_all(text, pattern, overlapping)
        try:
            next(match_gen)  # first match
        except StopIteration:
            # the pattern is not found, nothing to do
            pass
        else:
            try:
                next(match_gen)
            except StopIteration:
                # the pattern was found only once so add to results
                result.add(pattern)
            else:
                # the pattern is found twice, nothing to do
                pass

    # just some sanity check
    if check_valid_input and not is_valid(text, tokens):
        raise ValueError('Input text contains invalid tokens.')

    result = set()
    # shortest sequence cannot be longer than this
    if overlapping:
        max_lim = len(text) // 2 + 1
        max_lim = len(tokens)
        for n in range(1, max_lim + 1):
            for pattern_gen in itertools.product(tokens, repeat=2):
                pattern = ''.join(pattern_gen)
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    else:
        max_lim = len(tokens)
        for n in range(1, max_lim + 1):
            for i in range(len(text) - n):
                pattern = text[i:i + n]
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    return result

After some sanity check for the correctness of the outputs:

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'

shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'

funcs = shortest_unique_substr_ovl, shortest_unique_substr_not

test_inputs = (
    'aaa',
    'aaaa',
    'aaggcgccttt',
    'agggcttttaaaatttaatttgggccc',
)

import functools

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:    
        print(func(test_input))
    print()
Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}

Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}

it is wise to benchmark how fast we actually are.

Below you can find some benchmarks, produced using some template code from here (the overlapping variant is in blue ):

benchmark_full benchmark_zoom

and the rest of the code for completeness:

def gen_input(n, tokens='acgt'):
    return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = tuple(2 ** (1 + i) for i in range(16))

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)

As far as the asymptotic time-complexity analysis is concerned, considering only the overlapping case, let N be the input size, let K be the number of tokens (4 in your case), find_all() is O(N), and the body of shortest_unique_substr is O(K²) ( + O((K - 1)²) + O((K - 2)²) + ... ). So, this is overall O(N*K²) or O(N*(Σk²)) (for k = 1, …, K ), since K is fixed, this is O(N) , as the benchmarks seem to indicate.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM