
Find the number of occurrences of a subsequence in a string

For example, let the string be the first 10 digits of pi, 3141592653, and the subsequence be 123. Note that the sequence occurs twice:

3141592653
 1    2  3
   1  2  3

This was an interview question that I couldn't answer. I can't think of an efficient algorithm, and it's bugging me. I feel like it should be possible with a simple regex, but ones like 1.*2.*3 don't return every subsequence. My naive implementation in Python (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

This is a classical dynamic programming problem (and not typically solved using regular expressions).

My naive implementation (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

That would be an exhaustive search approach which runs in exponential time. (I'm surprised it runs for hours though.)


Here's a suggestion for a dynamic programming solution:

Outline for a recursive solution:

(Apologies for the long description, but each step is really simple so bear with me ;-)

  • If the subsequence is empty a match is found (no digits left to match!) and we return 1

  • If the input sequence is empty we've depleted our digits and can't possibly find a match thus we return 0

  • (Neither the sequence nor the subsequence are empty.)

  • (Assume that "abcdef" denotes the input sequence, and "xyz" denotes the subsequence.)

  • Set result to 0

  • Add to the result the number of matches for bcdef and xyz (i.e., discard the first input digit and recurse)

  • If the first two digits match, i.e., a = x

    • Add to the result the number of matches for bcdef and yz (i.e., match the first subsequence digit and recurse on the remaining subsequence digits)

  • Return result


Example

Here's an illustration of the recursive calls for input 1221 / 12. (Subsequence in bold font, · represents empty string.)

[image: recursive calls for 1221 / 12]


Dynamic programming

If implemented naively, some (sub-)problems are solved multiple times (· / 2 for instance in the illustration above). Dynamic programming avoids such redundant computations by remembering the results from previously solved subproblems (usually in a lookup table).

In this particular case we set up a table with

  • [length of sequence + 1] rows, and
  • [length of subsequence + 1] columns:

[image: empty table]

The idea is that we should fill in the number of matches for, say, 221 / 2 in the corresponding row / column. Once done, we should have the final solution in cell 1221 / 12.

We start populating the table with what we know immediately (the "base cases"):

  • When no subsequence digits are left, we have 1 complete match:

    [image: table with the first base case filled in]

  • When no sequence digits are left, we can't have any matches:

    [image: table with both base cases filled in]

We then proceed by populating the table top-down / left-to-right according to the following rule:

  • In cell [row][col] write the value found at [row-1][col].

    Intuitively this means "The number of matches for 221 / 2 includes all the matches for 21 / 2."

  • If the sequence at row row and the subsequence at column col start with the same digit, add the value found at [row-1][col-1] to the value just written to [row][col].

    Intuitively this means "The number of matches for 1221 / 12 also includes all the matches for 221 / 2."

[image: table being filled in]

The final result looks as follows:

[image: completed table]

and the value at the bottom right cell is indeed 2.
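
The fill rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the answer's own code: the name build_table is hypothetical, and it assumes that row r stands for the last r digits of the sequence and column c for the last c digits of the subsequence.

```python
def build_table(seq, sub):
    """Return the filled table; the bottom right cell holds the answer.

    Assumed layout: table[r][c] is the number of times the last c
    characters of sub occur as a subsequence of the last r characters
    of seq.
    """
    m, n = len(seq), len(sub)
    table = [[0] * (n + 1) for _ in range(m + 1)]

    # Base case: no subsequence digits left means one complete match.
    for r in range(m + 1):
        table[r][0] = 1
    # Base case: no sequence digits left means no match; those cells
    # (row 0, columns 1..n) are already 0.

    for r in range(1, m + 1):
        for c in range(1, n + 1):
            # Rule 1: copy the value from the row above.
            table[r][c] = table[r - 1][c]
            # Rule 2: if both suffixes start with the same digit,
            # add the value from the cell above and to the left.
            if seq[m - r] == sub[n - c]:
                table[r][c] += table[r - 1][c - 1]
    return table


for row in build_table("1221", "12"):
    print(row)
```

For 1221 / 12 the rows come out as [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 2, 0] and [1, 2, 2], so the bottom right cell is 2.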


In Code

Not in Python (my apologies).

class SubseqCounter {

    String seq, subseq;
    int[][] tbl;

    public SubseqCounter(String seq, String subseq) {
        this.seq = seq;
        this.subseq = subseq;
    }

    public int countMatches() {
        tbl = new int[seq.length() + 1][subseq.length() + 1];

        for (int row = 0; row < tbl.length; row++)
            for (int col = 0; col < tbl[row].length; col++)
                tbl[row][col] = countMatchesFor(row, col);

        return tbl[seq.length()][subseq.length()];
    }

    private int countMatchesFor(int seqDigitsLeft, int subseqDigitsLeft) {
        if (subseqDigitsLeft == 0)
            return 1;

        if (seqDigitsLeft == 0)
            return 0;

        char currSeqDigit = seq.charAt(seq.length()-seqDigitsLeft);
        char currSubseqDigit = subseq.charAt(subseq.length()-subseqDigitsLeft);

        int result = 0;

        if (currSeqDigit == currSubseqDigit)
            result += tbl[seqDigitsLeft - 1][subseqDigitsLeft - 1];

        result += tbl[seqDigitsLeft - 1][subseqDigitsLeft];

        return result;
    }
}

Complexity

A bonus for this "fill-in-the-table" approach is that it is trivial to figure out complexity. A constant amount of work is done for each cell, and we have length-of-sequence rows and length-of-subsequence columns. Complexity is therefore O(MN) where M and N denote the lengths of the sequences.

Great answer, aioobe! To complement your answer, some possible implementations in Python:

1) straightforward, naïve solution; too slow!

def num_subsequences(seq, sub):
    if not sub:
        return 1
    elif not seq:
        return 0
    result = num_subsequences(seq[1:], sub)
    if seq[0] == sub[0]:
        result += num_subsequences(seq[1:], sub[1:])
    return result

2) top-down solution using explicit memoization

def num_subsequences(seq, sub):
    m, n, cache = len(seq), len(sub), {}
    def count(i, j):
        if j == n:
            return 1
        elif i == m:
            return 0
        k = (i, j)
        if k not in cache:
            cache[k] = count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)
        return cache[k]
    return count(0, 0)

3) top-down solution using the lru_cache decorator (available from functools in python >= 3.2)

from functools import lru_cache

def num_subsequences(seq, sub):
    m, n = len(seq), len(sub)
    @lru_cache(maxsize=None)
    def count(i, j):
        if j == n:
            return 1
        elif i == m:
            return 0
        return count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)
    return count(0, 0)

4) bottom-up, dynamic programming solution using a lookup table

def num_subsequences(seq, sub):
    m, n = len(seq)+1, len(sub)+1
    table = [[0]*n for i in range(m)]
    def count(iseq, isub):
        if not isub:
            return 1
        elif not iseq:
            return 0
        return (table[iseq-1][isub] +
               (table[iseq-1][isub-1] if seq[m-iseq-1] == sub[n-isub-1] else 0))
    for row in range(m):
        for col in range(n):
            table[row][col] = count(row, col)
    return table[m-1][n-1]

5) bottom-up, dynamic programming solution using a single array

def num_subsequences(seq, sub):
    m, n = len(seq), len(sub)
    table = [0] * n
    for i in range(m):
        previous = 1
        for j in range(n):
            current = table[j]
            if seq[i] == sub[j]:
                table[j] += previous
            previous = current
    return table[n-1] if n else 1

One way to do it would be with two lists. Call them Ones and OneTwos.

Go through the string, character by character.

  • Whenever you see the digit 1, make an entry in the Ones list.
  • Whenever you see the digit 2, go through the Ones list and add an entry to the OneTwos list for each entry in Ones.
  • Whenever you see the digit 3, go through the OneTwos list and output a 123 for each entry.

In the general case that algorithm will be very fast, since it's a single pass through the string and multiple passes through what will normally be much smaller lists. Pathological cases will kill it, though. Imagine a string like 111111222222333333, but with each digit repeated hundreds of times.
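
A rough Python sketch of the two-list idea follows. The function name list_123_matches is hypothetical, and storing index positions as the list entries is an assumption (the description above only says "an entry"); the pattern is hard-wired to 123 as in the description.

```python
def list_123_matches(text):
    """Return every (pos of 1, pos of 2, pos of 3) triple found in text."""
    ones = []      # positions of each '1' seen so far
    one_twos = []  # (pos of 1, pos of 2) pairs seen so far
    matches = []   # complete (pos of 1, pos of 2, pos of 3) triples

    for pos, ch in enumerate(text):
        if ch == "1":
            ones.append(pos)
        elif ch == "2":
            # Every earlier 1 can be paired with this 2.
            one_twos.extend((p1, pos) for p1 in ones)
        elif ch == "3":
            # Every earlier 1..2 pair is completed by this 3.
            matches.extend((p1, p2, pos) for p1, p2 in one_twos)
    return matches


print(list_123_matches("3141592653"))
```

With k copies each of 1, 2 and 3 in a row, one_twos grows to k*k entries and matches to k*k*k, which is the pathological case mentioned above.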

from functools import lru_cache

def subseqsearch(string,substr):
    substrset=set(substr)
    # fixs keeps only the elements of string that occur in substr
    fixs = [i for i in string if i in substrset]
    @lru_cache(maxsize=None)  # memoization decorator applied to recs()
    def recs(fi=0,si=0):
        if si >= len(substr):
            return 1
        r=0
        for i in range(fi,len(fixs)):
            if substr[si] == fixs[i]:
                r+=recs(i+1,si+1)
        return r
    return recs()

#test
from functools import reduce
def flat(i) : return reduce(lambda x,y:x+y,i,[])
N=5
string = flat([[i for j in range(10) ] for i in range(N)])
substr = flat([[i for j in range(5) ] for i in range(N)]) 
print("string:","".join(str(i) for i in string),"substr:","".join(str(i) for i in substr),sep="\n")
print("result:",subseqsearch(string,substr))

output (instantly):

string:
00000000001111111111222222222233333333334444444444
substr:
0000011111222223333344444
result: 1016255020032

I have an interesting O(N) time and O(M) space solution for this problem, N being the length of the text and M being the length of the pattern to be searched for. I will explain the algorithm, which I implemented in C++.

Let's suppose the input is 3141592653, as you provided, and the pattern whose occurrences we want to count is 123. I begin with a hash map that maps each character to its position in the pattern. I also take an array of size M, initialised to 0.

    string txt,pat;
    cin >> txt >> pat;
    int n = txt.size(),m = pat.size();
    int arr[m];
    map<char,int> mp;
    map<char,int> ::iterator it;
    f(i,0,m)
    {
        mp[pat[i]] = i;
        arr[i] = 0;
    }

I scan the text from the back and check whether each element is in the pattern. If it is, I have to do something.

Suppose that, scanning from the back, I find a 2 without having previously found any 3. This 2 is of no value to us, because any 1 found after it (that is, earlier in the text) will at most form the sequence 12, and 123 will never be formed. If, on the other hand, I find a 2 after having found x 3's, it will form 123 sequences only with those 3's, so it will form x such sequences (provided the part of the pattern before 2 is found later). So the complete algorithm is: whenever I find an element that is present in the pattern, I look up its position j in the pattern (stored in the hash map) and increment

 arr[j] += arr[j+1];

signifying that it contributes to all the sequences that start with the next pattern character found before it. If the j found is m-1, I simply increment it:

 arr[j] += 1; 

The code snippet below does this:

    for(int i = (n-1);i > -1;i--)
    {
        char ch = txt[i];
        if(mp.find(ch) != mp.end())
        {
            int j = mp[ch];
            if(j == (m-1))
                arr[j]++;
            else if(j < (m-1))
                arr[j] += arr[j+1];
            else
                {;}
        }
    }

Now consider the fact that each index i in the array stores the number of times the pattern suffix S[i,(m-1)] appears as a subsequence of the input string. So finally print the value of arr[0]:

    cout << arr[0] << endl;

Code with output (unique chars in pattern): http://ideone.com/UWaJQF

Code with output (repeated chars allowed): http://ideone.com/14DZh7

The approach above works only if the pattern has unique elements. If the pattern has repeated elements, the complexity may shoot up to O(MN). The solution is similar and still does not use DP: where before, when an element occurring in the pattern appeared, we updated only the single array position j corresponding to it, we now have to update every position at which that character occurs in the pattern, which leads to a complexity of O(N * maximum frequency of a character).

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

#define f(i,x,y) for(long long i = (x);i < (y);++i)

int main()
{
    long long T;
    cin >> T;
    while(T--)
    {
        string txt,pat;
        cin >> txt >> pat;
        long long n = txt.size(),m = pat.size();
        vector<long long> arr(m, 0);
        // Map each pattern character to every position where it occurs,
        // in ascending order.
        map<char,vector<long long> > mp;
        f(i,0,m)
            mp[pat[i]].push_back(i);

        // Scan the text from the back.
        for(long long i = (n-1);i > -1;i--)
        {
            char ch = txt[i];
            if(mp.find(ch) != mp.end())
            {
                f(k,0,mp[ch].size())
                {
                    long long j = mp[ch][k];
                    if(j == (m-1))
                        arr[j]++;
                    else
                        arr[j] += arr[j+1];
                }
            }
        }
        cout << arr[0] << endl;
    }
}

It can be extended in a similar way, without DP, to patterns with repeated characters, but then the complexity rises to O(MN).
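
For readers following the thread in Python, here is a minimal sketch of the same back-to-front idea, including repeated pattern characters. The name count_from_back is hypothetical, and returning 1 for an empty pattern is an assumption that the C++ code does not handle.

```python
from collections import defaultdict


def count_from_back(text, pattern):
    """Count occurrences of pattern as a subsequence of text."""
    m = len(pattern)
    if m == 0:
        return 1  # assumption: the empty pattern matches exactly once

    # Map each pattern character to all of its positions, ascending.
    positions = defaultdict(list)
    for j, ch in enumerate(pattern):
        positions[ch].append(j)

    # arr[j] counts how often pattern[j:] occurs in the text scanned so far.
    arr = [0] * m
    for ch in reversed(text):
        # Ascending order matters: arr[j] must be updated with the value
        # arr[j + 1] had before this text character was processed.
        for j in positions.get(ch, ()):
            if j == m - 1:
                arr[j] += 1
            else:
                arr[j] += arr[j + 1]
    return arr[0]


print(count_from_back("3141592653", "123"))
```

Each text character costs one update per occurrence of that character in the pattern, which matches the O(N * maximum frequency of a character) estimate given above.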

A JavaScript answer based on the dynamic programming solution from geeksforgeeks.org and the answer from aioobe:

class SubseqCounter {
    constructor(subseq, seq) {
        this.seq = seq;
        this.subseq = subseq;
        this.tbl = Array(subseq.length + 1).fill().map(a => Array(seq.length + 1));
        for (var i = 1; i <= subseq.length; i++)
          this.tbl[i][0] = 0;
        for (var j = 0; j <= seq.length; j++)
          this.tbl[0][j] = 1;
    }
    countMatches() {
        for (var row = 1; row < this.tbl.length; row++)
            for (var col = 1; col < this.tbl[row].length; col++)
                 this.tbl[row][col] = this.countMatchesFor(row, col);

        return this.tbl[this.subseq.length][this.seq.length];
    }
    countMatchesFor(subseqDigitsLeft, seqDigitsLeft) {
        if (this.subseq.charAt(subseqDigitsLeft - 1) != this.seq.charAt(seqDigitsLeft - 1))
            return this.tbl[subseqDigitsLeft][seqDigitsLeft - 1];
        else
            return this.tbl[subseqDigitsLeft][seqDigitsLeft - 1] + this.tbl[subseqDigitsLeft - 1][seqDigitsLeft - 1];
    }
}

My quick attempt:

def count_subseqs(string, subseq):
    string = [c for c in string if c in subseq]
    count = i = 0
    for c in string:
        if c == subseq[0]:
            pos = 1
            for c2 in string[i+1:]:
                if c2 == subseq[pos]:
                    pos += 1
                    if pos == len(subseq):
                        count += 1
                        break
        i += 1
    return count

print(count_subseqs(string='3141592653', subseq='123'))

Edit: This one should also be correct for cases like 1223 (which should give 2) and more complicated ones:

from functools import reduce  # a builtin on Python 2, must be imported on Python 3

def count_subseqs(string, subseq):
    string = [c for c in string if c in subseq]
    i = 0
    seqs = []
    for c in string:
        if c == subseq[0]:
            pos = 1
            seq = [1]
            for c2 in string[i + 1:]:
                if pos > len(subseq):
                    break
                if pos < len(subseq) and c2 == subseq[pos]:
                    try:
                        seq[pos] += 1
                    except IndexError:
                        seq.append(1)
                        pos += 1
                elif pos > 1 and c2 == subseq[pos - 1]:
                    seq[pos - 1] += 1
            if len(seq) == len(subseq):
                seqs.append(seq)
        i += 1
    return sum(reduce(lambda x, y: x * y, seq) for seq in seqs)

assert count_subseqs(string='12', subseq='123') == 0
assert count_subseqs(string='1002', subseq='123') == 0
assert count_subseqs(string='0123', subseq='123') == 1
assert count_subseqs(string='0123', subseq='1230') == 0
assert count_subseqs(string='1223', subseq='123') == 2
assert count_subseqs(string='12223', subseq='123') == 3
assert count_subseqs(string='121323', subseq='123') == 3
assert count_subseqs(string='12233', subseq='123') == 4
assert count_subseqs(string='0123134', subseq='1234') == 2
assert count_subseqs(string='1221323', subseq='123') == 5

How to count all three-member sequences 1..2..3 in the array of digits.

Quickly and simply

Notice, we need not FIND all sequences, we need only COUNT them. So, all algorithms that search for sequences are excessively complex.

  1. Throw away every digit that is not 1, 2 or 3. The result will be char array A.
  2. Make a parallel int array B of 0's. Running through A from the end, count for each 2 in A the number of 3's in A after it. Put these numbers into the corresponding elements of B.
  3. Make a parallel int array C of 0's. Running through A from the end, compute for each 1 in A the sum of B after its position. Put the result into the corresponding place in C.
  4. Compute the sum of C.

That is all. The complexity is O(N). Really, for a normal line of digits, it will take about twice the time of the shortening of the source line.

If the sequence is longer, say of M members, the procedure can be repeated M times. The complexity will then be O(MN), where N is already the length of the shortened source string.
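
The four steps for the 1..2..3 case might look like this in Python. This is a sketch under the assumption that the input is a string of digit characters; the function name count_123 and the running-total variables are illustrative choices, not part of the description above.

```python
def count_123(digits):
    """Count 1..2..3 subsequences using the parallel arrays A, B and C."""
    # Step 1: keep only the digits 1, 2 and 3.
    a = [d for d in digits if d in "123"]

    # Step 2: for each 2, the number of 3's after it.
    b = [0] * len(a)
    threes_after = 0
    for i in range(len(a) - 1, -1, -1):
        if a[i] == "3":
            threes_after += 1
        elif a[i] == "2":
            b[i] = threes_after

    # Step 3: for each 1, the sum of B after its position.
    c = [0] * len(a)
    b_sum_after = 0
    for i in range(len(a) - 1, -1, -1):
        b_sum_after += b[i]  # b[i] is 0 unless a[i] is a 2
        if a[i] == "1":
            c[i] = b_sum_after

    # Step 4: the total is the sum of C.
    return sum(c)


print(count_123("3141592653"))
```

Keeping a running total while walking backwards is what makes each pass linear; recomputing the sum of B for every 1 would make the step quadratic.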

psh. O(n) solutions are way better.

Think of it by building a tree:

Iterate along the string. If the character is '1', add a node to the root of the tree. If the character is '2', add a child to each first-level node. If the character is '3', add a child to each second-level node.

Return the number of third-layer nodes.

This would be space inefficient, so why don't we just store the number of nodes at each depth:

infile >> in;
long results[3] = {0};
for(int i = 0; i < in.length(); ++i) {
    switch(in[i]) {
        case '1':
        results[0]++;
        break;
        case '2':
        results[1]+=results[0];
        break;
        case '3':
        results[2]+=results[1];
        break;
        default:;
    }
}

cout << results[2] << endl;
