
Finding the Number of Times an Expression Occurs in a String Continuously and Non-Continuously

I had a coding interview over the phone and was asked this question:

Given a String (for example):

"aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc"

and an expression (for example):

"a+b+c-"

where:

+: means the char before it is repeated 2 times

-: means the char before it is repeated 4 times

Find the number of times the given expression appears in the string, with the operands occurring either continuously or non-continuously.

The above expression occurs 4 times:

1) aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc
        ^^       ^^       ^^^^                    
        aa       bb       cccc
2) aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc
        ^^       ^^                               ^^^^
        aa       bb                               cccc

3) aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc
        ^^                                ^^      ^^^^
        aa                                bb      cccc

4) aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc
                                       ^^ ^^      ^^^^
                                       aa bb      cccc

I had no idea how to do it. I started writing an iterative brute-force method with lots of index marking, but halfway through I realized how messy and hard it would be to code:

import java.util.*;

public class Main {

    public static int count(String expression, String input) {
        int count = 0;
        ArrayList<char[]> list = new ArrayList<char[]>();

        // Create an ArrayList of chars to iterate through the expression and match to string
        for(int i = 1; i<expression.length(); i=i+2) {
            StringBuilder exp = new StringBuilder();
            char curr = expression.charAt(i-1);
            if(expression.charAt(i) == '+') {
                exp.append(curr).append(curr);
                list.add(exp.toString().toCharArray());
            }
            else { // character is '-'
                exp.append(curr).append(curr).append(curr).append(curr);
                list.add(exp.toString().toCharArray());
            }
        }

        char[] inputArray = input.toCharArray();
        int i = 0; // outside pointer
        int j = 0; // inside pointer
        while(i <= inputArray.length) {
            while(j <= inputArray.length) {
                for(int k = 0; k< list.size(); k++) {
                    /* loop through 
                     * all possible combinations in array list
                     * with multiple loops
                     */
                }
                j++;
            }
            i++;
            j=i;
        }
        return count;
    }

    public static void main(String[] args) {
        String expression = "a+b+c-";
        String input = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
        System.out.println("The expression occurs: "+count(expression, input)+" times");
    }
}

After I had spent a lot of time on the iterative approach, he mentioned recursion, but I still couldn't see a clear way to do it recursively and I wasn't able to solve the question. I am trying to solve it now, after the interview, and am still not sure how to approach it. How should I go about solving this problem? Is the solution obvious? I thought this was a really hard question for a phone coding interview.

A non-recursive algorithm that requires O(m) space and runs in O(n*m) time, where m is the number of tokens in the query and n is the length of the input:

@Test
public void subsequences() {

    String input = "aabbccaacccccbbd";
    String query = "a+b+";

    // here to store tokens of a query: e.g. {a, +}, {b, +}
    char[][] q = new char[query.length() / 2][];

    // here to store counts of subsequences ending by j-th token found so far
    int[] c =  new int[query.length() / 2];   // main
    int[] cc = new int[query.length() / 2];   // aux        

    // tokenize
    for (int i = 0; i < query.length(); i += 2)
        q[i / 2] = new char[] {query.charAt(i), query.charAt(i + 1)};

    // init
    char[] sub2 = {0, 0};        // accumulator capturing last 2 chars
    char[] sub4 = {0, 0, 0, 0};  // accumulator capturing last 4 chars

    // main loop
    for (int i = 0; i < input.length(); i++) {

        shift(sub2, input.charAt(i));
        shift(sub4, input.charAt(i));

        boolean all2 = sub2[1] != 0 && sub2[0] == sub2[1];  // true if all sub2 chars are same
        boolean all4 = sub4[3] != 0 && sub4[0] == sub4[1]   // true if all sub4 chars are same
              && sub4[0] == sub4[2] && sub4[0] == sub4[3];

        // iterate tokens
        for (int j = 0; j < c.length; j++) {

            if (all2 && q[j][1] == '+' && q[j][0] == sub2[0]) // found match for "+" token
                cc[j] = j == 0             // filling up aux array
                      ? c[j] + 1           // first token, increment counter by 1
                      : c[j] + c[j - 1];   // add value of preceding token counter

            if (all4 && q[j][1] == '-' && q[j][0] == sub4[0]) // found match for "-" token
                cc[j] = j == 0 
                      ? c[j] + 1 
                      : c[j] + c[j - 1];
        }
        if (all2) sub2[1] = 0;  // clear, to make "aa" occur in "aaaa" 2, not 3 times
        if (all4) sub4[3] = 0;
        copy(cc, c);            // copy aux array to main 
    }
    System.out.println(c[c.length - 1]);
}


// shifts array 1 char left and puts c at the end
void shift(char[] cc, char c) {
    for (int i = 1; i < cc.length; i++)
        cc[i - 1] = cc[i];
    cc[cc.length - 1] = c;
}

// copies array contents 
void copy(int[] from, int[] to) {
    for (int i = 0; i < from.length; i++)
        to[i] = from[i];
}

The main idea is to read chars from the input one by one, hold them in 2-char and 4-char accumulators, and check whether either accumulator matches a token of the query, while remembering how many matches we have found so far for each sub-query ending at that token.

The query ( a+b+c- ) is split into tokens ( a+ , b+ , c- ). Then we collect chars in the accumulators and check whether they match any token. If we find a match for the first token, we increment its counter by 1. If we find a match for some other j-th token, we can create as many additional subsequences matching the subquery composed of tokens [0...j] as currently exist for the subquery composed of tokens [0...j-1], because this match can be appended to each of them.

For example, suppose we have:

a+ : 3  (3 matches for a+)
b+ : 2  (2 matches for a+b+)
c- : 1  (1 match for a+b+c-) 

when cccc arrives. Then the c- counter should be increased by the value of the b+ counter, because so far we have 2 a+b+ subsequences and cccc can be appended to both of them.
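To make the update rule concrete, here is a minimal sketch of just that one step, applied to the counters from the example above. The class and method names (CounterUpdate, onTokenMatch) are made up for illustration and are not part of the code above.

```java
import java.util.Arrays;

class CounterUpdate {
    // counts[j] = number of subsequences found so far that match tokens 0..j.
    // Call this when a match for token j has just been completed in the input.
    // Assumption: only one token matches at a time; if the same run can match
    // several tokens (e.g. "a+a+"), update from the highest j down, or use an
    // auxiliary array as the full solution does.
    static void onTokenMatch(long[] counts, int j) {
        if (j == 0) {
            counts[0] += 1;              // a new subsequence starts here
        } else {
            counts[j] += counts[j - 1];  // extend every shorter subsequence
        }
    }

    public static void main(String[] args) {
        long[] counts = {3, 2, 1};       // a+ : 3, b+ : 2, c- : 1
        onTokenMatch(counts, 2);         // "cccc" arrives
        System.out.println(Arrays.toString(counts)); // prints [3, 2, 3]
    }
}
```

The c- counter goes from 1 to 3 because each of the 2 existing a+b+ subsequences can be extended by the new cccc.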

Let's call the length of the string n, and the length of the query expression (in terms of the number of "units", like a+ or b- ) m.

It's not clear exactly what you mean by "continuously" and "non-continuously", but if "continuously" means that there can't be any gaps between the units of the query, then you can just use the KMP algorithm to find all instances in O(m+n) time.
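As a rough sketch of the "continuous" case: expand the expression into a plain string and count where it occurs. For brevity this uses String.indexOf instead of a hand-written KMP, so it does not have the O(m+n) guarantee; the names ContiguousCount, expand and countContiguous are hypothetical, and the expression is assumed to be well-formed (char, operator, char, operator, ...).

```java
class ContiguousCount {
    // Expands "a+b+c-" into "aabbcccc" ('+' repeats twice, '-' four times).
    static String expand(String expression) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < expression.length(); i += 2) {
            char ch = expression.charAt(i);
            int repeat = expression.charAt(i + 1) == '+' ? 2 : 4;
            for (int k = 0; k < repeat; k++) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Counts occurrences of the expanded pattern, overlapping ones included.
    static int countContiguous(String text, String expression) {
        String pattern = expand(expression);
        if (pattern.isEmpty()) {
            return 0; // nothing to search for
        }
        int count = 0;
        int pos = text.indexOf(pattern);
        while (pos >= 0) {
            count++;
            pos = text.indexOf(pattern, pos + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countContiguous("xxaabbccccyyaabbcccc", "a+b+c-")); // prints 2
    }
}
```

On the string from the question this returns 0, since aabbcccc never appears without gaps.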

We can solve the "non-continuous" version in O(nm) time and space with dynamic programming. Basically, what we want to compute is a function:

f(i, j) = the number of occurrences of the subquery consisting of the first i units
          of the query expression, in the first j characters of the string.

So with your example, f(2, 41) = 3, since there are 3 separate occurrences of the subpattern a+b+ in the first 41 characters of your example string.

The final answer will then be f(m, n).

We can compute this recursively as follows:

f(0, j) = 1          (the empty subquery occurs exactly once)
f(i > 0, 0) = 0
f(i > 0, j > 0) = f(i, j-1) + isMatch(i, j) * f(i-1, j-len(i))

where len(i) is the length of the ith unit in the expression (always 2 or 4) and isMatch(i, j) is a function that returns 1 if the ith unit in the expression matches the text ending at position j, and 0 otherwise (including when j < len(i)). For example, isMatch(2, 16) = 1 in your example, because the 2nd unit is b+ and characters 15 and 16 of the string (s[14..15] with 0-based indexing) are bb. This function takes just constant time to run, because it never needs to check more than 4 characters.

The above recursion will already work as-is, but we can save time by making sure that we only solve each subproblem once. Because the function f() depends only on its 2 parameters i and j, which range between 0 and m, and between 0 and n, respectively, we can just compute all O(n*m) possible answers and store them in a table.
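A minimal sketch of the table-filling version, under a few assumptions of mine: the expression is well-formed (char, operator, char, operator, ...), f(0, j) is taken to be 1 (the empty subquery matches once, which the product in the recurrence needs in order to yield non-zero counts), and the names SubsequenceDp, count and isMatch are only illustrative.

```java
import java.util.Arrays;

class SubsequenceDp {
    // f[i][j] = occurrences of the first i units within the first j characters.
    static long count(String text, String expression) {
        int m = expression.length() / 2;
        int n = text.length();
        long[][] f = new long[m + 1][n + 1];
        Arrays.fill(f[0], 1L);                  // f(0, j) = 1; f(i > 0, 0) stays 0
        for (int i = 1; i <= m; i++) {
            char ch = expression.charAt(2 * (i - 1));
            int len = expression.charAt(2 * (i - 1) + 1) == '+' ? 2 : 4;
            for (int j = 1; j <= n; j++) {
                f[i][j] = f[i][j - 1];          // occurrences not ending at j
                if (isMatch(text, ch, len, j)) {
                    f[i][j] += f[i - 1][j - len]; // occurrences ending at j
                }
            }
        }
        return f[m][n];
    }

    // True if the len characters ending at position j (1-based) all equal ch.
    static boolean isMatch(String text, char ch, int len, int j) {
        if (j < len) {
            return false;
        }
        for (int k = j - len; k < j; k++) {
            if (text.charAt(k) != ch) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
        System.out.println(count(s, "a+b+c-")); // prints 4
    }
}
```

Note that this counts every position where a run ends, so aa is found twice in aaa; whether overlapping runs should count separately is not specified in the question.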

[EDIT: As Sasha Salauyou points out, the space requirement can in fact be reduced to O(m). We never need to access values of f(i, k) with k < j-4, because the recurrence only reads columns j-1 and j-len(i), and len(i) is at most 4. So instead of storing all n+1 columns of the table we can store just 5, and cycle through them by always accessing column j % 5.]
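One possible way to realise the reduced-memory idea with the recurrence above is sketched below. It keeps a ring of 5 columns, because f(i, j) reads f(i, j-1) and f(i-1, j-len(i)) with len(i) at most 4, so nothing older than column j-4 is ever needed. The ring size, the class name SubsequenceDpRolling and the convention f(0, j) = 1 are my assumptions, and the expression is assumed to be well-formed.

```java
class SubsequenceDpRolling {
    static long count(String text, String expression) {
        int m = expression.length() / 2;
        int ring = 5;                            // columns j-4 .. j
        long[][] col = new long[ring][m + 1];    // col[j % ring][i] holds f(i, j)
        col[0][0] = 1;                           // column 0: f(0,0)=1, f(i>0,0)=0
        for (int j = 1; j <= text.length(); j++) {
            long[] cur = col[j % ring];
            long[] prev = col[(j - 1) % ring];
            cur[0] = 1;                          // f(0, j) = 1
            for (int i = 1; i <= m; i++) {
                char ch = expression.charAt(2 * (i - 1));
                int len = expression.charAt(2 * (i - 1) + 1) == '+' ? 2 : 4;
                cur[i] = prev[i];
                if (endsWithRun(text, j, ch, len)) {
                    cur[i] += col[(j - len) % ring][i - 1];
                }
            }
        }
        return col[text.length() % ring][m];
    }

    // True if the len characters ending at position j (1-based) all equal ch.
    static boolean endsWithRun(String text, int j, char ch, int len) {
        if (j < len) {
            return false;
        }
        for (int k = j - len; k < j; k++) {
            if (text.charAt(k) != ch) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
        System.out.println(count(s, "a+b+c-")); // prints 4
    }
}
```

Memory use is 5 * (m + 1) counters regardless of the length of the string.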

I wanted to try it for myself and figured I could share my solution as well. The parse method obviously has issues when there is actually a char 0 in the expression (although that would probably be the bigger issue in itself), the find method will fail for an empty needles array, and I wasn't sure whether ab+c- should be considered a valid pattern (I treat it as such). Note that this covers only the non-continuous part so far.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Matcher {

  public static void main(String[] args) {
    String haystack = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
    String[] needles = parse("a+b+c-");
    System.out.println("Needles: " + Arrays.toString(needles));
    System.out.println("Found: " + find(haystack, needles, 0));
    needles = parse("ab+c-");
    System.out.println("Needles: " + Arrays.toString(needles));
    System.out.println("Found: " + find(haystack, needles, 0));
  }

  private static int find(String haystack, String[] needles, int i) {
    String currentNeedle = needles[i];
    int pos = haystack.indexOf(currentNeedle);
    if (pos < 0) {
      // Abort: Current needle not found
      return 0;
    }
    // Current needle found (also means that pos + currentNeedle.length() will always
    // be <= haystack.length()
    String remainingHaystack = haystack.substring(pos + currentNeedle.length());
    // Last needle?
    if (i == needles.length - 1) {
      // +1: We found one match for all needles
      // Try to find more matches of current needle in remaining haystack
      return 1 + find(remainingHaystack, needles, i);
    }
    // Try to find more matches of current needle in remaining haystack
    // Try to find next needle in remaining haystack
    return find(remainingHaystack, needles, i) + find(remainingHaystack, needles, i + 1);
  }

  private static String[] parse(String expression) {
    List<String> searchTokens = new ArrayList<String>();
    char lastChar = 0;
    for (int i = 0; i < expression.length(); i++) {
      char c = expression.charAt(i);
      char[] chars;
      switch (c) {
        case '+':
          // last char is repeated 2 times
          chars = new char[2];
          Arrays.fill(chars, lastChar);
          searchTokens.add(String.valueOf(chars));
          lastChar = 0;
          break;
        case '-':
          // last char is repeated 4 times
          chars = new char[4];
          Arrays.fill(chars, lastChar);
          searchTokens.add(String.valueOf(chars));
          lastChar = 0;
          break;
        default:
          if (lastChar != 0) {
            searchTokens.add(String.valueOf(lastChar));
          }
          lastChar = c;
      }
    }
    return searchTokens.toArray(new String[searchTokens.size()]);
  }
}

Output:

Needles: [aa, bb, cccc]
Found: 4
Needles: [a, bb, cccc]
Found: 18

The recursion could look like this (pseudocode):

int search(String s, String expression) {

     if expression consists of only one token t /* e. g. "a+" */ {
         search for t in s
         return number of occurrences
     } else {
         int result = 0
         divide expression into first token t and rest expression
         // e. g. "a+a+b-" -> t = "a+", rest = "a+b-"
         search for t in s
         for each occurrence {
             s1 = substring of s from the end of the occurrence to the end of s
             result += search(s1, rest) // search for rest of expression in rest of string
         }
         return result
     }
}   

Applying this to the entire string gives you the number of non-continuous occurrences. To get the continuous occurrences you don't need recursion at all: just transform the expression into a string and search for it iteratively.
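A rough Java rendering of the pseudocode, as a sketch only. It assumes a well-formed expression, recurses on the part of the string after each occurrence, and uses the empty expression as the base case (which gives the same result as stopping at a single token). The names RecursiveSearch and expandFirst are made up here.

```java
class RecursiveSearch {
    // Counts non-continuous occurrences of the expression in s.
    static int search(String s, String expression) {
        if (expression.isEmpty()) {
            return 1;                            // all tokens have been placed
        }
        String token = expandFirst(expression);  // e.g. "a+" -> "aa"
        String rest = expression.substring(2);   // e.g. "a+b-" from "a+a+b-"
        int result = 0;
        int pos = s.indexOf(token);
        while (pos >= 0) {
            // Search for the rest of the expression after this occurrence.
            result += search(s.substring(pos + token.length()), rest);
            pos = s.indexOf(token, pos + 1);
        }
        return result;
    }

    // Expands the first unit of the expression: '+' repeats 2 times, '-' 4 times.
    static String expandFirst(String expression) {
        char ch = expression.charAt(0);
        int repeat = expression.charAt(1) == '+' ? 2 : 4;
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < repeat; k++) {
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
        System.out.println(search(s, "a+b+c-")); // prints 4
    }
}
```

This is simple but can take exponential time in the worst case, because the same suffix and sub-expression are searched repeatedly.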

How about preprocessing aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc?

This becomes a1k1s1d1b1a2l1a1s1k1d1h1f1b2l1a1j1d1f1h1a1c4a1o1u1d1g1a1l1s1a2b2l1i1s1d1f1h1c4

Now find occurrences of a2, b2, c4.
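The preprocessing step is a run-length encoding. A minimal sketch is below; the class name RunLength is hypothetical, and unlike the encoded string shown above this version does not drop spaces but encodes them as runs like any other character.

```java
class RunLength {
    // Encodes each maximal run of equal characters as <char><run length>.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            int j = i;
            while (j < s.length() && s.charAt(j) == ch) {
                j++;                       // advance to the end of the run
            }
            out.append(ch).append(j - i);  // e.g. "cccc" -> "c4"
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aabcccc")); // prints a2b1c4
    }
}
```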

I tried it in the code below, but right now it reports only the first possible match, found depth-first.

It needs to be changed to enumerate all possible combinations instead of just the first.

import java.util.ArrayList;
import java.util.List;

public class Parsing {
    public static void main(String[] args) {
        String input = "aksdbaalaskdhfbblajdfhacccc aoudgalsaa bblisdfhcccc";
        System.out.println(input);

        for (int i = 0; i < input.length(); i++) {
            System.out.print(i/10);
        }
        System.out.println();

        for (int i = 0; i < input.length(); i++) {
            System.out.print(i%10);
        }
        System.out.println();

        List<String> tokenisedSearch = parseExp("a+b+c-");
        System.out.println(tokenisedSearch);

        parse(input, 0, tokenisedSearch, 0);
    }

    public static boolean parse(String input, int searchFromIndex, List<String> tokensToSeach, int currentTokenIndex) {
        if(currentTokenIndex >= tokensToSeach.size())
            return true;
        String token = tokensToSeach.get(currentTokenIndex);
        int found = input.indexOf(token, searchFromIndex);
        if(found >= 0) {
            System.out.println("Found at Index "+found+ " Token " +token);
            // continue searching after the end of the token that was just found
            return parse(input, found + token.length(), tokensToSeach, currentTokenIndex+1);
        }
        return false;
    }

    public static List<String> parseExp(String exp) {
        List<String> list = new ArrayList<String>();
        String runningToken = "";
        for (int i = 0; i < exp.length(); i++) {
            char at = exp.charAt(i);
            switch (at) { 
            case '+' :
                runningToken += runningToken;
                list.add(runningToken);
                runningToken = "";
                break;
            case '-' :
                runningToken += runningToken;
                runningToken += runningToken;
                list.add(runningToken);
                runningToken = "";
                break;
            default :
                runningToken += at;
            }
        }
        return list;
    }
}

If you first convert the search string with a simple parser/compiler, so that a+ becomes aa and so on, then you can simply take the resulting string and run a regular expression match against your haystack. (Sorry, I'm no Java coder, so I can't deliver any real code, but it is not really difficult.)
