Finding Sub-Strings of String Containing all the words in array

I have a String and an array of words, and I have to write code to find all substrings of the string that contain all the words in the array, in any order. The string does not contain any special characters or digits, and the words are separated by single spaces.

For example:

String given:

aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb aaaa bbbb cccc

Words in array:

aaaa
bbbb
cccc

Sample of output:

aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb    

aaaa aaaa aaaa aaaa cccc bbbb    

aaaa cccc bbbb bbbb bbbb bbbb    

cccc bbbb bbbb bbbb bbbb aaaa  

aaaa cccc bbbb

I have implemented this using for loops, but it is very inefficient.

How can I do this more efficiently?

My code:

    for(int i=0;i<str_arr.length;i++)
    {
        if( (str_arr.length - i) >= words.length)
        {
            String res = check(i);
            if(!res.equals(""))
            {
                System.out.println(res);
                System.out.println("");
            }
            reset_all();
        }
        else
        {
            break;
        }
    }

public static String check(int i)
{
    String res = "";
    num_words = 0;

    for(int j=i;j<str_arr.length;j++)
    {
        if(has_word(str_arr[j]))
        {
            t.put(str_arr[j].toLowerCase(), 1);
            h.put(str_arr[j].toLowerCase(), 1);

            res = res + str_arr[j]; //+ " ";

            if(all_complete())
            {
                return res;
            }

            res = res + " ";
        }
        else
        {
            res = res + str_arr[j] + " ";
        }

    }
    res = "";
    return res;
}

My first approach would be something like the following pseudo-code:

  for word in string {
    if word in array {
      for each stored potential substring {
        if word wasn't already found for it {
          remove word from its notAlreadyFoundList
          if its notAlreadyFoundList is empty {
            use starting pos and ending pos to save our substring
          }
        }
      }
      store position and array-word as a new potential substring
    }
  }

This should have decent performance since you traverse the string only once, although each matching word is still checked against every stored potential substring.

[EDIT]

This is an implementation of my pseudo-code; try it out and see whether it performs better or worse. It works under the assumption that a matching substring is complete as soon as its last missing word is found. If you truly want all matches, change the lines marked //ALLMATCHES:

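import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class SubStringFinder {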
    String textString = "aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb aaaa bbbb cccc";
    Set<String> words = new HashSet<String>(Arrays.asList("aaaa", "bbbb", "cccc"));

    public static void main(String[] args) {
        new SubStringFinder();
    }

    public SubStringFinder() {
        List<PotentialMatch> matches = new ArrayList<PotentialMatch>();
        for (String textPart : textString.split(" ")) {
            if (words.contains(textPart)) {
                for (Iterator<PotentialMatch> matchIterator = matches.iterator(); matchIterator.hasNext();) {
                    PotentialMatch match = matchIterator.next();
                    String result = match.tryMatch(textPart);
                    if (result != null) {
                        System.out.println("Match found: \"" + result + "\"");
                        matchIterator.remove(); //ALLMATCHES - remove this line
                    }
                }
                Set<String> unfound = new HashSet<String>(words);
                unfound.remove(textPart);
                matches.add(new PotentialMatch(unfound, textPart));
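            } else {
                // A word outside the array still belongs to every open candidate substring.
                for (PotentialMatch match : matches) {
                    match.stringPart.append(' ').append(textPart);
                }
                // ALLMATCHES - also add this line:
                // matches.add(new PotentialMatch(new HashSet<String>(words), textPart));
            }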
        }
    }

    class PotentialMatch {
        Set<String> unfoundWords;
        StringBuilder stringPart;
        public PotentialMatch(Set<String> unfoundWords, String part) {
            this.unfoundWords = unfoundWords;
            this.stringPart = new StringBuilder(part);
        }
        public String tryMatch(String part) {
            this.stringPart.append(' ').append(part);
            unfoundWords.remove(part);                
            if (unfoundWords.isEmpty()) {
                return this.stringPart.toString();
            }
            return null;
        }
    }
}
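
For comparison, here is a minimal sketch of a two-pointer (sliding window) variant. It is not the code from the answer above: it swaps the list of potential matches for a single window plus a count map, and for each starting word it reports the shortest substring that contains every array word. The names WindowFinder and shortestMatches are made up for illustration, and it assumes words are separated by single spaces, as in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WindowFinder {
    // Hypothetical helper: for every start token, returns the shortest run of
    // tokens beginning there that contains every search word at least once.
    static List<String> shortestMatches(String text, Set<String> words) {
        List<String> result = new ArrayList<>();
        if (words.isEmpty()) {
            return result; // nothing to search for
        }
        String[] tokens = text.split(" ");
        // How often each search word occurs inside tokens[start, end).
        Map<String, Integer> counts = new HashMap<>();
        int end = 0; // exclusive end of the current window
        for (int start = 0; start < tokens.length; start++) {
            // Grow the window to the right until every search word is covered.
            while (counts.size() < words.size() && end < tokens.length) {
                if (words.contains(tokens[end])) {
                    counts.merge(tokens[end], 1, Integer::sum);
                }
                end++;
            }
            if (counts.size() < words.size()) {
                break; // the rest of the text cannot cover all words
            }
            result.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            // Drop tokens[start] from the window before moving on.
            if (words.contains(tokens[start])) {
                counts.computeIfPresent(tokens[start], (k, v) -> v == 1 ? null : v - 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb aaaa bbbb cccc";
        Set<String> words = new HashSet<>(Arrays.asList("aaaa", "bbbb", "cccc"));
        for (String match : shortestMatches(text, words)) {
            System.out.println(match);
        }
    }
}
```

On the sample string this prints ten substrings, from "aaaa aaaa aaaa aaaa cccc bbbb" to "aaaa bbbb cccc". Both window ends only ever move forward, so the scan itself is linear in the number of words. Any reported substring extended to the right is also a match, so the longer matches can be derived from these without rescanning.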

Here is another approach:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is only there so the snippet compiles on its own.
class RegexSubStringFinder {
    public static void main(String[] args) {
        // init
        List<String> result = new ArrayList<String>();
        String string = "aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb aaaa bbbb cccc";
        String[] words = { "aaaa", "bbbb", "cccc" };
        // find all combs as regexps, one per ordering of the words
        // (e.g. "(aaaa ?)+(bbbb ?)+(cccc )*cccc", "(aaaa ?)+(cccc ?)+(bbbb )*bbbb"; quoting omitted here)
        List<String> regexps = findCombs(Arrays.asList(words));
        // compile each pattern and collect its matches
        for (String regexp : regexps) {
            Pattern p = Pattern.compile(regexp);
            Matcher m = p.matcher(string);
            while (m.find()) {
                result.add(m.group());
            }
        }
        System.out.println(result);
    }

    private static List<String> findCombs(List<String> words) {
        if (words.size() == 1) {
            // base case: the last word may repeat, but the match must end on it
            String quoted = Pattern.quote(words.get(0));
            List<String> single = new ArrayList<String>();
            single.add("(" + quoted + " )*" + quoted);
            return single;
        }
        List<String> list = new ArrayList<String>();
        for (String word : words) {
            // put this word first, then recurse on the remaining words
            List<String> tail = new LinkedList<String>(words);
            tail.remove(word);
            for (String s : findCombs(tail)) {
                list.add("(" + Pattern.quote(word) + " ?)+" + s);
            }
        }
        return list;
    }
}

This will output:

[aaaa bbbb cccc, aaaa aaaa aaaa aaaa cccc bbbb bbbb bbbb bbbb, cccc bbbb bbbb bbbb bbbb aaaa]

I know the result is not complete: you only get the available combinations, each fully extended, but you do get all of them.
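
To make the regex approach more concrete, it may help to look at the patterns it generates. The sketch below restates the recursion from findCombs in a self-contained form (the class name CombPatterns is made up for illustration) and prints the patterns for a two-word array. There is one pattern per ordering of the words, so an array of n words produces n! patterns, which is worth keeping in mind for larger arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CombPatterns {
    // Builds one regex per ordering of the given words.
    static List<String> findCombs(List<String> words) {
        List<String> patterns = new ArrayList<>();
        if (words.size() == 1) {
            // Last word of an ordering: may repeat, but the match ends on it.
            String quoted = Pattern.quote(words.get(0));
            patterns.add("(" + quoted + " )*" + quoted);
            return patterns;
        }
        for (String word : words) {
            // Put this word first and recurse on the remaining words.
            List<String> tail = new ArrayList<>(words);
            tail.remove(word);
            for (String rest : findCombs(tail)) {
                patterns.add("(" + Pattern.quote(word) + " ?)+" + rest);
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Prints:
        // (\Qaaaa\E ?)+(\Qbbbb\E )*\Qbbbb\E
        // (\Qbbbb\E ?)+(\Qaaaa\E )*\Qaaaa\E
        for (String pattern : findCombs(Arrays.asList("aaaa", "bbbb"))) {
            System.out.println(pattern);
        }
        // Four words already give 4! = 24 patterns.
        System.out.println(findCombs(Arrays.asList("a", "b", "c", "d")).size());
    }
}
```

The \Q...\E wrappers come from Pattern.quote, which makes each word match literally even if it contains regex metacharacters.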
