
Better way to detect if a string contains multiple words

I am trying to create a program that detects whether multiple words are in a string as fast as possible and, if so, executes a behavior. Preferably, I would like it to detect the order of these words too, but only if this can be done fast. So far, this is what I have done:

if (input.contains("adsf") && input.contains("qwer")) {
    execute();          
}

As you can see, doing this for multiple words would become tiresome. Is this the only way, or is there a better way of detecting multiple substrings? And is there any way of detecting order?

I'd create a regular expression from the words:

Pattern pattern = Pattern.compile("(?=.*adsf)(?=.*qwer)");
if (pattern.matcher(input).find()) {
    execute();
}

For more details, see this answer: https://stackoverflow.com/a/470602/660143
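When the word list is not fixed at compile time, the same lookahead pattern can be assembled in a loop. The following is only a sketch: the class name `WordRegex` and the helper `allOf` are hypothetical names chosen for illustration, and it assumes the words should be matched literally (hence `Pattern.quote`) and that the input may span several lines (hence `Pattern.DOTALL`).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WordRegex {
    // Builds one lookahead per word, e.g. (?=.*\Qadsf\E)(?=.*\Qqwer\E),
    // so the pattern matches only when every word occurs somewhere in the input.
    static Pattern allOf(List<String> words) {
        StringBuilder regex = new StringBuilder();
        for (String word : words) {
            // Pattern.quote makes regex metacharacters in the word literal.
            regex.append("(?=.*").append(Pattern.quote(word)).append(")");
        }
        // DOTALL lets '.' cross line breaks, so multi-line input is handled.
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern pattern = allOf(Arrays.asList("adsf", "qwer"));
        System.out.println(pattern.matcher("xx qwer yy adsf").find()); // true
        System.out.println(pattern.matcher("only adsf here").find());  // false
    }
}
```

Compiling the pattern once and reusing it for many inputs avoids paying the compilation cost on every check.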

Editor's note: Despite being heavily upvoted and accepted, the array-based answer below does not function the same as the code in the question: execute is called on the first match, like a logical OR.

You could use an array:

String[] matches = new String[] {"adsf", "qwer"};

for (String s : matches) {
    // Runs execute() as soon as any one of the words is found
    if (input.contains(s)) {
        execute();
        break;
    }
}

This is as efficient as the code you posted, but more maintainable. Looking for a more efficient solution sounds like a micro-optimization that should be ignored until it is proven to be a real bottleneck in your code. In any case, with a huge set of strings, the solution could be a trie.


In Java 8 you could do:

public static boolean containsWords(String input, String[] words) {
    return Arrays.stream(words).allMatch(input::contains);
}

If you have a lot of substrings to look up, then a regular expression probably isn't going to be much help, so you're better off putting the substrings in a list, then iterating over them and calling input.indexOf(substring) on each one. This returns the int index at which the substring was first found. If you put each result (except -1, which means that the substring wasn't found) into a TreeMap (where the index is the key and the substring is the value), then you can retrieve them in order by calling keySet() on the map.

Map<Integer, String> substringIndices = new TreeMap<Integer, String>();
List<String> substrings = new ArrayList<String>();
substrings.add("asdf");
// etc.

for (String substring : substrings) {
  int index = input.indexOf(substring);

  if (index != -1) {
    substringIndices.put(index, substring);
  }
}

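// A TreeMap iterates its keys in ascending order, so the substrings
// are printed in the order in which they first appear in the input.
for (Integer index : substringIndices.keySet()) {
  System.out.println(substringIndices.get(index));
}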
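If the goal is only to confirm that the words appear in one particular order, a TreeMap is not strictly needed. As a rough sketch (the class `OrderedWords` and the helper `containsInOrder` are hypothetical names, and "in order" is assumed to mean that each word starts at or after the end of the previous word's match), each search can simply start where the previous match ended, using the two-argument `String.indexOf(String, int)`:

```java
class OrderedWords {
    // Returns true if every word occurs in the input, each one starting
    // at or after the end of the previous word's match.
    static boolean containsInOrder(String input, String... words) {
        int from = 0;
        for (String word : words) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return false; // word missing, or only present before 'from'
            }
            from = index + word.length();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsInOrder("adsf then qwer", "adsf", "qwer")); // true
        System.out.println(containsInOrder("qwer then adsf", "adsf", "qwer")); // false
    }
}
```

Taking the earliest possible match for each word leaves the most room for the words that follow, so this greedy scan does not miss a valid ordering.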

Use a tree structure to hold the substrings per codepoint. This eliminates the need to

Note that this is efficient only if the needle set is almost constant. Individual additions or removals of substrings are not inefficient, but rebuilding the tree from a large, different set of strings on every search would definitely slow it down.

StringSearcher.java:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

class StringSearcher{
    private NeedleTree needles = new NeedleTree(-1);
    private boolean caseSensitive;
    private List<Integer> lengths = new ArrayList<>();
    private int maxLength;

    public StringSearcher(List<String> inputs, boolean caseSensitive){
        this.caseSensitive = caseSensitive;
        for(String input : inputs){
            if(!lengths.contains(input.length())){
                lengths.add(input.length());
            }
            NeedleTree tree = needles;
            for(int i = 0; i < input.length(); i++){
                tree = tree.child(caseSensitive ? input.codePointAt(i) : Character.toLowerCase(input.codePointAt(i)));
            }
            tree.markSelfSet();
        }
        maxLength = Collections.max(lengths);
    }

    public boolean matches(String haystack){
        if(!caseSensitive){
            haystack = haystack.toLowerCase();
        }
        for(int i = 0; i < haystack.length(); i++){
            // Clamp the window so it never runs past the end of the haystack
            String substring = haystack.substring(i, Math.min(haystack.length(), i + maxLength));
            NeedleTree tree = needles;
            for(int j = 0; j < substring.length(); j++){
                tree = tree.childOrNull(substring.codePointAt(j));
                if(tree == null){
                    break;
                }
                if(tree.isSelfSet()){
                    return true;
                }
            }
        }
        return false;
    }
}

NeedleTree.java:

import java.util.HashMap;
import java.util.Map;

class NeedleTree{
    private int codePoint;
    private boolean selfSet;
    private Map<Integer, NeedleTree> children = new HashMap<>();

    public NeedleTree(int codePoint){
        this.codePoint = codePoint;
    }

    public NeedleTree childOrNull(int codePoint){
        return children.get(codePoint);
    }

    public NeedleTree child(int codePoint){
        NeedleTree child = children.get(codePoint);
        if(child == null){
            // Map.put returns the previous value (null here), so keep our own reference
            child = new NeedleTree(codePoint);
            children.put(codePoint, child);
        }
        return child;
    }

    public boolean isSelfSet(){
        return selfSet;
    }

    public void markSelfSet(){
        selfSet = true;
    }
}

This is a classical interview and CS problem.

The Rabin–Karp algorithm is usually what people first talk about in interviews. The basic idea is that as you go through the string, you add the current character to a rolling hash. If the hash matches the hash of one of your match strings, you know that you might have a match, which you then confirm by comparing the actual characters. This avoids having to scan back and forth into your match strings. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
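To make the rolling-hash idea concrete, here is a minimal sketch rather than a tuned implementation. It assumes that all match strings have the same non-zero length, which is the simplest multi-pattern setting for this algorithm. The class name `RabinKarpSketch`, the method `findAny`, and the constants `BASE` and `MOD` are illustrative choices, not taken from the answer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RabinKarpSketch {
    private static final long BASE = 257;            // assumed hash base
    private static final long MOD = 1_000_000_007L;  // assumed prime modulus

    // Polynomial hash of s[start, start + len).
    static long hash(String s, int start, int len) {
        long h = 0;
        for (int i = start; i < start + len; i++) {
            h = (h * BASE + s.charAt(i)) % MOD;
        }
        return h;
    }

    // Returns the patterns that occur in text; all patterns must share one length.
    static Set<String> findAny(String text, List<String> patterns) {
        Set<String> found = new HashSet<>();
        if (patterns.isEmpty()) return found;
        int m = patterns.get(0).length();
        Map<Long, Set<String>> byHash = new HashMap<>();
        for (String p : patterns) {
            if (m == 0 || p.length() != m) {
                throw new IllegalArgumentException("patterns must share one non-zero length");
            }
            byHash.computeIfAbsent(hash(p, 0, m), k -> new HashSet<>()).add(p);
        }
        if (text.length() < m) return found;

        long highPower = 1; // BASE^(m-1) mod MOD, the weight of the window's first char
        for (int i = 1; i < m; i++) highPower = highPower * BASE % MOD;

        long h = hash(text, 0, m);
        for (int i = 0; ; i++) {
            Set<String> candidates = byHash.get(h);
            if (candidates != null) {
                // A hash hit is only a candidate; compare characters to confirm.
                for (String p : candidates) {
                    if (text.regionMatches(i, p, 0, m)) found.add(p);
                }
            }
            if (i + m >= text.length()) break;
            // Slide the window: drop text[i], then append text[i + m].
            h = (h - text.charAt(i) * highPower % MOD + MOD) % MOD;
            h = (h * BASE + text.charAt(i + m)) % MOD;
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> hits = findAny("xxadsfyyqwer", Arrays.asList("adsf", "qwer", "zzzz"));
        System.out.println(hits.containsAll(Arrays.asList("adsf", "qwer"))); // true
        System.out.println(hits.contains("zzzz"));                           // false
    }
}
```

To answer the original question, compare the returned set against the full word list, for example with `hits.containsAll(words)`.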

Other typical topics for that interview question include using a trie structure to speed up the lookup. If you have a large set of match strings, you otherwise have to check every one of them each time. A trie makes that check more efficient. https://en.wikipedia.org/wiki/Trie

Additional algorithms are:
- Aho–Corasick https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
- Commentz-Walter https://en.wikipedia.org/wiki/Commentz-Walter_algorithm
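For a rough idea of how Aho–Corasick finds all words in a single pass over the text, here is a compact sketch. It is not a production implementation: the class name `AhoCorasickSketch` and its members are hypothetical, it works on `char` values rather than code points, and it assumes the word list contains no empty strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class AhoCorasickSketch {
    private static class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // words ending at this node
        Node fail;                                      // longest proper suffix that is also in the trie
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> words) {
        // Step 1: build a trie of all words.
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // Step 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit words that end at the suffix this node falls back to.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every word that occurs in it.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node next = node.next.get(c);
            node = (next != null) ? next : root;
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch searcher = new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        Set<String> found = searcher.findAll("ushers");
        System.out.println(found.containsAll(Arrays.asList("he", "she", "hers"))); // true
        System.out.println(found.contains("his"));                                 // false
    }
}
```

As with the simpler approaches, checking that the string contains all words comes down to comparing the returned set with the word list. Building the automaton pays off mainly when the same word list is reused across many inputs.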


I think a better approach would be something like this, where we put multiple values into one string and check each value's position with the indexOf function:

String s = "123"; 
System.out.println(s.indexOf("1")); // 0
System.out.println(s.indexOf("2")); // 1 
System.out.println(s.indexOf("5")); // -1
