Using Java regex, how can I check whether a string contains any of the words in a set?

Question

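I have a set of words, say: apple, orange, pear, banana, kiwi

I want to check whether a sentence contains any of the words listed above and, if it does, find out which word matched. How can I do this with a regex?

I am currently calling String.indexOf() for each word in my set. I assume this is less efficient than a regex match?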

Answer 1

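TL;DR: For simple substring checks, contains() is best; for matching whole words only, a regular expression is probably the better choice.

The best way to see which approach is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regex code.

To search for several different words, the regular expression looks like this: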

apple|orange|pear|banana|kiwi

The | acts as an OR in regular expressions.
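If the words live in a collection rather than in a hand-written literal, the alternation can be assembled from them. The following is a minimal sketch, not part of the original answer: the class name WordSetPattern and the helpers anyOf and firstMatch are hypothetical names chosen for illustration. Each word is passed through Pattern.quote so that any regex metacharacters it contains are matched literally.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class WordSetPattern {

    // Quotes each word and joins the results with |, giving a pattern
    // that matches any one of the words as a substring.
    static Pattern anyOf(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the leftmost word found in the sentence, or null if none is present.
    static String firstMatch(Pattern pattern, String sentence) {
        Matcher m = pattern.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>(
                Arrays.asList("apple", "orange", "pear", "banana", "kiwi"));
        Pattern p = anyOf(words);

        System.out.println(firstMatch(p, "An apple is nice"));     // apple
        System.out.println(firstMatch(p, "The quick brown fox"));  // null
    }
}
```

Note that an empty set would produce an empty pattern, which matches the empty string everywhere, so a caller would probably want to handle that case separately.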

My very simple test code looks like this:

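import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {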

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

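Obviously, the timings will vary with the number of words being searched for and the strings being searched, but for a simple search like this one, contains() appears to be roughly 10 times faster than the regular expression.

Using a regular expression to search for a string inside another string is using a sledgehammer to crack a nut, so I guess we should not be surprised that it is slower. Save regular expressions for when the pattern you want to find is more complex.

One case where you may want to use a regular expression is when indexOf() and contains() cannot do the job because you want to match whole words only and not just substrings, e.g. you want to match pear but not spears. Regular expressions handle this case well because they have the concept of word boundaries.

In that case we would change the pattern to: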

\b(apple|orange|pear|banana|kiwi)\b

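The \b says to match only at the start or end of a word, and the parentheses group the OR expression together.

Note that when you define this pattern in your code, you need to escape each backslash with another backslash: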

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
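To make the difference concrete, here is a small sketch comparing the plain alternation with the word-boundary version on the pear/spears example. The class name WholeWordDemo and the helper firstMatch are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class WholeWordDemo {

    // Matches any of the words anywhere, including inside longer words.
    static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Matches any of the words only when delimited by word boundaries.
    static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the first match in the sentence, or null if there is none.
    static String firstMatch(Pattern pattern, String sentence) {
        Matcher m = pattern.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String spears = "He sharpened the spears";

        System.out.println(firstMatch(SUBSTRING, spears));             // pear
        System.out.println(firstMatch(WHOLE_WORD, spears));            // null
        System.out.println(firstMatch(WHOLE_WORD, "A pear, please"));  // pear
    }
}
```

Both patterns are case-sensitive as written; passing Pattern.CASE_INSENSITIVE to Pattern.compile would be one way to relax that.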

Answer 2

I don't think a regex will do any better in terms of performance, but you can use it as follows:

Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
   String matched = m.group(1);
   // Do something
}
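The loop above visits every occurrence, not just the first. As a rough sketch of what "do something" might be, the version below collects each matched word into a list; the class name AllMatches and the method findAll are hypothetical names, not something from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AllMatches {

    // Collects the contents of capture group 1 for every match, in order of appearance.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(apple|orange|pear)");
        System.out.println(findAll(p, "apple pie, pear tart and more apple"));
        // prints [apple, pear, apple]
    }
}
```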

Answer 3

This is the simplest solution I have found (matching with wildcards):

boolean a = str.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
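One thing to keep in mind with this form: String.matches() requires the whole string to match, and by default . does not match line terminators, so the wildcard version can return false for multi-line input even when a word is present. The sketch below contrasts it with Matcher.find() on a precompiled pattern. The class name MatchesVersusFind and the methods viaMatches and viaFind are hypothetical, chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MatchesVersusFind {

    // Compiled once and reused; String.matches() compiles its regex on every call.
    static final Pattern WORDS =
            Pattern.compile("\\b(wordA|wordB|wordC|wordD|wordE)\\b");

    // The wildcard form from the answer: the entire string must match.
    static boolean viaMatches(String str) {
        return str.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
    }

    // find() looks for the pattern anywhere in the input, so no wildcards are needed.
    static boolean viaFind(String str) {
        return WORDS.matcher(str).find();
    }

    public static void main(String[] args) {
        String oneLine = "see wordB here";
        String twoLines = "first line\nsee wordB here";

        System.out.println(viaMatches(oneLine));   // true
        System.out.println(viaFind(oneLine));      // true
        System.out.println(viaMatches(twoLines));  // false: . stops at the newline
        System.out.println(viaFind(twoLines));     // true
    }
}
```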

3 answers

Answer 1
Score 49, accepted, 2012-03-01 12:27:19

Answer 2
Score 7, 2012-03-01 11:52:58

Answer 3
Score 4, 2017-02-13 16:37:45
