Using Java Regex, how do I check whether a string contains any of the words in a set?

Question

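I have a set of words: apple, orange, pear, banana, kiwi

I want to check whether a sentence contains any of the words listed above, and if it does, I want to find which word matched. How can I do this with a regex?

I am currently calling String.indexOf() for each word in my set. I am assuming this is not as efficient as a regex match?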

Answer 1

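TL;DR: For simple substrings contains() is best, but for matching whole words only, a regular expression is probably better.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regex code.

To search for the different words, the regular expression looks like this: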

apple|orange|pear|banana|kiwi

The | is used as an OR in regular expressions.

My very simple test code looks like this:
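If the words live in a collection rather than in a hand-written string, the alternation can be assembled from it. The following is only a minimal sketch: the class name WordSetPattern and the helpers buildPattern and firstMatch are made up for illustration, and each word is passed through Pattern.quote() on the assumption that the words should be matched literally.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
public class WordSetPattern {

    // Joins the words with | so the pattern matches any one of them.
    // Pattern.quote() makes regex metacharacters inside a word literal.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the first word from the set found in the sentence, or null.
    static String firstMatch(Pattern pattern, String sentence) {
        Matcher m = pattern.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>(
                Arrays.asList("apple", "orange", "pear", "banana", "kiwi"));
        Pattern p = buildPattern(words);

        System.out.println(firstMatch(p, "An apple is nice"));    // prints "apple"
        System.out.println(firstMatch(p, "The quick brown fox")); // prints "null"
    }
}
```

Note that an empty set would produce an empty pattern, which matches everywhere, so a caller would probably want to guard against that case.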

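import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {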

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

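Obviously the timings will vary depending on the number of words being searched for and the strings being searched, but for a simple search like this, contains() seems to be roughly 10 times faster than a regular expression.

By using a regular expression to search for a string inside another string, you are using a sledgehammer to crack a nut, so I guess we shouldn't be surprised that it is slower. Save regular expressions for when the patterns you want to find are more complex.

One case where you may want to use a regular expression is when indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular expressions handle this case well because they have the concept of word boundaries.

In this case we'd change our pattern to: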

\b(apple|orange|pear|banana|kiwi)\b

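The \b says to only match at the start or end of a word, and the brackets group the OR expressions together.

Note that when defining this pattern in your code, you need to escape each backslash with another backslash: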

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
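To make the difference concrete, here is a small sketch that runs both patterns against the spears example. The class name WholeWordDemo and the helper find are invented for this illustration; the two patterns are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class WholeWordDemo {

    // Matches the words anywhere, including inside longer words.
    static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Matches the words only when they stand alone between word boundaries.
    static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the first match of the pattern in the sentence, or null.
    static String find(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String spears = "The guards carried spears";

        System.out.println(find(SUBSTRING, spears));           // prints "pear"
        System.out.println(find(WHOLE_WORD, spears));          // prints "null"
        System.out.println(find(WHOLE_WORD, "I ate a pear.")); // prints "pear"
    }
}
```

The whole-word pattern is also strict about plurals and case: it would not match "pears" or "Pear" unless the pattern is extended or compiled with Pattern.CASE_INSENSITIVE.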

Answer 2

I don't think a regular expression will do a better job in terms of performance, but you can use it as follows:

Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
   String matched = m.group(1);
   // Do something
}
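As one possible way to fill in the "Do something" step, the loop can collect every matched word into a list. This is just a sketch: the class name AllMatches and the method findAll are hypothetical, and the pattern is the three-word one from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class AllMatches {

    // Compiled once and reused for every input string.
    private static final Pattern FRUIT = Pattern.compile("(apple|orange|pear)");

    // Returns every occurrence of a listed word, in order of appearance.
    static List<String> findAll(String inputString) {
        List<String> found = new ArrayList<>();
        Matcher m = FRUIT.matcher(inputString);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the parenthesised alternation
        }
        return found;
    }

    public static void main(String[] args) {
        // prints "[pear, apple, pear]"
        System.out.println(findAll("pear, apple and another pear"));
    }
}
```

Because find() resumes after the previous match, repeated words are reported once per occurrence; collecting into a Set instead would give the distinct words.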

Answer 3

This is the simplest solution I found (matching with wildcards):

boolean a = str.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
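String.matches() has to match the entire string, which is why the pattern is wrapped in .* on both sides, and it compiles the regex again on every call. The sketch below compares it with a precompiled Pattern used through Matcher.find(); the class name MatchesVersusFind and both method names are invented for this comparison, and wordA to wordE are the placeholder words from the one-liner.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class MatchesVersusFind {

    // Compiled once; find() searches for the pattern anywhere in the input.
    private static final Pattern WORDS =
            Pattern.compile("\\b(wordA|wordB|wordC|wordD|wordE)\\b");

    // The one-liner from the answer: whole-string match with .* wildcards.
    static boolean containsWithMatches(String str) {
        return str.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
    }

    // Same check without wildcards, using the precompiled pattern.
    static boolean containsWithFind(String str) {
        return WORDS.matcher(str).find();
    }

    public static void main(String[] args) {
        String oneLine = "this has wordB inside";
        String twoLines = "first line\nwordC on the second line";

        System.out.println(containsWithMatches(oneLine));  // prints "true"
        System.out.println(containsWithFind(oneLine));     // prints "true"

        // By default . does not match line terminators, so .* cannot
        // span the newline and the whole-string match fails.
        System.out.println(containsWithMatches(twoLines)); // prints "false"
        System.out.println(containsWithFind(twoLines));    // prints "true"
    }
}
```

If the matches() form is preferred, compiling the pattern with Pattern.DOTALL (or prefixing it with (?s)) lets .* cross line breaks.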

3 answers

Answer 1: score 49, accepted, 2012-03-01 12:27:19
Answer 2: score 7, 2012-03-01 11:52:58
Answer 3: score 4, 2017-02-13 16:37:45
