Scanning files and collecting whole words that match a pattern

Question

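I am working on a project in which I need to scan a folder and search every file for a particular word (say "@MyPattern").

I am looking for the best way to design such a solution. To start with, I have been working along these lines: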

    //Read File
    List<String> lines = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
        stream.forEach(line-> lines.add(line));
    } catch (IOException e) {
        e.printStackTrace();
    }

    //Create a pattern to find for
    Predicate<String> patternFilter = Pattern
            .compile("@MyPattern^(.+)")
            .asPredicate();

    //Apply predicate filter
    List<String> desiredWordsMatchingPattern = lines
            .stream()
            .filter(patternFilter)
            .collect(Collectors.<String>toList());

    //Perform desired operation
    desiredWordsMatchingPattern.forEach(System.out::println);

I am not sure why this does not work, even though the file contains several words matching "@MyPattern".

Answer 1

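Using ^(.+) this way in a regular expression makes no sense. ^ matches the beginning of the string (the line), but the beginning of the string cannot come after your pattern (it could only do so if the pattern matched the empty string, which is not the case here). Therefore your regex can never match any line.

Just use: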

        Predicate<String> patternFilter = Pattern
                .compile("@MyPattern")
                .asPredicate();

If you require that no characters at all (not even a space) follow the pattern, $ matches the end of the string:

        Predicate<String> patternFilter = Pattern
                .compile("@MyPattern$")
                .asPredicate();
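
To see the difference between the three patterns side by side, here is a minimal sketch. The class name PredicateDemo, the helper filter and the sample lines are made up for illustration, and List.of assumes Java 9 or later. Note that asPredicate() tests with find(), so a pattern only has to occur somewhere in the line.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PredicateDemo {

    // Keeps only the lines in which the regex is found somewhere
    // (asPredicate() behaves like Matcher.find(), not Matcher.matches()).
    static List<String> filter(String regex, List<String> lines) {
        Predicate<String> p = Pattern.compile(regex).asPredicate();
        return lines.stream().filter(p).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        List<String> lines = List.of(
                "@MyPattern(\"10869\")",
                "some other line",
                "trailing @MyPattern");

        // ^ after the literal text can never be satisfied: prints []
        System.out.println(filter("@MyPattern^(.+)", lines));
        // Found in the first and the third line
        System.out.println(filter("@MyPattern", lines));
        // Only the third line ends with the literal text
        System.out.println(filter("@MyPattern$", lines));
    }
}
```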

Answer 2

Here is my solution:

    // can extract annotation and text-inside-parentheses
    private static final String REGEX = "@(\\w+)\\((.+)\\)";


    //Read File
    List<String> lines = Files.readAllLines(Paths.get(filename));

    //Create a pattern to find for
    Pattern pattern = Pattern.compile(REGEX);

    // extractor function uses pattern's second group (text-within-parentheses)
    Function<String, String> extractOnlyTextWithinParentheses = s -> {
        Matcher m = pattern.matcher(s);
        m.find();
        return m.group(2);
    };

    // all lines are filtered and text will be extracted using extractor-fn
    Stream<String> streamOfExtracted = lines.stream()
            .filter(pattern.asPredicate())
            .map(extractOnlyTextWithinParentheses);

    //Perform desired operation
    streamOfExtracted.forEach(System.out::println);

Explanation:

First, let's clarify what the regex pattern used above, @(\\w+)\\((.+)\\), is supposed to do:

ASSUMING: you are filtering text for Java-like annotations (such as @MyPattern).

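Matching specific lines using a regex

@\\w+ matches the @ sign followed by a word (\\w has a special meaning: it stands for a word character, i.e. a letter, digit or underscore). So it will match any annotation (e.g. @Trace, @User, etc.).
\\(.+\\) matches some text inside parentheses (e.g. ("10869")). The literal parentheses must be escaped as \\( and \\), and .+ matches any non-empty text between them.

Note: unescaped parentheses have a special meaning in any regex, namely grouping and capturing.

To match parentheses and extract their contents, see this answer on Pattern to extract text between parentheses.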

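Extracting text using capture groups inside the regex

Simply use (unescaped) parentheses to form a group and remember its ordinal number. (grouped)(Regex) will match the text groupedRegex and lets you extract two groups:

Group #1: grouped
Group #2: Regex

To get these groups, use matcher.find() and then matcher.group() or its overloaded methods.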
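
As a small illustration of group numbering, the sketch below runs both the toy regex and the annotation regex from the solution. The class GroupDemo, the helper group and the sample line are hypothetical names chosen for this example. Unlike the extractor function above, the helper checks the result of find() and returns null when there is no match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    // Returns the requested capture group, or null if the regex is not found.
    // Group 0 is always the whole match; groups 1..n follow the opening parentheses.
    static String group(String regex, String text, int index) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        System.out.println(group("(grouped)(Regex)", "groupedRegex", 1)); // grouped
        System.out.println(group("(grouped)(Regex)", "groupedRegex", 2)); // Regex

        String regex = "@(\\w+)\\((.+)\\)";
        String line = "@MyPattern(\"10869\")"; // hypothetical sample line
        System.out.println(group(regex, line, 0)); // @MyPattern("10869")
        System.out.println(group(regex, line, 1)); // MyPattern
        System.out.println(group(regex, line, 2)); // "10869"
    }
}
```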

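Options to test the regex and the extraction

In IntelliJ you can use the Check RegExp action: press ALT+Enter on the selected regex to test and tune it. Similarly, there are many websites for testing regexes. For example, http://www.regExPlanet.com also supports Java regex syntax, and you can verify the extracted groups online. See the example on RegexPlanet.

Note: besides marking the start of the string, as Ole mentioned above, the caret has another special meaning: [^)]+ means match all characters (at least one) except a closing parenthesis.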
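
The difference matters as soon as a line contains more than one pair of parentheses. The following sketch compares the greedy .+ with the negated character class [^)]+. The class CaretDemo, the helper inside and the sample line are assumptions made for this illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretDemo {

    // Returns capture group 1 of the first match, or null if nothing is found.
    static String inside(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "@A(x) and @B(y)"; // hypothetical line with two annotations

        // Greedy .+ runs up to the LAST closing parenthesis on the line.
        System.out.println(inside("\\((.+)\\)", line));     // x) and @B(y

        // [^)]+ stops at the FIRST closing parenthesis.
        System.out.println(inside("\\(([^)]+)\\)", line));  // x
    }
}
```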

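Making it extensible with an extractor function

If you replace the extractor function passed to .map(..) above with the following one, you can print both the annotation name and the text inside the parentheses (tab-separated):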

    Function<String, String> extractAnnotationAndTextWithinParentheses = s -> {
        Matcher m = pattern.matcher(s);
        m.find();
        StringBuilder sb = new StringBuilder();
        int lastGroup = m.groupCount();
        for (int i = 1; i <= lastGroup; i++) {
            sb.append(m.group(i));
            if (i < lastGroup) sb.append("\t");
        }
        return sb.toString();
    };

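Summary:

Your stream processing works. Your regex had an error:

it almost matched the constant annotation, i.e. @MyPattern
your attempt to capture with parentheses was the right idea
the caret ^ inside the regex is a syntax error or a typo
without the escaped parentheses \\( and \\) you would extract the parentheses as well, not only the text inside them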
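
For the folder-scanning part of the question, the pieces above could be combined roughly as follows. This is only a sketch under several assumptions: the class FolderScan and its method names are invented here, the files are assumed to be UTF-8 text, a "word" is taken to be the annotation together with its parenthesised argument, and Matcher.results() and Path.of require Java 9 and Java 11 respectively.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FolderScan {

    // Assumption: we want the annotation name plus its parenthesised argument.
    static final Pattern ANNOTATION = Pattern.compile("@(\\w+)\\(([^)]+)\\)");

    // Collects every match of ANNOTATION in one file, in order of appearance.
    static List<String> matchesIn(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .flatMap(line -> ANNOTATION.matcher(line).results())
                    .map(MatchResult::group)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Walks the folder recursively and collects the matches of all regular files.
    static List<String> scan(Path folder) throws IOException {
        try (Stream<Path> paths = Files.walk(folder)) {
            return paths
                    .filter(Files::isRegularFile)
                    .sorted() // deterministic file order
                    .flatMap(p -> matchesIn(p).stream())
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path folder = Path.of(args.length > 0 ? args[0] : ".");
        scan(folder).forEach(System.out::println);
    }
}
```

Using results() yields every match on a line, not just the first one, so several annotations on the same line are all collected.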

Answer 1
2 2019-01-31 12:15:47

Answer 2
2 Accepted 2019-01-31 15:40:23
