Searching for phrases in a text file using Regex in C#

Question

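Task:

Write a program that counts phrases in a text file. Any sequence of characters can be used as a phrase to count, even a sequence that contains separators. For example, in the text "I am a student in Sofia", the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.

I know solutions that use string.IndexOf, LINQ, or an algorithm such as Aho-Corasick. I want to do the same thing with Regex.

This is what I have so far: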

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace CountThePhrasesInATextFile
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = ReadInput("file.txt");
            input.ToLower();
            List<string> phrases = new List<string>();
            using (StreamReader reader = new StreamReader("words.txt"))
            {
                string line = reader.ReadLine();
                while (line != null)
                {
                    phrases.Add(line.Trim());
                    line = reader.ReadLine();
                }
            }
            foreach (string phrase in phrases)
            {
                Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
                int matches = regex.Matches(input).Count;
                Console.WriteLine(phrase + " ----> " + matches);
            }
        }

        private static string ReadInput(string fileName)
        {
            string output;
            using (StreamReader reader = new StreamReader(fileName))
            {
                output  = reader.ReadToEnd();
            }
            return output;
        }
    }
}

I know my regular expression is incorrect, but I don't know what to change.

Output:

Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0

Correct output:

Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3

file.txt contains:

Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD

words.txt contains:

Word
S
MissingWord
DS
aa

Answer 1

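Here is what is happening. I'll use Word as the example.

The regex you build for "word" is ".*word.*". It tells the regex engine to match anything that starts with anything, contains "word", and ends with anything. Since . does not match a line break by default, each line can produce at most one such match.

For your input, it matches the whole first line:

Word? We have few words: first word, second word, third word.

One way to read that match is: it starts with "Word? We have few words: first ", contains "word", and ends with ", second word, third word." (Because .* is greedy, the engine actually settles on the last "word" in the line, but the match covers the entire line either way.)

Then it matches the second line, which starts with "Some pass", contains "word", and ends with "s: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD".

So the count is 2.

The regex you want is simple: the string "word" on its own is enough.

Update:

To ignore case, try "(?i)word".

To get multiple matches in AAaA, try "(?i)(?<=a)a". This matches every a that is immediately preceded by another a, so each overlapping occurrence of aa is counted.

(?<=...) is a zero-width positive lookbehind assertion.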
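
The difference between these patterns can be checked with a small sketch. The class and method names below are hypothetical, and the sample string is the file.txt content from the question joined with a \n line break, which is an assumption about the file's line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Sample text copied from the question's file.txt (line break assumed to be \n).
    public const string Sample =
        "Word? We have few words: first word, second word, third word.\n" +
        "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";

    // Returns the number of matches Regex.Matches reports for the pattern.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // One match per line that contains lowercase "word".
        Console.WriteLine(CountMatches(Sample, ".*word.*"));     // 2
        // Plain pattern, case-insensitive: every occurrence is a separate match.
        Console.WriteLine(CountMatches(Sample, "(?i)word"));     // 9
        // Matches consume characters, so "AAaA" yields only two non-overlapping hits.
        Console.WriteLine(CountMatches(Sample, "(?i)aa"));       // 2
        // The lookbehind consumes only one character per match, so overlaps are counted.
        Console.WriteLine(CountMatches(Sample, "(?i)(?<=a)a"));  // 3
    }
}
```

Note that the lookbehind pattern is written by hand for the phrase "aa"; it is not a general recipe for arbitrary phrases.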

Answer 2

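You need to post the contents of file.txt first; otherwise it is hard to verify that the regex works.

That said, take a look at the Regex answer here: "Finding all positions of substring in a larger string in C#", and see whether it helps with your code.

Edit:

There is a simple solution: wrap each phrase in "(?=(" and "))". This is a lookahead assertion in regex. Because a lookahead consumes no characters, overlapping occurrences are counted as well. The following code does what is needed.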

        foreach (string phrase in phrases) {
            string MatchPhrase = "(?=(" + phrase.ToLower() + "))";
            int matches = Regex.Matches(input, MatchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }

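You also have a problem here:

input.ToLower();

should be changed to

input = input.ToLower();

because strings in C# are immutable: ToLower returns a new string and leaves the original unchanged. Overall, your code should be: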

    static void Main(string[] args) {
        string input = ReadInput("file.txt");
        input = input.ToLower();
        List<string> phrases = new List<string>();
        using (StreamReader reader = new StreamReader("words.txt")) {
            string line = reader.ReadLine();
            while (line != null) {
                phrases.Add(line.Trim());
                line = reader.ReadLine();
            }
        }
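        foreach (string phrase in phrases) {
            // The lookahead (?=(...)) is zero-width, so overlapping occurrences are all counted.
            string MatchPhrase = "(?=(" + phrase.ToLower() + "))";
            int matches = Regex.Matches(input, MatchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }
        // Keeps the console window open for 50 seconds; requires "using System.Threading;".
        Thread.Sleep(50000);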
    }

    private static string ReadInput(string fileName) {
        string output;
        using (StreamReader reader = new StreamReader(fileName)) {
            output = reader.ReadToEnd();
        }
        return output;
    }
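
Since the task allows any character sequence as a phrase, a phrase may contain regex metacharacters such as ? or ., which the code above would interpret as pattern syntax. A minimal sketch of one way to guard against that follows. The class and method names are hypothetical, and the use of Regex.Escape and RegexOptions.IgnoreCase is an addition that is not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadCounter
{
    // Counts every position at which the phrase starts, including overlapping ones.
    public static int CountOccurrences(string input, string phrase)
    {
        // An empty phrase would match at every position, so treat it as "not found".
        if (string.IsNullOrEmpty(phrase)) return 0;

        // Regex.Escape makes the phrase literal; the lookahead keeps each match zero-width.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOccurrences("AAaA", "aa"));                         // 3
        // "?" is matched as a literal question mark, not as a quantifier.
        Console.WriteLine(CountOccurrences("first word, second word?", "word?"));  // 1
    }
}
```

With RegexOptions.IgnoreCase the ToLower calls are no longer needed, although keeping them would not change the counts for the sample input.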

Answer 3

Try this code:

string input = File.ReadAllText("file.txt");

foreach (string word in File.ReadLines("words.txt"))
{
    var regex = new Regex(word, RegexOptions.IgnoreCase);
    int startat = 0;
    int count = 0;

    Match match = regex.Match(input, startat);
    while (match.Success)
    {
        count++;
        startat = match.Index + 1;
        match = regex.Match(input, startat);
    }

    Console.WriteLine(word + "\t" + count);
}

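To correctly find all overlapping substrings such as "aa", you need the overload of the Match method that takes a startat parameter.

Note the RegexOptions.IgnoreCase argument.

Shorter, but less clear, code: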

Match match;
while ((match = regex.Match(input, startat)).Success)
{
    count++;
    startat = match.Index + 1;
}
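
To see why restarting one character after each match matters, the loop can be wrapped in a function and compared with a plain Regex.Matches count. This is only a sketch: the class and method names are hypothetical, and Regex.Escape is an addition that is not in the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StartAtCounter
{
    // Restarts the search one character after each match start, so overlaps are counted.
    public static int CountOverlapping(string input, string phrase)
    {
        // An empty phrase would match everywhere and push startat past the end of the input.
        if (string.IsNullOrEmpty(phrase)) return 0;

        var regex = new Regex(Regex.Escape(phrase), RegexOptions.IgnoreCase);
        int count = 0;
        Match match = regex.Match(input);
        while (match.Success)
        {
            count++;
            match = regex.Match(input, match.Index + 1);
        }
        return count;
    }

    // For comparison: Regex.Matches resumes after the end of each match, so overlaps are skipped.
    public static int CountNonOverlapping(string input, string phrase)
    {
        if (string.IsNullOrEmpty(phrase)) return 0;
        return Regex.Matches(input, Regex.Escape(phrase), RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOverlapping("AAaA", "aa"));     // 3
        Console.WriteLine(CountNonOverlapping("AAaA", "aa"));  // 2
    }
}
```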

3 answers

Answer 1: score 2, 2016-08-15 15:55:45

Answer 2: score 1, accepted, 2016-08-15 15:51:41

Answer 3: score 1, 2016-08-15 17:29:54
