Remove all lines (comments) starting with "**" using a regular expression (.NET Framework, C#)

Question

I am developing an application that reads and processes text files. These text files have the following structure:

** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6

I load the whole text file into memory using string text = File.ReadAllText(file); but I want to remove all comment lines, i.e. all lines starting with "**".

This can be achieved with the following method:

// this method also removes any white-spaces (this is intended)
string RemoveComments(string textWithComments)
{
    string textWithoutComments = null;

    string[] split = Regex.Split(textWithComments.Replace(" ", null), "\r\n|\r|\n");
    foreach (string line in split)
        if (line.Length >= 2 && line[0] == '*' && line[1] == '*') continue;
        else textWithoutComments += line + "\r\n";

    return textWithoutComments;
}

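However, this is really slow for large files. I also think the whole method could be replaced by a single line of code, probably using a regular expression. How can I do that? (I have never used regular expressions before.)

PS: I would also like to avoid StreamReaders.

Edit

A sample file looks like this: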

** Initial comment
*Command-0
** Some Comment: Header: Text
** Some text: text
*Command-1
**
** Some comment or text
**
*Command-2
*Command-3
      1,            2,            3
      2,            2,            4
      3,            2,            5
** END COMMENT

Answer 1

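Concatenating strings reallocates memory every time the size of the string changes.

StringBuilder reallocates far less often and will significantly reduce the run time: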

string RemoveComments(string textWithComments)
{
    StringBuilder textWithoutComments = new StringBuilder();

    string[] split = textWithComments.Replace(" ", null).Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    foreach (string line in split)
        if (line.Length >= 2 && line[0] == '*' && line[1] == '*') continue;
        else textWithoutComments.Append(line + "\r\n");

    return textWithoutComments.ToString();
}

Edited following Aluan's suggestion.
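If the intermediate array from Split is also a concern, the same idea can be sketched with a StringReader, which walks the in-memory string line by line (it is not a StreamReader and does not touch the file). The class name CommentFilter is made up for this illustration, and the handling of a trailing line break at the very end of the input may differ slightly from the method above.

```csharp
using System;
using System.IO;
using System.Text;

static class CommentFilter
{
    // Hypothetical helper: removes spaces, then drops every line that starts with "**".
    public static string RemoveComments(string textWithComments)
    {
        // The input length is only used as a starting capacity for the builder.
        var result = new StringBuilder(textWithComments.Length);

        using (var reader = new StringReader(textWithComments))
        {
            string line;
            // ReadLine accepts \r\n, \r and \n as line terminators.
            while ((line = reader.ReadLine()) != null)
            {
                string stripped = line.Replace(" ", string.Empty);
                if (stripped.StartsWith("**", StringComparison.Ordinal))
                    continue; // comment line, skip it

                result.Append(stripped).Append("\r\n");
            }
        }

        return result.ToString();
    }

    static void Main()
    {
        string text = File.ReadAllText("input.txt"); // assumed file name
        Console.Write(RemoveComments(text));
    }
}
```

This keeps the File.ReadAllText approach from the question and only changes how the string is processed afterwards.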

Answer 2

Why not just:

var text = @"** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6";

// Version 1: leaves a \n at the beginning of the string if the text starts with a comment.
var textWithoutComments = Regex.Replace(text, @"(^|\n)\*\*.*(?=\n)", string.Empty);

// Version 2 (use instead of version 1): deals with that problem, at the cost of a longer regex that
// treats the first line differently from the other lines (it consumes the line break rather than leaving it in the text).
textWithoutComments = Regex.Replace(text, @"(^\*\*.*\r\n)|((\r\n)\*\*.*($|(?=\r\n)))", string.Empty);

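I don't know about the performance; I have no test data prepared...

PS: I am also inclined to believe that if you want the best performance, some kind of streaming is probably ideal. You can always return a string from the method if that makes the later processing easier. I think most people in this thread are suggesting StreamReader for the iterating/reading/interpreting part, regardless of the return type you decide to build.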
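For what it is worth, a single pattern with RegexOptions.Multiline might cover both cases above, because ^ then matches at the start of every line and not only at the start of the text. This is just a sketch under that assumption, not something I have benchmarked; the class name RegexCommentRemover and the file name are made up for the example.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class RegexCommentRemover
{
    // ^            start of a line (because of RegexOptions.Multiline)
    // \*\*         the two literal asterisks that mark a comment
    // [^\r\n]*     the rest of the comment line
    // (?:...)      the line break that ends the line, or the end of the text
    private static readonly Regex CommentLine = new Regex(
        @"^\*\*[^\r\n]*(?:\r\n|\n|\z)",
        RegexOptions.Multiline);

    public static string RemoveComments(string text)
    {
        return CommentLine.Replace(text, string.Empty);
    }

    static void Main()
    {
        string text = File.ReadAllText("input.txt"); // assumed file name
        // Strip spaces first, as the method in the question does, then drop the comment lines.
        Console.Write(RemoveComments(text.Replace(" ", string.Empty)));
    }
}
```

Because the pattern consumes the line break together with the comment, no empty line is left behind, and both \r\n and \n line endings are accepted.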

Answer 3

I know you said you don't want to use a StreamReader, but the code below processes 400,000 lines in under half a second on my machine. It is simple, direct, and fast.

static void RemoveCommentsAndWhitespace(string filePath)
{
    if (!File.Exists(filePath))
    {
        Console.WriteLine($"ERR: The file '{filePath}' does not exist.");
        return; // nothing to process
    }

    string outfile = filePath + ".out";

    using StreamReader sr = new StreamReader(filePath);
    using StreamWriter sw = new StreamWriter(outfile);
    string line;

    while ((line = sr.ReadLine()) != null)
    {
        string tmp = line.Replace(" ", string.Empty);
        if (tmp.StartsWith("**"))
        {
            continue;
        }

        sw.WriteLine(tmp);
    }

    Console.WriteLine($"Wrote to {outfile}.");
}
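If the objection is only to handling the reader by hand, roughly the same streaming behaviour can be sketched with File.ReadLines, which enumerates the file lazily, and File.WriteAllLines. The class name LazyCommentRemover is made up for this example, and the output path has to differ from the input path because the input is still being read while the output is written.

```csharp
using System;
using System.IO;
using System.Linq;

static class LazyCommentRemover
{
    // Hypothetical helper: copies inputPath to outputPath without spaces and without "**" lines.
    public static void RemoveCommentsAndWhitespace(string inputPath, string outputPath)
    {
        // File.ReadLines yields one line at a time, so the whole file is never held in memory.
        var keptLines = File.ReadLines(inputPath)
            .Select(line => line.Replace(" ", string.Empty))
            .Where(line => !line.StartsWith("**", StringComparison.Ordinal));

        File.WriteAllLines(outputPath, keptLines);
    }

    static void Main(string[] args)
    {
        // Assumed usage: the first argument is the file to clean.
        RemoveCommentsAndWhitespace(args[0], args[0] + ".out");
    }
}
```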
