How to strip print control codes (PCL type) from a document using regular expressions

Question

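I have an archive of PCL files. I want to build a console application that reads a file, strips out all the print control codes, and writes those codes to a separate file, leaving the rest of the document unchanged. I think I can do this with a regex, but I'm not sure how to go about it. My language of choice is C#. Any advice you can offer would be greatly appreciated.

I have made some progress: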

    public static string RemoveBetween(string s, char begin, char end)
    {
        Regex regex = new Regex(string.Format("\\{0}.*?{1}", begin, end));
        return regex.Replace(s, string.Empty);
    }

    public static string[] getPclCodes(string line)
    {
        string pattern = "\\x1B.*?H";
        string[] pclCodes = Regex.Split(line, pattern);

        return pclCodes;
    }

But the codes come back as empty strings. I can strip them out of the PCL and write a txt file, but I need the codes as well. I call getPclCodes before RemoveBetween. Any ideas?

Answer 1

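If I understand correctly, this should do the trick. I modified your method to accept the line you want the pattern to scan plus an out parameter of type MatchCollection. That way, the matches are assigned to that parameter before the line is split.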

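    public static string[] getPclCodes(string line, out MatchCollection codes)
    {
        string pattern = "\\x1B.*?H";

        // Capture the matched control codes before splitting.
        Regex regex = new Regex(pattern);
        codes = regex.Matches(line);

        // Split returns the text between the matches, not the matches themselves.
        string[] pclCodes = regex.Split(line);

        return pclCodes;
    }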

So now, in your Main, or wherever you call getPclCodes, you can do something like this:

        MatchCollection matches;
        string[] codes = getPclCodes(codeString, out matches);

        foreach (Match match in matches)
            Console.WriteLine(match.Value);

I'm sure there is a better way, but then again this works... if we're on the same page.
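As one possible alternative to calling Matches and Split separately, the codes can be collected and removed in a single pass by giving Regex.Replace a match evaluator. This is only a sketch: the class name PclSeparator and the method Separate are hypothetical, and the pattern is simply the asker's own (ESC up to the first 'H'), which will not cover every PCL sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class PclSeparator
{
    // The asker's pattern: ESC, then as few characters as possible, up to the first 'H'.
    private static readonly Regex CodePattern = new Regex(@"\x1B.*?H");

    // Returns the input with every match removed. Each removed code is
    // appended to 'codes' in the order it appears in the input.
    public static string Separate(string line, List<string> codes)
    {
        return CodePattern.Replace(line, match =>
        {
            codes.Add(match.Value); // keep the code for the separate file
            return string.Empty;    // drop it from the remaining text
        });
    }
}
```

Called with a List<string>, Separate returns the cleaned line, and the list ends up holding the codes that could then be written to their own file.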

Answer 2

The OP presumably wants C#, but if anyone else just wants to do this with GNU sed, this works:

    sed 's/\x1B[^][@A-Z^\\]*[][@A-Z^\\]//g'

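How it works: on each line, find and remove any sequence of characters that starts with ESC (\x1B) and continues up to and including the first ASCII character in the range 64-94 (i.e. A-Z or any of @[\]^). The trailing g means every such sequence on the line is replaced, not just the first.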
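For readers who want the same character-class idea in C#, a rough equivalent might look like the sketch below. The class name PclEscapeStripper and the method Strip are hypothetical. One difference to keep in mind: sed works line by line, whereas the negated class here also matches newlines, so an ESC with no terminator on its own line could match across a line break.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class PclEscapeStripper
{
    // ESC, then any run of characters outside ASCII 64-94 (0x40-0x5E),
    // then one terminating character inside that range.
    private static readonly Regex EscapeSequence =
        new Regex(@"\x1B[^\x40-\x5E]*[\x40-\x5E]");

    // Removes every escape sequence matched by the pattern above.
    public static string Strip(string text)
    {
        return EscapeSequence.Replace(text, string.Empty);
    }
}
```

For example, PclEscapeStripper.Strip("\u001BEHello\u001B&l0O") returns "Hello": the first sequence ends at E and the second at O.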
