Convert HTML to plain text using C#

Question

I am using this method to convert HTML to plain text, but it produces wrong output for heading tags such as <h1>, <h2>, <h3>, ...

Method:

public string HtmlToPlainText(string htmlText)
    {
        //const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
        const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
        const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
        var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
        var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
        //var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

        var text = htmlText;
        //Decode html specific characters
        text = System.Net.WebUtility.HtmlDecode(text);
        //Remove tag whitespace / line breaks
        //text = tagWhiteSpaceRegex.Replace(text, "><");
        //Replace < br /> with line breaks
        text = lineBreakRegex.Replace(text, Environment.NewLine);
        //Strip formatting
        text = stripFormattingRegex.Replace(text, string.Empty);
        return text;
    }

This is my HTML text:
 <h3> This is a simple title </h3> </br> <p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>
This is my result:

This is a simple title Lorem ipsum dolor sit amet consectetur, adipisicing elit.

The result should be:

This is a simple title

Lorem ipsum dolor sit amet consectetur, adipisicing elit.

I think the error comes from the stripFormatting step. How can I fix it?

Answer 1

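Parsing HTML is not trivial (even for a subset of HTML). Regex may feel like a good fit for this task, but it really is not. To parse HTML, you should use... an HTML parser. In C#, AngleSharp and HtmlAgilityPack are the most common choices. Here is an example with AngleSharp: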

using System;
using AngleSharp;
using AngleSharp.Html.Parser;

class MyClass {
    static void Main() {
        //Use the default configuration for AngleSharp
        var config = Configuration.Default;

        //Create a new context for evaluating webpages with the given config
        var context = BrowsingContext.New(config);

        //Source to be parsed
        var source = @"<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>
";

        //Create a parser to specify the document to load (here from our fixed string)
        var parser = context.GetService<IHtmlParser>();
        var document = parser.ParseDocument(source);

        //Do something with document like the following
        Console.WriteLine(document.DocumentElement.TextContent);
    }
}
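
If adding a parser dependency is not an option, the method from the question can be patched as a stopgap. The following is only a sketch and not part of the original answer: the class name `HtmlText`, the method name `ToPlainText`, and the list of block-level tags are hypothetical choices made for illustration. It differs from the question's code in three ways: closing block tags such as </h3> and </p> become line breaks, the malformed </br> from the sample input is also treated as a line break, and entities are decoded only after the tags have been stripped. It remains as fragile as any regex approach (a raw '<' inside text will still confuse it), so a real parser is still the safer route.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Assumption: only these closing block-level tags should end a line.
    private static readonly Regex BlockEnd =
        new Regex(@"</(h[1-6]|p|div|li|tr)\s*>", RegexOptions.IgnoreCase);

    // Matches <br>, <br/>, <br /> and the malformed </br> seen in the sample input.
    private static readonly Regex LineBreak =
        new Regex(@"</?br\s*/?>", RegexOptions.IgnoreCase);

    // Any remaining tag, even when the closing '>' is missing (same pattern as the question).
    private static readonly Regex AnyTag = new Regex(@"<[^>]*(>|$)");

    // Runs of whitespace inside a single line.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        // Turn block ends and <br> variants into newlines before stripping tags.
        var text = BlockEnd.Replace(html, "\n");
        text = LineBreak.Replace(text, "\n");
        text = AnyTag.Replace(text, string.Empty);

        // Decode entities last, so that "&lt;b&gt;" stays literal text instead of becoming a tag.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace within each line, trim it, and drop empty lines.
        var lines = text.Split('\n')
            .Select(line => Whitespace.Replace(line, " ").Trim())
            .Where(line => line.Length > 0);

        return string.Join("\n", lines);
    }
}

public static class Program
{
    public static void Main()
    {
        var html = " <h3> This is a simple title </h3> </br> <p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>";
        Console.WriteLine(HtmlText.ToPlainText(html));
    }
}
```

For the HTML in the question, this puts the title and the paragraph on separate lines. Joining with "\n\n" instead of "\n" would give a blank line between blocks, if that is the layout wanted.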



Solution 1
2 2021-12-25 13:14:49
