Best way to search for a plain-text string within an HTML string in C#?

Question

Here is the HTML string:

 string htmlString = "<body lang=\"EN-US\" link=\"blue\" vlink=\"#954F72\"><div class=\"WordSection1\"><p class=\"MsoNormal\">Hi, </p><p class=\"MsoNormal\"><o:p>&nbsp;</o:p></p><p class=\"MsoNormal\"><o:p>&nbsp;</o:p></p><p class=\"MsoNormal\">My name is Gaurav Illness.</p><p class=\"MsoNormal\"><span style=\"color:purple !important\">Today <b>MY&nbsp;&nbsp;&nbsp;&nbsp; relation</b>ship breakdown <span style=\"color:red\">happened?<o:p></o:p></span> </span></p><p class=\"MsoNormal\"><span style=\"color:red\"><o:p>&nbsp;</o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p><p class=\"MsoNormal\"><o:p>&nbsp;</o:p></p><p class=\"MsoNormal\" style=\"line-height:16.5pt\"><span style=\"font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D\">Thanks<span style=\"text-transform:uppercase\">,<o:p></o:p>";

I am using this function to extract the plain text from it:

    // Requires: using System.Text; using System.Text.RegularExpressions;
    private static string extractTextFromHtml(string html)
    {
        // Remove new lines since they are not visible in HTML
        html = html.Replace("\n", " ");

        // Remove tab spaces
        html = html.Replace("\t", " ");

        // Remove multiple white spaces from HTML
        html = Regex.Replace(html, "\\s+", " ");

        // Remove HEAD tag
        html = Regex.Replace(html, "<head.*?</head>", ""
                            , RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove any JavaScript
        html = Regex.Replace(html, "<script.*?</script>", ""
          , RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace special characters like &, <, >, " etc.
        StringBuilder sbHTML = new StringBuilder(html);
        // Note: There are many more special characters, these are just
        // most common. You can add new characters in this arrays if needed
        string[] OldWords = {"&nbsp;", "&amp;", "&quot;", "&lt;", "&gt;", "&reg;", "&copy;", "&bull;", "&trade;"};
        string[] NewWords = { " ", "&", "\"", "<", ">", "®", "©", "•", "™" };
        for (int i = 0; i < OldWords.Length; i++)
        {
            sbHTML.Replace(OldWords[i], NewWords[i]);
        }

        // Check if there are line breaks (<br>) or paragraph (<p>)
        sbHTML.Replace("<br>", "\n<br>");
        sbHTML.Replace("<br ", "\n<br ");
        sbHTML.Replace("<p ", "\n<p ");

        // Finally, remove all HTML tags and return plain text
        return System.Text.RegularExpressions.Regex.Replace(
          sbHTML.ToString(), "<[^>]*>", "");
    }

This function returns:

"Hi,

My name is Gaurav Illness. Today MY relationship breakdown happened?

I am GriESh and IAm drugger.

Thanks,"

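Now I send this text to an API that detects whether there is emotion in these sentences. The API responds with all the emotional sentences. For example, the API says that "Today MY relationship breakdown happened?" is emotional. I want to mark this sentence in purple in the HTML, which means adding a span around it. To do that, I have to find the start and end index of this sentence in the HTML code.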

How can I find the start and end index of this sentence in the HTML code?

I have code that gives me the indexes, but I don't think it is the best approach. Can anyone suggest a better way? Here is my code sample:

     public static void findTextInHtml(string htmlCode)
    {
        string textToBeFind = "I am GriESh and IAm drugger.";
        int i = 0;
        int j = 0;
        int startIndex = 0;
        int endIndex = 0;
        bool isHtml = false;
        bool isbeingMatched = false;
        while (i < htmlCode.Length)
        {
            if (htmlCode[i] == '<')
            {
                isHtml = true;
                i++;
                continue;
            }
            if (htmlCode[i] == '>')
            {
                isHtml = false;
                i++;
                continue;
            }
            if (isHtml)
            {
                i++;
                continue;
            }
            if (textToBeFind[j] == htmlCode[i])
            {
                if (!isbeingMatched)
                {
                    startIndex = i;
                }
                isbeingMatched = true;
                j++;
                if (j == textToBeFind.Length)
                {
                    endIndex = i;
                    break;
                }
            }
            else
            {
                isbeingMatched = false;
                j = 0;
            }
            i++;
        }
        AddStartSpan(startIndex, htmlCode);
        AddEndSpan(endIndex, htmlCode);
    }
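
AddStartSpan and AddEndSpan are not shown in the question. As a minimal sketch of that step, assuming endIndex is the index of the last matched character (as in the loop above), the helper below inserts the closing tag first so that startIndex is not shifted, and returns a new string because C# strings are immutable. SpanWrapper and WrapInSpan are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper; not part of the original code.
public static class SpanWrapper
{
    // Wraps html[startIndex..endIndex] (both inclusive) in a styled span.
    public static string WrapInSpan(string html, int startIndex, int endIndex, string style)
    {
        if (startIndex < 0 || endIndex < startIndex || endIndex >= html.Length)
            throw new ArgumentOutOfRangeException(nameof(startIndex));

        // Insert the closing tag first: inserting at the higher index
        // leaves the lower index (startIndex) unchanged.
        string withClose = html.Insert(endIndex + 1, "</span>");
        return withClose.Insert(startIndex, "<span style=\"" + style + "\">");
    }
}
```

For example, WrapInSpan("<p>abc def</p>", 3, 5, "color:purple") returns <p><span style="color:purple">abc</span> def</p>.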

Answer 1

Install the HtmlAgilityPack NuGet package.
Then it is easy to parse like this:

string htmlString = "<p class=\"MsoNormal\"><span style=\"color:red\"><o:p>&nbsp;</o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p>";
// Requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(htmlString);

var inner = doc.DocumentNode.InnerText.TrimStartString("&nbsp;");
// inner = I am GriESh and IAm drugger.

To remove the leading &nbsp; from InnerText, add this extension method (it must be declared inside a static class):

public static string TrimStartString(this string input, string prefixToRemove,
    StringComparison comparisonType = StringComparison.OrdinalIgnoreCase)
{
    if (input != null && prefixToRemove != null
      && input.StartsWith(prefixToRemove, comparisonType))
    {
        return input.Substring(prefixToRemove.Length);
    }
    else return input;
}
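
The snippet above recovers the text but not the positions in the HTML source that the question asks for. One possible approach, sketched here under several assumptions, is to build the visible text while recording, for every text character, the range it occupies in the HTML; a sentence found in the text can then be mapped back to HTML offsets. HtmlTextLocator, BuildTextMap and Find are hypothetical names. The tag skipping is naive (no handling of comments, script/style content, or '>' inside attribute values), and entities are decoded with System.Net.WebUtility.HtmlDecode rather than HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HtmlTextLocator
{
    // Returns the visible text of html plus, for each text character,
    // the [Start, End) range it occupies in the HTML source.
    public static (string Text, List<(int Start, int End)> Map) BuildTextMap(string html)
    {
        var text = new StringBuilder();
        var map = new List<(int Start, int End)>();

        // Appends one text character. Runs of whitespace collapse into a
        // single space whose range grows to cover the whole run.
        void Append(char ch, int start, int end)
        {
            if (char.IsWhiteSpace(ch)) ch = ' '; // also true for U+00A0 (&nbsp;)
            if (ch == ' ' && text.Length > 0 && text[text.Length - 1] == ' ')
            {
                map[map.Count - 1] = (map[map.Count - 1].Start, end);
                return;
            }
            text.Append(ch);
            map.Add((start, end));
        }

        int i = 0;
        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                // Skip the whole tag; tags contribute no text.
                int close = html.IndexOf('>', i);
                if (close < 0) break; // unterminated tag: stop scanning
                i = close + 1;
                continue;
            }
            if (c == '&')
            {
                // Accept "&name;" or "&#123;" shaped sequences only.
                int k = i + 1;
                while (k < html.Length && (char.IsLetterOrDigit(html[k]) || html[k] == '#')) k++;
                if (k < html.Length && html[k] == ';' && k > i + 1)
                {
                    string entity = html.Substring(i, k - i + 1);
                    string decoded = WebUtility.HtmlDecode(entity);
                    if (decoded != entity) // unknown entities fall through as literal text
                    {
                        foreach (char d in decoded) Append(d, i, k + 1);
                        i = k + 1;
                        continue;
                    }
                }
            }
            Append(c, i, i + 1);
            i++;
        }
        return (text.ToString(), map);
    }

    // Finds sentence in the visible text and returns the [Start, End)
    // range of the first match in the HTML source, or null if absent.
    public static (int Start, int End)? Find(string html, string sentence)
    {
        var (text, map) = BuildTextMap(html);
        string needle = Regex.Replace(sentence, @"\s+", " ").Trim();
        if (needle.Length == 0) return null;
        int pos = text.IndexOf(needle, StringComparison.Ordinal);
        if (pos < 0) return null;
        return (map[pos].Start, map[pos + needle.Length - 1].End);
    }
}
```

Note that the returned range can cross element boundaries (for instance, it may include an opening <span> whose closing tag lies outside the range), so wrapping the whole range in a single span can produce improperly nested markup. Wrapping each text run inside the range separately is likely safer.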
