How do I convert HTML to text in C#?

Question

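I'm looking for C# code that converts an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that outputs plain text while reasonably preserving the original layout.

The output should look like this:

I have looked at the HTML Agility Pack, but I don't think it is what I need. Does anyone have other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that converts HTML to text)! All it does is strip the tags, flatten the tables, and so on. The output looks nothing like what Html2Txt @ W3C produces. Too bad that source doesn't seem to be available. I was looking to see whether a more "canned" solution is available.

EDIT 2: Thank you all for the suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all of this in a C# class. This code will only be called occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is FAST!!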
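The approach described in EDIT 2 could be wrapped roughly as follows. This is only a minimal sketch, not the asker's actual class: the names LynxTextDumper, CreateStartInfo and Dump are hypothetical, the caller is assumed to supply the path to lynx.exe and a file path or URL that contains no double quotes, and only the "-dump" switch mentioned above is used.

```csharp
using System;
using System.Diagnostics;

public static class LynxTextDumper
{
    // Builds the start info for: lynx -dump "<source>"
    // lynxPath and source are supplied by the caller (hypothetical parameters).
    public static ProcessStartInfo CreateStartInfo(string lynxPath, string source)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + source + "\"",
            UseShellExecute = false,        // required in order to redirect streams
            RedirectStandardOutput = true,  // lets us read what lynx prints
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns everything it wrote to standard output.
    public static string Dump(string lynxPath, string source)
    {
        using (Process process = Process.Start(CreateStartInfo(lynxPath, source)))
        {
            // Read before WaitForExit so a full output pipe cannot block lynx.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call might look like LynxTextDumper.Dump(@"C:\tools\lynx.exe", "page.html"), where both paths are placeholders.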

Answer 1

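Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, pointed out by others on this question, and this is not one of them (it cannot even handle tables in its current form). It is lightweight and fast, though, which is all I wanted for creating a simple text version of HTML emails.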

using System; // needed for StringComparison
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{

    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc (HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html= html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
                            {
                                endElementString =  "<" + href + ">";
                            }  
                        }
                        isInline = true;
                        break;
                    case "li": 
                        if(textInfo.ListIndex>0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++); 
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol": 
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}
internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace {get;set;}
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}
internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper{ Value = boolWrapper };
    }
}

For example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be converted to:

Whatever Inc. 


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: 

1.  Please confirm this is your email by replying. 
2.  Then perform this step. 

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please: 

*   a point. 
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>. 

Sincerely, 

The whatever.com team 


Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.


            Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                Please confirm this is your email by replying.

                Then perform this step.


            Please solve this . Then, in any order, could you please:

                a point.

                another point, with a hyperlink.


            Sincerely,


            The whatever.com team

        Ph: 000 000 000
        mail: whatever st

Answer 2

You can use this:

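// Requires: using System.Text.RegularExpressions; using System.Web;
// Removes anything that looks like a tag, then optionally decodes HTML entities.
public static string StripHTML(string HTMLText, bool decode = true)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}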

Update

Thanks for the comments; I have updated the function to improve it.

Answer 3

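I've heard from a reliable source that, if you are doing HTML parsing in .NET, you should take another look at the HTML Agility Pack.

http://www.codeplex.com/htmlagilitypack

Some examples on SO:

HTML Agility Pack - Parsing tables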

Answer 4

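What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.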

Answer 5

Assuming you have well-formed HTML, you could also try an XSL transform.

Here's an example:

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    private static XslCompiledTransform _transformer;
    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                // method="text" emits the text nodes unescaped; method="html" would write "&" as "&amp;".
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""text"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}

Answer 6

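Because I wanted conversion to plain text with LF and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks big, but it works fine.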

Answer 7

I had some decoding problems with HtmlAgility and I didn't want to spend time investigating them.

Instead, I used this utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

Answer 8

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but it's open source.

Answer 9

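The easiest would probably be tag stripping combined with replacing some tags with text layout elements, such as dashes for list elements (li) and line breaks for br and p. It shouldn't be too hard to extend this to tables.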
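The idea above can be sketched with a handful of regular expressions. This is an illustrative sketch only, not a robust converter: the class name SimpleHtmlToText and its Convert method are hypothetical, tables are not handled, and regex-based tag matching is assumed to be acceptable for the input (it breaks on, for example, attribute values that contain ">").

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        string text = html;
        // Drop script and style blocks together with their content.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Whitespace in the HTML source is not significant: collapse it to single spaces.
        text = Regex.Replace(text, @"\s+", " ");
        // <br> becomes a line break.
        text = Regex.Replace(text, @"<br\b[^>]*>", "\n", RegexOptions.IgnoreCase);
        // Each list item starts on a new line with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);
        // The end of a paragraph becomes a blank line.
        text = Regex.Replace(text, @"</p\s*>", "\n\n", RegexOptions.IgnoreCase);
        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);
        // Tidy up: no spaces around line breaks, at most one blank line in a row.
        text = Regex.Replace(text, @" *\n *", "\n");
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

For the input `<p>Hello <b>world</b></p><ul><li>one</li><li>two &amp; three</li></ul>` this returns "Hello world", a blank line, then "- one" and "- two & three" on separate lines.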

Answer 10

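This function converts "what you see in the browser" to plain text with line breaks. (If you want to view the result in a browser, just use the commented-out return value.)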

public string HtmlFileToText(string filePath)
{
    using (var browser = new WebBrowser())
    {
        string text = File.ReadAllText(filePath);
        browser.ScriptErrorsSuppressed = true;
        browser.Navigate("about:blank");
        browser?.Document?.OpenNew(false);
        browser?.Document?.Write(text);
        return browser.Document?.Body?.InnerText;
        //return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
    }   
}

Answer 11

Here is a short answer using HtmlAgilityPack. You can run it in LinqPad.

var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;

I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.

Answer 12

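Another post suggests the HTML Agility Pack:

It is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).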

Answer 13

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

Answer 14

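I recently blogged about a solution that worked for me, which uses a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.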

Answer 15

Try the easy and practical way: just call StripHTML(WebBrowserControl_name);

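// Assumes the WinForms WebBrowser control and a reference to Microsoft.mshtml (using mshtml;).
public string StripHTML(WebBrowser webp)
{
    try
    {
        // Get the underlying MSHTML document from the control.
        IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;

        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception)
    {
        // Best effort: fall through and return an empty string on any failure.
    }
    return "";
}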

Answer 16

I don't know C#, but here is a fairly small and easy-to-read Python html2txt script: http://www.aaronsw.com/2002/html2text/

Answer 17

If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.

Documented on MSDN: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
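As a small illustration of what this call does, here is a minimal sketch; the wrapper class HtmlDecodeExample is hypothetical and exists only to make the snippet self-contained. Note that HtmlDecode only turns entities back into characters. It does not remove tags or preserve layout, so on its own it does not cover what the question asks for.

```csharp
using System;
using System.Net;

public static class HtmlDecodeExample
{
    // Decodes HTML entities such as &amp; and &lt; back into plain characters.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, HtmlDecodeExample.Decode("Fish &amp; chips &lt;b&gt;") returns "Fish & chips <b>".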

Answer 18

In Genexus you can do it with Regex:

&pattern = '<[^>]+>'

&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")


Answer 19

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event fires...

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

Answer 20

Here is another solution for converting HTML to text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

20 answers

Answer 1: 49, 2014-08-07 09:21:24
Answer 2: 35, 2009-04-08 22:20:01
Answer 3: 17, 2009-04-08 20:33:14
Answer 4: 12, accepted, 2009-04-08 20:26:23
Answer 5: 3, 2012-06-07 02:09:20
Answer 6: 3, 2012-10-04 16:06:01
Answer 7: 3, 2015-10-18 07:43:36
Answer 8: 3, 2009-04-08 20:27:48
Answer 9: 2, 2009-04-08 20:27:28
Answer 10: 0, 2019-01-11 14:20:24
Answer 11: 0, 2020-05-14 17:01:00
Answer 12: 0, 2009-04-08 20:36:18
Answer 13: 0, 2009-04-08 22:20:25
Answer 14: -1, 2009-11-05 21:01:25
Answer 15: -1, 2015-06-16 12:11:20
Answer 16: -1, 2009-04-08 20:28:40
Answer 17: -2, 2014-10-25 23:23:55
Answer 18: -2, 2011-06-01 14:47:03
Answer 19: -3, 2010-09-22 06:52:37
Answer 20: -4, 2009-09-29 11:23:41
