
How to get part of the HTML from an HTML string in C#?

In my program I have a string variable named content, to which I have assigned a small HTML document. For example:

String content = "<HTML> <HEAD> <TITLE>Your Title Here</TITLE></HEAD> <BODY><H2>This is a Medium Header Send me mail at<a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>.This is a new sentence without a paragraph break.</H2></BODY></HTML>";

From this I want to get only "This is a Medium Header Send me mail at support@yourcompany.com.This is a new sentence without a paragraph break."

The string sits inside the tag. How can I get it using C#?

Don't use string methods or regex to parse HTML. You can use HtmlAgilityPack.

string content = "<HTML> <HEAD> <TITLE>Your Title Here</TITLE></HEAD> <BODY><H2>This is a Medium Header Send me mail at<a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>.This is a new sentence without a paragraph break.</H2></BODY></HTML>";

// requires the HtmlAgilityPack NuGet package and "using System.Linq;" (for First())
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
string headerText = doc.DocumentNode.Descendants("H2").First().InnerText;

Result:

This is a Medium Header Send me mail atsupport@yourcompany.com.This is a new sentence without a paragraph break.
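
Note that `First()` throws if the document contains no `<h2>` element, and `InnerText` leaves entities such as `&amp;` undecoded. A minimal sketch of a more defensive variant follows, assuming the HtmlAgilityPack NuGet package is referenced. The names `HeaderTextExtractor` and `GetFirstHeaderText` are hypothetical and chosen only for illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HeaderTextExtractor // hypothetical helper name
{
    // Returns the decoded text of the first <h2>, or null when there is none.
    public static string GetFirstHeaderText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath expression matches nothing.
        HtmlNode header = doc.DocumentNode.SelectSingleNode("//h2");
        if (header == null)
        {
            return null;
        }

        // InnerText keeps entities as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(header.InnerText).Trim();
    }
}
```

Calling `HeaderTextExtractor.GetFirstHeaderText(content)` on the string above should give the same text as the result shown, while input without an `<h2>` yields `null` instead of an exception.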

Complete sample

HtmlFormatHelper.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace Tools
{
    /// <summary>
    /// Utilities for formatting HTML text.
    /// </summary>
    public static class HtmlFormatHelper
    {
        private static Regex _regexLineBreak;
        private static Regex _regexStripFormatting;
        private static Regex _regexTagWhiteSpace;
        private static Regex _regexHyperlink;

        /// <summary>
        /// Static constructor: compiles the regular expressions once.
        /// </summary>
        static HtmlFormatHelper()
        {
            _regexLineBreak = new Regex(@"<(br|BR|p|P)\s{0,1}\/{0,1}>\s*|</[pP]>", RegexOptions.Singleline);
            _regexStripFormatting = new Regex(@"<[^>]*(>|$)", RegexOptions.Singleline);
            _regexTagWhiteSpace = new Regex(@"(>|$)\s+<", RegexOptions.Singleline);
            _regexHyperlink = new Regex(@"<a\s+[^>]*href\s*=\s*[""']?([^""'>]+)[""']?[^>]*>([^<]+)</a>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
        }

        /// <summary>
        /// Converts HTML to plain text.
        /// </summary>
        /// <param name="html">The HTML to convert.</param>
        /// <returns>The plain-text representation of the HTML.</returns>
        public static string HtmlToPlainText(string html)
        {
            var text = html;

            text = _regexTagWhiteSpace.Replace(text, "><");
            text = _regexLineBreak.Replace(text, Environment.NewLine);
            text = _regexStripFormatting.Replace(text, string.Empty);
            // decode entities last, so escaped text such as &lt;b&gt; is not stripped as a tag
            text = System.Net.WebUtility.HtmlDecode(text);

            return text;
        }

        /// <summary>
        /// Converts HTML to plain text with "smart" formatting of hyperlinks.
        /// </summary>
        /// <param name="html">The HTML to convert.</param>
        /// <returns>The plain text, with link URLs appended in parentheses.</returns>
        public static string HtmlToPlainTextSmart(string html)
        {
            // process hyperlinks
            html = _regexHyperlink.Replace(html, e =>
            {
                string url = e.Groups[1].Value.Trim();
                string text = e.Groups[2].Value.Trim();

                if (url.Length == 0 || string.Equals(url, text, StringComparison.InvariantCultureIgnoreCase))
                {
                    // the link text and URL are identical, or the URL is missing
                    return e.Value;
                }
                else
                {
                    // the link text and URL differ
                    return string.Format("{0} ({1})", text, url);
                }
            });

            return HtmlToPlainText(html);
        }

        /// <summary>
        /// Encodes HTML in "soft" mode: only angle brackets are escaped.
        /// </summary>
        /// <param name="html">The HTML to encode.</param>
        /// <returns>The encoded string, or null if the input is null.</returns>
        public static string SoftHtmlEncode(string html)
        {
            if (html == null)
            {
                return null;
            }
            else
            {
                StringBuilder sb = new StringBuilder(html.Length);

                foreach (char c in html)
                {
                    if (c == '<')
                    {
                        sb.Append("&lt;");
                    }
                    else if (c == '>')
                    {
                        sb.Append("&gt;");
                    }
                    else
                    {
                        sb.Append(c);
                    }
                }

                return sb.ToString();
            }
        }
    }
}

How to use:

// input string
string content = "<HTML> <HEAD> <TITLE>Your Title Here</TITLE></HEAD> <BODY><H2>This is a Medium Header Send me mail at<a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>.This is a new sentence without a paragraph break.</H2></BODY></HTML>";

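// extract html body (requires "using System.Text.RegularExpressions;");
// Singleline lets the body span several lines, and [^>]* tolerates attributes on <body>
string htmlBody = Regex.Match(content, @"<body[^>]*>(.*)</body>", RegexOptions.IgnoreCase | RegexOptions.Singleline).Groups[1].Value;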

// plain text
string plainText = Tools.HtmlFormatHelper.HtmlToPlainText(htmlBody);
//: This is a Medium Header Send me mail atsupport@yourcompany.com.This is a new sentence without a paragraph break.

// plain text (with url in brackets)
string plainTextSmart = Tools.HtmlFormatHelper.HtmlToPlainTextSmart(htmlBody);
//: This is a Medium Header Send me mail atsupport@yourcompany.com (mailto:support@yourcompany.com).This is a new sentence without a paragraph break.
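
The usage above does not exercise SoftHtmlEncode. The sketch below shows how its "soft" encoding differs from the framework's `System.Net.WebUtility.HtmlEncode`. It uses a standalone copy of the method's logic so that it compiles on its own; the class name `SoftEncodeDemo` and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Net;
using System.Text;

public static class SoftEncodeDemo // hypothetical name, for illustration only
{
    // Standalone copy of the SoftHtmlEncode logic: escapes only '<' and '>'.
    public static string SoftHtmlEncode(string html)
    {
        if (html == null)
        {
            return null;
        }

        var sb = new StringBuilder(html.Length);
        foreach (char c in html)
        {
            if (c == '<') sb.Append("&lt;");
            else if (c == '>') sb.Append("&gt;");
            else sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string input = "<b>Tom & Jerry</b>";

        // Soft mode leaves '&' untouched.
        Console.WriteLine(SoftHtmlEncode(input));        // &lt;b&gt;Tom & Jerry&lt;/b&gt;

        // The framework encoder also escapes '&'.
        Console.WriteLine(WebUtility.HtmlEncode(input)); // &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
    }
}
```

Because soft mode leaves `&` and quotes untouched, it keeps existing entities intact, but it is not a substitute for full encoding when the text ends up inside an attribute value.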
