Parsing sections of HTML in C#

I need to parse sections from a string of HTML. For example:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>

Parsing the quote section should return:

<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>

Currently I'm using a regular expression to grab the content inside [section=quote]...[/section], but since the sections are entered using a WYSIWYG editor, the section tags themselves get wrapped in a paragraph tag, so the parsed result is:

</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>

The regular expression I'm currently using is:

\[section=(.+?)\](.+?)\[/section\]

I'm also doing some additional cleanup prior to parsing the sections:

protected string CleanHtml(string input) {
    // remove whitespace
    input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    // remove empty p elements
    input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
    return input;
}

Can anyone provide a regular expression that would achieve what I am looking for, or am I wasting my time trying to do this with Regex? I've seen references to the Html Agility Pack. Would that be better for something like this?

[Update]

Thanks to Oscar, I have used a combination of the Html Agility Pack and Regex to parse the sections. It still needs a bit of refining, but it's nearly there.

public void ParseSections(string content)
{
    this.SourceContent = content;
    this.NonSectionedContent = content;

    content = CleanHtml(content);

    if (!sectionRegex.IsMatch(content))
        return;

    var doc = new HtmlDocument();
    doc.LoadHtml(content);

    bool flag = false;
    string sectionName = string.Empty;
    var sectionContent = new StringBuilder();
    var unsectioned = new StringBuilder();

    foreach (var n in doc.DocumentNode.SelectNodes("//p")) {               
        if (startSectionRegex.IsMatch(n.InnerText)) { 
            flag = true;
            sectionName = startSectionRegex.Match(n.InnerText).Groups[1].Value.ToLowerInvariant();
            continue;
        }
        if (endSectionRegex.IsMatch(n.InnerText)) {
            flag = false;
            this.Sections.Add(sectionName, sectionContent.ToString());
            sectionContent.Clear();
            continue;
        }

        if (flag)
            sectionContent.Append(n.OuterHtml);
        else
            unsectioned.Append(n.OuterHtml);
    }

    this.NonSectionedContent = unsectioned.ToString();
}

The following works, using the HtmlAgilityPack library:

using HtmlAgilityPack;

...

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");


bool flag = false;
var sb = new StringBuilder();
foreach (var n in doc.DocumentNode.SelectNodes("//p"))
{
    switch (n.InnerText)
    {
        case "[section=quote]":
            flag = true;
            continue;
        case "[/section]":
            flag = false;
            break;
    }
    if (flag)
    {
        sb.AppendLine(n.OuterHtml);
    }
}

Console.Write(sb);
Console.ReadLine();

If you just want to print Mauris at turpis nec dolor bibendum sollicitudin ac quis neque. without the surrounding <p>...</p>, you can replace n.OuterHtml with n.InnerHtml.

Of course, you should check whether doc.DocumentNode.SelectNodes("//p") is null.
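
A minimal sketch of that null check, assuming the HTML is already in a string rather than a file. The class name QuoteSection and the method name Extract are made up for illustration; SelectNodes returns null (not an empty collection) when nothing matches, so the guard has to come before the loop.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class QuoteSection
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return string.Empty;

        bool inSection = false;
        var sb = new StringBuilder();
        foreach (var p in paragraphs)
        {
            // Trim so stray whitespace around the marker does not break the match.
            string text = p.InnerText.Trim();
            if (text == "[section=quote]") { inSection = true; continue; }
            if (text == "[/section]") { inSection = false; continue; }

            if (inSection)
                sb.Append(p.OuterHtml);
        }
        return sb.ToString();
    }
}
```

With this guard, input that contains no <p> elements at all yields an empty string instead of a NullReferenceException.
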
If you want to load the HTML from an online source instead of a file, you can do:

var htmlWeb = new HtmlWeb();  
var doc = htmlWeb.Load("http://..../page.html");

Edit:

If [section=quote] and [/section] could be inside any tag (not always <p>), you can replace doc.DocumentNode.SelectNodes("//p") with doc.DocumentNode.SelectNodes("//*").

How about replacing

<p>[section=quote]</p>

with

[section=quote]

and

<p>[/section]</p>

with

[/section]

as part of your cleanup? Then you can use your existing regular expression.
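
A rough sketch of that cleanup step, using only System.Text.RegularExpressions. The class and method names (SectionCleanup, UnwrapMarkers, ExtractSection) are invented for illustration. It assumes the editor wraps each marker in its own <p> element, and it adds RegexOptions.Singleline to the original section pattern so the content may span line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the "unwrap the markers first" idea.
public static class SectionCleanup
{
    // A <p> element whose only content is a section marker,
    // e.g. <p>[section=quote]</p> or <p> [/section] </p>.
    private static readonly Regex WrappedMarker = new Regex(
        @"<p(?:\s[^>]*)?>\s*(\[section=[^\]]+\]|\[/section\])\s*</p>",
        RegexOptions.IgnoreCase);

    // The pattern from the question; Singleline lets '.' match newlines.
    private static readonly Regex Section = new Regex(
        @"\[section=(.+?)\](.+?)\[/section\]",
        RegexOptions.Singleline);

    // Replaces each wrapped marker with the bare marker text.
    public static string UnwrapMarkers(string html)
    {
        return WrappedMarker.Replace(html, "$1");
    }

    // Returns the content of the first section with the given name,
    // or null when no such section exists.
    public static string ExtractSection(string html, string name)
    {
        string unwrapped = UnwrapMarkers(html);
        foreach (Match m in Section.Matches(unwrapped))
        {
            if (string.Equals(m.Groups[1].Value, name, StringComparison.OrdinalIgnoreCase))
                return m.Groups[2].Value.Trim();
        }
        return null;
    }
}
```

Calling SectionCleanup.ExtractSection(html, "quote") on the example from the question should then give back only the inner paragraph, without the stray </p> and <p> fragments. Nested sections are not handled by this sketch.
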
