
Match nested HTML tags

In a C# app, I want to match every HTML "font" tag that has a "color" attribute.

I have the following text:

1<font color="red">2<font color="blue">3</font>4</font>56

And I want a MatchCollection containing the following items:

[0] <font color="red">234</font>
[1] <font color="blue">3</font>

But when I use this code:

Regex.Matches(result, "<font color=\"(.*)\">(.*)</font>");

The MatchCollection I get is the following one:

[0] <font color="red">2<font color="blue">3</font>4</font>

How can I get the MatchCollection I want using C#?

Thanks.

Regex on "HTML" is an antipattern. Just don't do it.

To steer you onto the right path, look at what you can do with HTML Agility Pack:

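// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"1<font color=""red"">2<font color=""blue"">3</font>4</font>56");
// Every <font> element in the document, including nested ones.
var fontElements = doc.DocumentNode.Descendants("font");
// Clone each element and replace its content with its plain text, which drops the nested tags.
var newNodes = fontElements.Select(fe => {
    var newNode = fe.Clone();
    newNode.InnerHtml = fe.InnerText;
    return newNode;
});
// Serialize each flattened clone back to markup.
var collection = newNodes.Select(n => n.OuterHtml);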

Now, in collection we have the following strings:

<font color="red">234</font> 
<font color="blue">3</font> 

mmm... lovely.

MatchCollection m = Regex.Matches(result, "<font color=\"(.*?)\">(.*?)</font>");
// Add a ? after each * to make the quantifiers lazy, then print the matches to see what you get.
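
For anyone trying the lazy version, here is a minimal sketch of what it returns for the sample input from the question. The variable names (input, lazyMatches) are mine, not from the answer above. Note that a lazy group stops at the first closing tag it finds, so the nested tag is swallowed into a single match rather than reported on its own:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string input = "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";

// Lazy quantifiers: each group ends at the earliest position that lets the rest of the pattern match.
string[] lazyMatches = Regex.Matches(input, "<font color=\"(.*?)\">(.*?)</font>")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

foreach (string value in lazyMatches)
{
    Console.WriteLine(value);
}
// prints one line: <font color="red">2<font color="blue">3</font>
```

The scan resumes after the first </font>, where no opening font tag remains, so the inner tag never gets its own match. That is why the parser-based answers are the safer route for nested markup.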

One way, using HTML Agility Pack and an XPath query to ensure that the color attribute is present:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
String html = "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";
htmlDoc.LoadHtml(html);
// Select only <font> elements that carry a color attribute.
HtmlNodeCollection fontTags = htmlDoc.DocumentNode.SelectNodes(".//font[@color]");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (fontTags != null)
{
    foreach (HtmlNode fontTag in fontTags)
    {
        Console.WriteLine(fontTag.InnerText);
    }
}
