Capturing the rel, type and href of links in C#

Question

I have a string that should contain a list of items in the form <link rel="{0}" type="{1}" href="{2}">, where {0}, {1}, and {2} are strings, and I want to extract them.

I want to do this as part of an HTML parsing problem, and I have heard that parsing HTML with regular expressions is a bad idea.

I am not even sure how to do this with regular expressions.

This is as far as I got:

string format = "<link rel=\".*\" type=\".*\" href=\".*\">";
Regex reg = new Regex(format);
MatchCollection matches = reg.Matches(input, 0);
foreach (Match match in matches)
{
    string rel = string.Empty;
    string type = string.Empty;
    string href = string.Empty;
    // not sure what to do here to get these values for each from the match
}

That was before my research suggested I might be completely on the wrong track using regular expressions.

How would you do this, either with the method I chose or with an HTML parser?

Answer 1

Use the Html Agility Pack library to parse the HTML.

Answer 2

You'd be better off using a real HTML parser like the Html Agility Pack.

The main reason not to use regular expressions for HTML parsing is that the HTML might not be well-formed (which is almost always the case), and that can break your regular expression.
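To illustrate that fragility, here is a minimal sketch of the regular expression route using named capture groups, which is also how the values could be read from each match. The class name LinkRegexDemo, the method Extract, and the sample markup are made up for this example. It only works when the attributes appear in exactly the order rel, type, href.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkRegexDemo
{
    // [^"]* instead of .* keeps each group from running past its closing quote.
    private static readonly Regex LinkPattern = new Regex(
        "<link rel=\"(?<rel>[^\"]*)\" type=\"(?<type>[^\"]*)\" href=\"(?<href>[^\"]*)\">");

    // Hypothetical helper: returns one tuple per tag the pattern matches.
    public static List<(string Rel, string Type, string Href)> Extract(string input)
    {
        var results = new List<(string Rel, string Type, string Href)>();
        foreach (Match match in LinkPattern.Matches(input))
        {
            // Named groups are read back through match.Groups["name"].Value.
            results.Add((match.Groups["rel"].Value,
                         match.Groups["type"].Value,
                         match.Groups["href"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // The second tag is valid HTML but lists its attributes in another order.
        string input =
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"site.css\">" +
            "<link href=\"feed.xml\" rel=\"alternate\" type=\"application/rss+xml\">";

        foreach (var link in Extract(input))
        {
            Console.WriteLine($"{link.Rel} | {link.Type} | {link.Href}");
        }
        // Prints one line: stylesheet | text/css | site.css
    }
}
```

The second tag is silently skipped because its attributes are reordered. Single quotes, extra whitespace, or a self-closing "/>" would be missed the same way, which is the kind of breakage a parser avoids.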

You would then use XPath to get the nodes you need and load them into variables.

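using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(pageMarkup);
// SelectNodes returns null when no node matches the XPath expression.
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//link");
string rel = string.Empty;

if (nodes != null && nodes[0].Attributes["rel"] != null)
{
    // Attributes["rel"] is an HtmlAttribute; read its Value to get the string.
    rel = nodes[0].Attributes["rel"].Value;
}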
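The same idea can be extended to every link element and to all three attributes. The following is a sketch only: it assumes the HtmlAgilityPack NuGet package is referenced, and the class name LinkAttributeDemo, the method Extract, and the sample markup are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkAttributeDemo
{
    // Hypothetical helper: collects rel, type and href from every <link> element.
    public static List<(string Rel, string Type, string Href)> Extract(string pageMarkup)
    {
        var results = new List<(string Rel, string Type, string Href)>();

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(pageMarkup);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//link");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue falls back to the given default when the
            // attribute is absent, so no explicit null checks are needed.
            results.Add((node.GetAttributeValue("rel", string.Empty),
                         node.GetAttributeValue("type", string.Empty),
                         node.GetAttributeValue("href", string.Empty)));
        }
        return results;
    }

    public static void Main()
    {
        // Attribute order varies and the second tag has no type attribute.
        string pageMarkup =
            "<html><head>" +
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"site.css\">" +
            "<link href=\"feed.xml\" rel=\"alternate\">" +
            "</head><body></body></html>";

        foreach (var link in Extract(pageMarkup))
        {
            Console.WriteLine($"rel={link.Rel} type={link.Type} href={link.Href}");
        }
    }
}
```

Because the parser reads attributes by name, attribute order and missing attributes do not matter here, unlike with a fixed regular expression.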

2 answers

solution1
1 2009-06-18 18:59:31

solution2
0 ACCEPTED 2009-06-18 19:12:46
