Capturing the rel, type, and href of links in C#

I have a string that should contain a list of items in the form <link rel="{0}" type="{1}" href="{2}">, where {0}, {1}, and {2} are strings, and I want to extract them.

I want to do this as part of an HTML parsing problem, and I have heard that parsing HTML with regular expressions is bad.

I am not even sure how to do this with regular expressions.

This is as far as I got:

string format = "<link rel=\".*\" type=\".*\" href=\".*\">";
Regex reg = new Regex(format);
MatchCollection matches = reg.Matches(input, 0);
foreach (Match match in matches)
{
    string rel = string.Empty;
    string type = string.Empty;
    string href = string.Empty;
    // not sure what to do here to get these values for each from the match
}

That was before my research suggested I might be completely on the wrong track using regular expressions.

How would you do this either with the method I chose or with an HTML parser?

Use the Html Agility Pack library to parse the HTML.

You'd be better off using a real HTML parser like the Html Agility Pack, which is available as the HtmlAgilityPack NuGet package.

The main reason not to use regular expressions for HTML parsing is that the HTML might not be well-formed (which is very often the case), and that can break your regular expression.

You would then use XPath to get the nodes you need and load them into variables.

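// Requires the HtmlAgilityPack package.
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(pageMarkup);

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//link");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // GetAttributeValue returns the supplied default when the attribute is missing.
        string rel = node.GetAttributeValue("rel", string.Empty);
        string type = node.GetAttributeValue("type", string.Empty);
        string href = node.GetAttributeValue("href", string.Empty);
    }
}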
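
The question also asked how the regex attempt could be finished. If the input is known to be simple and consistent, something like the following might work. It is only a sketch under strong assumptions: every tag has exactly the attributes rel, type, href in that order, separated by single spaces, with double-quoted values. The names LinkExtractor and Extract are hypothetical and chosen for illustration. The two changes from the original pattern are named capture groups, which are read through match.Groups, and [^"]* in place of the greedy .*, which would otherwise run past the closing quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Named groups capture each attribute value. [^"]* stops at the next
    // double quote, so a match cannot spill over into the following attribute.
    // Assumption: attributes appear in the fixed order rel, type, href.
    private static readonly Regex LinkPattern = new Regex(
        "<link rel=\"(?<rel>[^\"]*)\" type=\"(?<type>[^\"]*)\" href=\"(?<href>[^\"]*)\"\\s*/?>",
        RegexOptions.IgnoreCase);

    // Returns one (Rel, Type, Href) tuple per <link> tag found in the input.
    public static List<(string Rel, string Type, string Href)> Extract(string input)
    {
        var results = new List<(string Rel, string Type, string Href)>();
        foreach (Match match in LinkPattern.Matches(input))
        {
            results.Add((
                match.Groups["rel"].Value,
                match.Groups["type"].Value,
                match.Groups["href"].Value));
        }
        return results;
    }
}
```

Calling LinkExtractor.Extract(input) and iterating over the result gives the three values for each tag. Tags with reordered attributes, single quotes, or extra attributes are silently skipped, which is the kind of fragility that makes the parser approach above the safer choice for real pages.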
