I have the following regex to match all link tags on a page generated by our custom CMS:
<a\s+((?:(?:\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|"[^"]*"|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?)>.+?</a>
We are using C# to loop through all matches of this and add an onclick event to each link (for tracking software) before rendering the page content. I need to parse each link and pass its "link name" as a parameter to the onclick function.
I was going to modify the regex to get the following subgroups
I can then check the match of each subgroup to acquire the relevant name of the link.
How would I modify the above regex to do this, or could I achieve the same thing using C# code?
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
In particular, you may be interested in the HTMLAgilityPack answer.
Try this:
Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");
A couple of gotchas:
To extract the title, use the Groups collection:
reg.Match("<a href=\"#\" title=\"Hello\">Howdy</a>").Groups[1].Value
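Match never returns null; when the pattern does not apply it returns a Match whose Success is false, so it is worth checking that before reading the group. A minimal sketch, assuming the pattern above with its capture group closed by a parenthesis; the names TitleExtractor and GetTitle are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Pattern from the answer above, with the capture group closed.
    // It only handles double-quoted title values.
    private static readonly Regex TitleRegex =
        new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");

    // Returns the title attribute of the first matching tag,
    // or null when no tag with a title attribute is found.
    public static string GetTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling TitleExtractor.GetTitle("<a href=\"#\" title=\"Hello\">Howdy</a>") returns "Hello"; for a link without a title attribute it returns null, and the caller can fall back to something else.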
Thanks to Chas. Owens for pointing me towards the HtmlAgilityPack library; it's great. In the end I used it to sort out my problem as shown below. I would definitely recommend this library to others.
// Requires: using HtmlAgilityPack;
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        // First choice: the link's own title attribute
        string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
        // If no title attribute exists, check for a child image with an alt attribute
        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                linkTitle = imageNode.GetAttributeValue("alt", string.Empty);
            }
        }
        // If no image alt attribute exists, check for a child span with text
        if (linkTitle == string.Empty)
        {
            HtmlNode spanNode = linkNode.SelectSingleNode("span");
            if (spanNode != null)
            {
                linkTitle = spanNode.InnerText;
            }
        }
        // Otherwise fall back to the link's own text when it has no child nodes
        if (linkTitle == string.Empty)
        {
            if (!linkNode.HasChildNodes)
            {
                linkTitle = linkNode.InnerText;
            }
        }
    }
}
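The loop above works out linkTitle but stops before attaching the handler described in the question. A rough sketch of that last step follows. It assumes a hypothetical JavaScript function named trackLink on the page (the real name depends on the tracking software), and the LinkTracker and AddTracking names are illustrative only. SetAttributeValue and DocumentNode.OuterHtml are HtmlAgilityPack members. The escaping is deliberately simple, and titles containing double quotes would need further care.

```csharp
using HtmlAgilityPack;

public static class LinkTracker
{
    // Hypothetical tracking function name; not part of the original post.
    private const string TrackFunction = "trackLink";

    // Returns the HTML with an onclick attribute added to every <a href>.
    public static string AddTracking(string content)
    {
        HtmlDocument htmldoc = new HtmlDocument();
        htmldoc.LoadHtml(content);

        HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
        if (linkNodes == null)
        {
            return content; // no links, nothing to change
        }

        foreach (HtmlNode linkNode in linkNodes)
        {
            // Only the title attribute is used here; the image alt and
            // span fallbacks from the code above would slot in at this point.
            string linkTitle = linkNode.GetAttributeValue("title", string.Empty);

            // Escape backslashes and single quotes for a JS string literal.
            string jsName = linkTitle.Replace("\\", "\\\\").Replace("'", "\\'");

            linkNode.SetAttributeValue("onclick", TrackFunction + "('" + jsName + "');");
        }

        return htmldoc.DocumentNode.OuterHtml;
    }
}
```

The returned string can then be rendered in place of the original page content.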