Get “Title” attribute from html link using Regex

Question

I have the following regex to match all link tags on a page generated by our custom CMS:

<a\s+((?:(?:\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|"[^"]*"|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?)>.+?</a>

We are using C# to loop through all matches of this and add an onclick event to each link (for tracking software) before rendering the page content. I need to parse the link and add a parameter to the onclick function, which is the "link name".

I was going to modify the regex to capture the following subgroups:

The title attribute of the link
If the link contains an image tag get the alt text of the image
The text of the link

I can then check the match of each subgroup to acquire the relevant name of the link.

How would I modify the above regex to do this, or could I achieve the same thing using C# code?

Answer 1

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular, you may be interested in the HTMLAgilityPack answer.

Answer 2

Try this:

Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");

A couple of gotchas:

This match is case-sensitive; you may want to adjust that
This expects that the title attribute both exists and is quoted
- Of course, if the title attribute doesn't exist, you probably don't want the match anyway

To extract the value, use the Groups collection:

string title = reg.Match("<a href=\"#\" title=\"Hello\">Howdy</a>").Groups[1].Value;
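The gotchas above can be handled in a small helper. This is only a minimal sketch, assuming case-insensitive matching is wanted and that a link without a quoted title should yield an empty string; the names LinkTitleRegex and ExtractTitle are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LinkTitleRegex
{
    // Matches an <a ...> tag that carries a double-quoted title attribute;
    // group 1 captures the title value. IgnoreCase also accepts <A TITLE="...">.
    private static readonly Regex TitlePattern = new Regex(
        "<a[^>]*?title=\"([^\"]*?)\"[^>]*?>",
        RegexOptions.IgnoreCase);

    // Returns the title of the first matching link,
    // or string.Empty when no link with a quoted title is found.
    public static string ExtractTitle(string html)
    {
        Match match = TitlePattern.Match(html);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Checking Match.Success first makes the "no title" case explicit instead of relying on the empty value of an unmatched group.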

Answer 3

Thanks to Chaos. Owens for pointing me towards the HtmlAgilityPack library; it's great. In the end I used it to sort out my problem as below. I would definitely recommend this library to others.

    HtmlDocument htmldoc = new HtmlDocument();
    htmldoc.LoadHtml(content);
    HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
            //If no title attribute exists check for an image alt tag
            if (linkTitle == string.Empty)
            {
                HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
                if (imageNode != null)
                {
                    linkTitle = imageNode.GetAttributeValue("alt", string.Empty);
                }
            }
            //If no image alt tag check for span with text
            if (linkTitle == string.Empty)
            {
                HtmlNode spanNode = linkNode.SelectSingleNode("span");
                if (spanNode != null)
                {
                    linkTitle = spanNode.InnerText;
                }
            }

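            //If no span, fall back to the text of the link itself
            //(a plain-text link still has a text child node, so HasChildNodes cannot be used as the test)
            if (linkTitle == string.Empty)
            {
                linkTitle = linkNode.InnerText.Trim();
            }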

        }
    }
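The loop above works out linkTitle but stops short of the onclick step described in the question. The following is a rough sketch of that last step using only the base class library. The tracking function name trackLink is hypothetical, as are the helper names LinkTracking, PickLinkName and BuildOnClick.

```csharp
public static class LinkTracking
{
    // Returns the first candidate that is not null, empty, or whitespace
    // (trimmed), or string.Empty when every candidate is blank.
    public static string PickLinkName(params string[] candidates)
    {
        foreach (string candidate in candidates)
        {
            if (!string.IsNullOrWhiteSpace(candidate))
            {
                return candidate.Trim();
            }
        }
        return string.Empty;
    }

    // Builds the onclick value. Backslashes and single quotes are escaped so
    // the name stays inside the JavaScript string literal.
    // "trackLink" is a hypothetical tracking function, not a real API.
    public static string BuildOnClick(string linkName)
    {
        string escaped = linkName.Replace("\\", "\\\\").Replace("'", "\\'");
        return "trackLink('" + escaped + "');";
    }
}
```

Inside the foreach loop, the result could then be applied with linkNode.SetAttributeValue("onclick", LinkTracking.BuildOnClick(linkTitle)), and the modified markup read back from htmldoc.DocumentNode.OuterHtml.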

3 answers

solution1
6 ACCEPTED 2009-05-12 15:37:17

solution2
2 2009-05-12 15:41:43

solution3
0 2009-05-13 16:40:40
