
Get “Title” attribute from html link using Regex

I have the following regex to match all link tags on a page generated by our custom CMS:

<a\s+((?:(?:\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|"[^"]*"|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?)>.+?</a>

We are using C# to loop through all matches of this and add an onclick event to each link (for tracking software) before rendering the page content. I need to parse the link and add a parameter to the onclick function, which is the "link name".

I was going to modify the regex to get the following subgroups:

  • The title attribute of the link
  • If the link contains an image tag get the alt text of the image
  • The text of the link

I can then check the match of each subgroup to acquire the relevant name of the link.

How would I modify the above regex to do this, or could I achieve the same thing using C# code?

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular, you may be interested in the HtmlAgilityPack answer.

Try this:

Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");

A couple of gotchas:

  • This match is case-sensitive; you may want to adjust that (for example with RegexOptions.IgnoreCase)
  • This expects that the title attribute both exists and is double-quoted
    • Of course, if the title attribute doesn't exist, you probably don't want the match anyway

To extract the value, use the Groups collection:

string title = reg.Match("<a href=\"#\" title=\"Hello\">Howdy</a>").Groups[1].Value;
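The gotchas above can be softened a little. What follows is only a minimal sketch, not a robust HTML parser: it assumes the title value is quoted, accepts either single or double quotes, and ignores case. The class name TitleExtractor and the method GetTitle are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // \b after "<a" keeps tags such as <abbr> from matching.
    // The named group "title" is filled by whichever quote style matched.
    private static readonly Regex TitleRegex = new Regex(
        "<a\\b[^>]*?\\btitle\\s*=\\s*(?:\"(?<title>[^\"]*)\"|'(?<title>[^']*)')[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the title value, or an empty string when no quoted title is found.
    public static string GetTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? m.Groups["title"].Value : string.Empty;
    }
}
```

Calling TitleExtractor.GetTitle on the sample link above should give "Hello". Unquoted values and attributes such as data-title are still not handled correctly, which is one more reason to prefer a real parser.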

Thanks to Chaos. Owens for pointing me towards the HtmlAgilityPack library; it's great. In the end I used it to solve my problem, as shown below. I would definitely recommend this library to others.

    // Requires the HtmlAgilityPack library (using HtmlAgilityPack;)
    HtmlDocument htmldoc = new HtmlDocument();
    htmldoc.LoadHtml(content);
    HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
            //If no title attribute exists check for an image alt tag
            if (linkTitle == string.Empty)
            {
                HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
                if (imageNode != null)
                {
                    linkTitle = imageNode.GetAttributeValue("alt", string.Empty);
                }
            }
            //If no image alt tag check for span with text
            if (linkTitle == string.Empty)
            {
                HtmlNode spanNode = linkNode.SelectSingleNode("span");
                if (spanNode != null)
                {
                    linkTitle = spanNode.InnerText;
                }
            }

            //Last resort: a link with no child nodes, use its own text
            if (linkTitle == string.Empty)
            {
                if (!linkNode.HasChildNodes)
                {
                    linkTitle = linkNode.InnerText;
                }
            }

        }
    }
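The loop above works out linkTitle but does not show how it reaches the onclick handler described in the question. As a rough sketch of that last step, the helper below turns a link name into an onclick value. The JavaScript function name trackLink, the class LinkTracking and the method BuildOnClick are all assumptions made for illustration; the real tracking call will differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkTracking
{
    // Hypothetical helper: builds an onclick value that calls a
    // hypothetical JavaScript function trackLink('<link name>').
    public static string BuildOnClick(string linkName)
    {
        // Collapse runs of whitespace, since inner text may span several lines.
        string name = Regex.Replace(linkName ?? string.Empty, @"\s+", " ").Trim();
        // Escape backslashes first, then single quotes, for a JS string literal.
        name = name.Replace("\\", "\\\\").Replace("'", "\\'");
        return "trackLink('" + name + "');";
    }
}
```

Inside the foreach loop, the result could then be attached with something like linkNode.SetAttributeValue("onclick", LinkTracking.BuildOnClick(linkTitle)). Names containing double quotes or angle brackets would probably need HTML-encoding as well, which this sketch does not attempt.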


 