
In C#, how to get a unique list of all tags using Html Agility Pack

How can I get a unique list of all tags from an HTML string? So far I am only able to extract the tags one by one.

Code

public static void HtmlParser()
{
    string html = @"<TD >
    <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width=""50%"">
        <INPUT class=box value=John maxLength=16 size=16 name=user_name>
    </TD>
    <TR vAlign=center> <code> This is a <kwd>vba</kwd> code piece</code>  Hi I am sujoy";

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    string code = htmlDoc.DocumentNode
    .SelectSingleNode("//code").InnerHtml;
    string TD = htmlDoc.DocumentNode
    .SelectSingleNode("//TD").InnerText;
}

For the above code, I want the output to be a list such as {"DIV","TD","TR","CODE"}.

I'm not sure exactly what you mean by "a unique list of all tags from an HTML string".

If you want every node in the HTML document (elements as well as text and comment nodes), use:

htmlDoc.DocumentNode.Descendants();

If you want a list of all <code> tags, one way to achieve that is using LINQ:

htmlDoc.DocumentNode.Descendants().Where(d => d.Name == "code");
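The same filter can also be written with the Descendants(string name) overload, which takes the tag name directly. The following is only a minimal sketch: the class name CodeTagFinder and the method CodeTexts are hypothetical names chosen for illustration, and the Html Agility Pack NuGet package is assumed to be referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class CodeTagFinder
{
    // Collects the inner text of every <code> element in the given HTML.
    public static List<string> CodeTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("code")       // only elements named "code"
            .Select(n => n.InnerText)  // text content of each match
            .ToList();
    }
}
```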

Edit:

A list of all unique tags can be retrieved this way, for example:

htmlDoc.DocumentNode.Descendants().Where(d => !d.Name.StartsWith("#")).Select(d => d.Name).GroupBy(d => d).Select(g => g.Key)

This uses LINQ to go through the following steps:

  1. Remove descendants whose names begin with '#' (such as #comment and #text), leaving only the element tags.
  2. Select only the tag names, so the result is a sequence of strings, as requested.
  3. Group by tag name, so each name appears only once.
  4. Select the group keys, which are the unique tag names.
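The steps above can probably be shortened with Distinct() in place of the GroupBy/Select pair, and with a node-type check in place of the '#' prefix test. This is a sketch under assumptions rather than the answer's own method: TagLister and UniqueTagNames are hypothetical names, and the Html Agility Pack NuGet package is assumed to be referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class TagLister
{
    // Returns each element name found in the HTML exactly once.
    public static List<string> UniqueTagNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element) // skip text and comment nodes
            .Select(n => n.Name)                            // Name is reported in lowercase
            .Distinct()                                     // drop repeated names
            .ToList();
    }
}
```

Calling TagLister.UniqueTagNames(html) with the HTML string from the question should then yield one entry per distinct tag.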

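Use htmlDoc.DocumentNode.Descendants() and collect the names in a HashSet to get a unique list:

public static void HtmlParser()
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml("Your html string containing tags like <div></div>...");
    // A HashSet ignores duplicate additions, so each name is stored once.
    HashSet<string> hs = new HashSet<string>();
    foreach (var node in htmlDoc.DocumentNode.Descendants())
    {
        // Skip text and comment nodes, whose names are "#text" and "#comment".
        if (node.NodeType == HtmlNodeType.Element)
        {
            hs.Add(node.Name);
        }
    }
}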
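The question asks for upper-case names such as "DIV", whereas HtmlNode.Name is reported in lowercase. One possible way to bridge that gap is to upper-case each name before adding it to the set. The sketch below assumes the Html Agility Pack NuGet package is referenced; UpperCaseTags and Collect are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class UpperCaseTags
{
    // Returns the distinct element names in the HTML, converted to upper case.
    public static HashSet<string> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new HashSet<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Only element nodes carry real tag names.
            if (node.NodeType != HtmlNodeType.Element)
            {
                continue;
            }
            names.Add(node.Name.ToUpperInvariant());
        }
        return names;
    }
}
```

Returning the set instead of discarding it lets the caller print or compare the result, for example with string.Join(", ", UpperCaseTags.Collect(html)).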

