C# Html Agility Pack parsing tags with multiple alternatives

Question

I don't have any experience with HTML, so excuse any incorrect terminology.

I am trying to parse an HTML document using the HTML Agility Pack, and I am looking for a very specific string.

I want to obtain all strings of the form:

<img src="..." etc="....">

So my select call is:

HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");

However, this also ends up returning strings such as

<img width="..." src="..." etc="..">

It seems to me (at least to the best of my knowledge) that the img tag is searched for and src only needs to be found on the same level, not necessarily right next to the img tag.

After looking at the documentation, I suspect I am trying to do something this function does not support.

Can someone please suggest the correct way to do this? Thanks!

Answer 1

"The img tag is searched for and src only needs to be found on the same level, not necessarily right next to the img tag."

It seems that you want to find <img> elements where src is the first attribute. Note that an XML/HTML parser is not required to preserve attribute order, so in general you should not select elements based on attribute order, e.g. requiring that the src attribute comes first.

That said, attribute order happened to be preserved by HAP in my oversimplified test, so using Attributes[0].Name* to check the name of the first attribute also worked:

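using System;
using System.Linq;
using HtmlAgilityPack;

var raw = @"<div>
    <img src=""..."" etc=""...."">
    <img width=""..."" src=""..."" etc="".."">
    <img>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
// The XPath keeps only <img> elements that have a src attribute;
// the Where clause then keeps those whose first attribute is src.
var result = doc.DocumentNode
                .SelectNodes("//img[@src]")
                .Where(o => o.Attributes[0].Name == "src")
                .ToList();
foreach (var item in result)
{
    Console.WriteLine(item.OuterHtml);
}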

Output:

<img src="..." etc="....">

*) The XPath already filters for img elements that have a src attribute, so every matched node has at least one attribute and Attributes[0].Name will never throw, if you are concerned.
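
A related edge case is worth guarding against: by default, SelectNodes returns null rather than an empty collection when nothing matches, so chaining .Where directly on it can fail on documents without any matching img. The following is only a minimal sketch of a more defensive variant, not the answer's original code; the helper name ImagesStartingWithSrc is hypothetical, and it selects "//img" without the [@src] filter purely to show the attribute-less case being handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class Program
{
    // Hypothetical helper (not part of HAP): <img> nodes whose first parsed attribute is src.
    static List<HtmlNode> ImagesStartingWithSrc(HtmlDocument doc)
    {
        // SelectNodes returns null when no node matches, so guard before using LINQ.
        HtmlNodeCollection candidates = doc.DocumentNode.SelectNodes("//img");
        if (candidates == null)
        {
            return new List<HtmlNode>();
        }

        return candidates
            // FirstOrDefault yields null for an <img> without attributes,
            // and ?. then makes the comparison false instead of throwing.
            .Where(n => n.Attributes.FirstOrDefault()?.Name == "src")
            .ToList();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p>no images here</p></div>");

        // This document contains no <img>, so the null guard is what gets exercised.
        Console.WriteLine(ImagesStartingWithSrc(doc).Count);
    }
}
```

The same caveat as above still applies: this relies on HAP keeping attributes in source order, which is observed behavior in a small test rather than something a parser has to guarantee.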

Answer 2

I am not familiar with XPath, so I am assuming yours is correct (I usually use CSS selectors via the ScrapySharp library in addition to HtmlAgilityPack).

The following console project snippet returns only the img node you want, i.e. the one with exactly 2 attributes, src and etc, no more and no fewer. I manually load sample HTML with 3 image nodes:

        HtmlDocument doc = new HtmlDocument();
        string html = @"
            <img src='img1.jpg' />
            <img src='img1.jpg' etc='etcValue' />
            <img width='200px' src='img1.jpg' />
        ";
        doc.LoadHtml(html);

        // GetAttributeValue takes a default value, returned when the attribute is missing.
        var relevantImgNodes = doc.DocumentNode.SelectNodes("//img")
            .Where(n =>
                n.Attributes.Count == 2 &&
                !string.IsNullOrEmpty(n.GetAttributeValue("src", "")) &&
                !string.IsNullOrEmpty(n.GetAttributeValue("etc", "")));

        Console.WriteLine(relevantImgNodes.Count()); // prints 1
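
If the goal is "exactly these attributes, in any order", the check above can also be written as a comparison of attribute names. This is just a sketch under that assumption: the helper HasExactlyAttributes is a hypothetical name, not an HAP API, and unlike the snippet above it compares names only and does not require the values to be non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class Program
{
    // Hypothetical helper: true when the node's attribute names are exactly
    // the given names, regardless of the order they appear in the markup.
    static bool HasExactlyAttributes(HtmlNode node, params string[] names)
    {
        var expected = new HashSet<string>(names, StringComparer.OrdinalIgnoreCase);

        // The Count check rejects nodes that repeat an attribute name.
        return node.Attributes.Count == expected.Count
            && expected.SetEquals(node.Attributes.Select(a => a.Name));
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"
            <img src='img1.jpg' />
            <img src='img1.jpg' etc='etcValue' />
            <img width='200px' src='img1.jpg' />
        ");

        // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> images =
            doc.DocumentNode.SelectNodes("//img") ?? Enumerable.Empty<HtmlNode>();

        foreach (var img in images.Where(n => HasExactlyAttributes(n, "src", "etc")))
        {
            Console.WriteLine(img.OuterHtml);
        }
    }
}
```

Comparing names as a set avoids any dependence on attribute order, which is the part a parser is not required to preserve.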

solution1 1 ACCEPTED 2016-05-10 10:17:35

solution2 0 2016-05-10 10:07:56
