Using XPath predicates with HtmlAgilityPack

I want to fetch data from a website using HtmlAgilityPack (C#). The page content looks like this:

<div id="list">
  <div class="list1">
    <a href="example1.com" class="href1" >A1</a>
    <a href="example4.com" class="href2" />
  </div>
  <div class="list2">
   <a href="example2.com" class="href1" >A2</a>
   <a href="example5.com" class="href2" />
  </div>
  <div class="list3">
   <a href="example3.com" class="href1" >A3</a>
   <a href="example6.com" class="href2" />
  </div>
  <div class="list3">
   <a href="example4.com" class="href1" >A4</a>
   <a href="example6.com" class="href2" />
  </div>
  <div class="list3">
   <a href="example5.com" class="href1" >A5</a>
   <a href="example6.com" class="href2" />
  </div><div class="list3">
   <a href="example6.com" class="href1" >A6</a>
   <a href="example6.com" class="href2" />
  </div><div class="list3">
   <a href="example3.com" class="href1" >A7</a>
   <a href="example6.com" class="href2" />
  </div>
</div>

Here there are 7 links with class="href1". I want to fetch only 3 of them (the 3rd through the 5th). How can I select just these links?

This kind of code:

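    HtmlDocument doc = new HtmlDocument();
    doc.Load(myHtmlFile);
    // position() is the div's index among its sibling divs, so this keeps the 3rd to 5th divs.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
        "//div[@class='list3' and position() > 2 and position() < 6]/a[@class='href1']");
    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("node:" + node.InnerText);
        }
    }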

will give you this result:

node:A3
node:A4
node:A5
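
Note that in the query above position() counts the div elements among their siblings, not the href1 links themselves. It gives the right answer here only because every div holds exactly one href1 link and the wanted divs are the 3rd to 5th children. If you would rather index the links directly, one option is to parenthesize the node-set and apply the predicate to it. The sketch below is only an illustration: it uses System.Xml.Linq instead of HtmlAgilityPack so it runs without extra packages, and PositionDemo, Sample and Select are made-up names. The sample markup is a trimmed copy of the question's data.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // XPathSelectElements extension method

public static class PositionDemo
{
    // Trimmed copy of the question's markup (href2 links omitted for brevity).
    private const string Sample = @"<div id=""list"">
  <div class=""list1""><a href=""example1.com"" class=""href1"">A1</a></div>
  <div class=""list2""><a href=""example2.com"" class=""href1"">A2</a></div>
  <div class=""list3""><a href=""example3.com"" class=""href1"">A3</a></div>
  <div class=""list3""><a href=""example4.com"" class=""href1"">A4</a></div>
  <div class=""list3""><a href=""example5.com"" class=""href1"">A5</a></div>
  <div class=""list3""><a href=""example6.com"" class=""href1"">A6</a></div>
  <div class=""list3""><a href=""example3.com"" class=""href1"">A7</a></div>
</div>";

    // Runs an XPath query against the sample and returns the text of each match.
    public static string[] Select(string xpath) =>
        XDocument.Parse(Sample)
                 .XPathSelectElements(xpath)
                 .Select(a => a.Value)
                 .ToArray();

    public static void Main()
    {
        // Predicate on the div step: position() is the div's index among sibling divs.
        Console.WriteLine(string.Join(",", Select(
            "//div[@class='list3' and position() > 2 and position() < 6]/a[@class='href1']")));
        // prints "A3,A4,A5"

        // Predicate on the parenthesized node-set: position() is the link's index
        // among all href1 links in document order.
        Console.WriteLine(string.Join(",", Select(
            "(//a[@class='href1'])[position() >= 3 and position() <= 5]")));
        // prints "A3,A4,A5"
    }
}
```

Both queries agree on this markup. The second form should keep working if the links stop lining up one per div. I have not checked it against every HtmlAgilityPack version, so treat it as something to try rather than a guarantee.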

Your data already appears to be well-formed XML. If you're parsing XHTML pages, you could probably get away with the LINQ to XML classes (System.Xml.Linq) that ship with the .NET Framework. For example, to load your data into an XElement, you could use:

XElement xElement = XElement.Parse(@"
    <div id=""list"">
        <div class=""list1"">
            <a href=""example1.com"" class=""href1"" >A1</a>
            <a href=""example4.com"" class=""href2"" />
        </div>
        <div class=""list2"">
            <a href=""example2.com"" class=""href1"" >A2</a>
            <a href=""example5.com"" class=""href2"" />
        </div>
        <div class=""list3"">
            <a href=""example3.com"" class=""href1"" >A3</a>
            <a href=""example6.com"" class=""href2"" />
        </div>
        <div class=""list3"">
            <a href=""example4.com"" class=""href1"" >A4</a>
            <a href=""example6.com"" class=""href2"" />
        </div>
        <div class=""list3"">
            <a href=""example5.com"" class=""href1"" >A5</a>
            <a href=""example6.com"" class=""href2"" />
        </div>
        <div class=""list3"">
            <a href=""example6.com"" class=""href1"" >A6</a>
            <a href=""example6.com"" class=""href2"" />
        </div>
        <div class=""list3"">
            <a href=""example3.com"" class=""href1"" >A7</a>
            <a href=""example6.com"" class=""href2"" />
        </div>
    </div>");

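Then, to select the third to fifth <a> elements whose class attribute has a value of href1 (XPathSelectElements is an extension method from System.Xml.XPath, and Skip and Take come from System.Linq), use: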

var links = xElement.XPathSelectElements("//a[@class='href1']").Skip(2).Take(3).ToList();
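
For completeness, here is a small self-contained sketch of the same Skip/Take idea that also pulls out the href values. LinkRange and Hrefs are hypothetical names chosen for illustration, and the inline XML is a compact copy of the question's data. Treat it as one possible arrangement, not the only way to do it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // XPathSelectElements extension method

public static class LinkRange
{
    // Returns the href values of the first..last (1-based, inclusive)
    // <a class="href1"> elements, in document order.
    public static List<string> Hrefs(XElement root, int first, int last) =>
        root.XPathSelectElements("//a[@class='href1']")
            .Skip(first - 1)
            .Take(last - first + 1)
            .Select(a => (string)a.Attribute("href"))
            .ToList();

    public static void Main()
    {
        string xml = "<div id='list'>"
            + "<div class='list1'><a href='example1.com' class='href1'>A1</a><a href='example4.com' class='href2'/></div>"
            + "<div class='list2'><a href='example2.com' class='href1'>A2</a><a href='example5.com' class='href2'/></div>"
            + "<div class='list3'><a href='example3.com' class='href1'>A3</a><a href='example6.com' class='href2'/></div>"
            + "<div class='list3'><a href='example4.com' class='href1'>A4</a><a href='example6.com' class='href2'/></div>"
            + "<div class='list3'><a href='example5.com' class='href1'>A5</a><a href='example6.com' class='href2'/></div>"
            + "<div class='list3'><a href='example6.com' class='href1'>A6</a><a href='example6.com' class='href2'/></div>"
            + "<div class='list3'><a href='example3.com' class='href1'>A7</a><a href='example6.com' class='href2'/></div>"
            + "</div>";

        XElement root = XElement.Parse(xml);
        foreach (string href in Hrefs(root, 3, 5))
        {
            Console.WriteLine(href);
        }
        // prints example3.com, example4.com, example5.com (one per line)
    }
}
```

Unlike the positional XPath, Skip/Take always yields a (possibly shorter) list, so asking for links 3 to 5 on a page with only four links simply returns two items.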

If, on the other hand, you have an HtmlAgilityPack.HtmlDocument instance, you could execute an XPath query using:

HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@class='href1']");
var links3to5 = links.Cast<HtmlNode>().Skip(2).Take(3).ToList();
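
One thing to watch with the HtmlAgilityPack version: SelectNodes returns null rather than an empty collection when the query matches nothing, so calling Cast on the result would throw on a page without href1 links. Below is a hedged sketch of a guarded helper that also reads the href attribute. HapLinkRange and Hrefs are hypothetical names, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the base class library

public static class HapLinkRange
{
    // Hypothetical helper: href values of 'take' a.href1 links after skipping 'skip'.
    public static List<string> Hrefs(string html, int skip, int take)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@class='href1']");
        if (links == null)
        {
            // No match: return an empty list instead of letting LINQ throw.
            return new List<string>();
        }

        return links.Cast<HtmlNode>()
                    .Skip(skip)
                    .Take(take)
                    .Select(a => a.GetAttributeValue("href", string.Empty))
                    .ToList();
    }
}
```

With the question's markup, HapLinkRange.Hrefs(html, 2, 3) should correspond to links A3 to A5.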
