
Get div information with Html Agility Pack

Hi, I want to process the information on an HTML page. With the following code I can get the information, and this is the order in which it is received:

new-link-1

new-link-2

new-link-3

But when it reaches the new-link-no-titel section, the order breaks and changes to:

new-link-3

new-link-1

new-link-2

At the end, the program stops with an ArgumentOutOfRangeException.

    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = await web.LoadFromWebAsync(Link);

    foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
    {
        var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
        var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];

        MessageBox.Show(item.InnerText);
        MessageBox.Show(x);
        MessageBox.Show(xx.Attributes["href"].Value);
    }

And the HTML:

<div id="new-link">
                <ul>
                <li>
                    <div class="new-link-1"> فصل پنجم</div>
                    <div class="new-link-2"> تکمیل شده</div>
                    <div class="new-link-3">
                        <a href="http://dlldsubtitle.info/Serial/1397/Silicon.Valley.S05_WorldSubtitle.zip">دانلود با لینک مستقیم</a>
                    </div>
                </li>

                <li class="new-link-no-titel">
                    <div class="new-link-1"> فصل ششم</div>
                    <div class="new-link-2"> درحال پخش</div>
                    <div class="new-link-3">
                        <i class="fa fa-arrow-down" title="حال پخش">

                        </i>
                    </div>
                </li>
                <li>
                    <div class="new-link-1"> قسمت 1</div>
                    <div class="new-link-2"> پخش شده</div>
                    <div class="new-link-3">
                        <a href="http://dl.worldsubtitle.info/Serial/1398/Silicon.Valley.S06E01_WorldSubtitle.zip">دانلودلینک مستقیم</a>
                    </div>
                </li>

                <li>
                    <div class="new-link-1"> قسمت 7</div>
                    <div class="new-link-2"> پخش شده</div>
                    <div class="new-link-3">
                        <a href="http://dl.worldsubtitle.info/Serial/1398/Silicon.Valley.S06E07_WorldSubtitle.zip">دانلود با لینک مستقیم</a>
                    </div>
                </li>
            </ul>
        </div>

This is what I found to be the issue with your code.

foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> yields 4 indices (0 to 3)
item.SelectNodes("//div[@class='new-link-2']")    // -> returns 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> returns only 3 nodes

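Issue: An XPath expression that starts with //div searches all nodes in the document, not just those under the item you are currently on. Because one row has no a element, the third query returns only 3 nodes. The links after that row are therefore paired with the wrong item, and indexing the 3-node result with index 3 throws the ArgumentOutOfRangeException.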
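
The scoping difference can be checked in isolation. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The markup is a simplified, hypothetical stand-in for the page in the question, and the names XPathScopeDemo, SampleHtml and CountLinks are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class XPathScopeDemo
{
    // Hypothetical, simplified markup: three rows, the middle one has no <a>.
    public const string SampleHtml = @"
<ul>
  <li><div class='new-link-3'><a href='a.zip'>one</a></div></li>
  <li><div class='new-link-3'><i></i></div></li>
  <li><div class='new-link-3'><a href='c.zip'>three</a></div></li>
</ul>";

    // Counts the <a> matches seen from the <li> at liIndex, first with an
    // absolute path (//...) and then with a relative one (.//...).
    public static (int Absolute, int Relative) CountLinks(string html, int liIndex)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode li = doc.DocumentNode.SelectNodes("//li")[liIndex];

        // SelectNodes returns null when nothing matches, hence the ?? 0.
        int absolute = li.SelectNodes("//div[@class='new-link-3']//a")?.Count ?? 0;
        int relative = li.SelectNodes(".//div[@class='new-link-3']//a")?.Count ?? 0;
        return (absolute, relative);
    }

    static void Main()
    {
        var (absolute, relative) = CountLinks(SampleHtml, 1);
        Console.WriteLine($"absolute: {absolute}, relative: {relative}");
        // prints "absolute: 2, relative: 0"
    }
}
```

Called on the middle row, the absolute query still finds both links elsewhere in the document, while the relative query finds none.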

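Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot, only the descendants of the current node are considered (excerpted from the linked source). Iterating over the li elements and querying inside each one keeps the three values of a row together: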

    foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
    {
        // The leading dot scopes each query to the current <li>.
        var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
        var x = item.SelectSingleNode(".//div[@class='new-link-2']");
        var xx = item.SelectSingleNode(".//a");

        if (x0 != null)
            MessageBox.Show(x0.InnerText);
        if (x != null)
            MessageBox.Show(x.InnerText);

        // Rows such as new-link-no-titel contain no <a>, so xx can be null.
        var href = xx?.GetAttributeValue("href", null);
        if (href != null)
            MessageBox.Show(href);
    }
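
If the values are needed for further processing rather than for message boxes, the same per-row approach can collect them into a list. This is only a sketch under assumptions: it requires the HtmlAgilityPack NuGet package, the class name LinkRowParser and the tuple field names are invented here, and the sample markup with its example.com URL is a hypothetical reduction of the page in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkRowParser
{
    // Returns one entry per <li>; Href is null when the row has no <a> element.
    public static List<(string Title, string Status, string Href)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<(string Title, string Status, string Href)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var items = doc.DocumentNode.SelectNodes("//div[@id='new-link']//li");
        if (items == null)
            return rows;

        foreach (HtmlNode li in items)
        {
            // Every query is relative (.//), so it only sees the current row.
            string title = li.SelectSingleNode(".//div[@class='new-link-1']")?.InnerText.Trim();
            string status = li.SelectSingleNode(".//div[@class='new-link-2']")?.InnerText.Trim();
            string href = li.SelectSingleNode(".//div[@class='new-link-3']//a")
                            ?.GetAttributeValue("href", null);
            rows.Add((title, status, href));
        }
        return rows;
    }

    static void Main()
    {
        // Hypothetical sample: the second row has an icon instead of a link.
        const string html = @"<div id='new-link'><ul>
  <li><div class='new-link-1'>Season 5</div><div class='new-link-2'>Completed</div>
      <div class='new-link-3'><a href='http://example.com/s05.zip'>Download</a></div></li>
  <li><div class='new-link-1'>Season 6</div><div class='new-link-2'>Airing</div>
      <div class='new-link-3'><i class='fa fa-arrow-down'></i></div></li>
</ul></div>";

        foreach (var row in Parse(html))
            Console.WriteLine($"{row.Title} | {row.Status} | {row.Href ?? "(no link)"}");
    }
}
```

Keeping a null Href for rows without a link preserves the one-to-one mapping between rows and results, which is what the index-based version lost.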
