
Retrieving data from HTML Code using HTML Agility Pack

The following code is meant to retrieve the top tweets from this site: http://favstar.fm/all-time-most-favorited-tweets

When I run the code, nothing is retrieved from the HTML nodes, but when I viewed the page source I found:

<p class='fs-tweet-text'>If only Bradley's arm was longer. Best photo ever. <a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23oscars" title="#oscars">#oscars</a> <a href="http://t.co/C9U5NOtGap" title="http://twitter.com/TheEllenShow/status/440322224407314432/photo/1">pic.twitter.com/C9U5NOtGap</a></p>

Source:

        HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc;
        try
        {
            doc = web.Load("http://favstar.fm/all-time-most-favorited-tweets");

            var Tweetsnodes = doc.DocumentNode.SelectNodes("//p[@class='fs-tweet-text]").ToList();
            if (Tweetsnodes != null)
            {
                for (int i = 0; i <= 4; i++)
                {
                    URLs.Add(Tweetsnodes[i].ToString());
                }
            }
            var Usernodes = doc.DocumentNode.SelectNodes("//a [@class='fs-tweeter']").ToList();
            if (Usernodes != null)
            {
                for (int i = 0; i <= 4; i++)
                {
                    Titles.Add(Usernodes[i].ToString());
                }
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }

Can anyone tell me why nothing is retrieved?

The site requires the User-Agent header to be set. (Check what your code actually gets back: var html = doc.DocumentNode.InnerHtml; )

You can set it as:

 web.UserAgent = "Stackoverflow/1.0";

Your XPath also has a small typo: the closing quote after fs-tweet-text is missing. After correcting it to //p[@class='fs-tweet-text'], it should work.
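
Putting both fixes together, a minimal sketch might look like the following. The class name TweetScraper and the helper ExtractTexts are made up for illustration and are not part of HTML Agility Pack. The sketch also assumes you want the visible text of each node, so it reads InnerText rather than calling ToString(), which as far as I know only returns the type name. Note that SelectNodes returns null when nothing matches, so calling .ToList() on the result before the null check (as in the question) would throw instead of skipping the loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TweetScraper
{
    // Hypothetical helper: returns the text of up to maxCount nodes matching
    // the XPath, or an empty list when nothing matches.
    public static List<string> ExtractTexts(HtmlDocument doc, string xpath, int maxCount)
    {
        // SelectNodes returns null (not an empty collection) when there is
        // no match, so check for null before using LINQ or indexing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Take(maxCount) // avoids an out-of-range index when fewer nodes exist
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }

    public static void Main()
    {
        var web = new HtmlWeb();
        // The site reportedly returns no usable markup without a User-Agent.
        web.UserAgent = "Stackoverflow/1.0";

        HtmlDocument doc = web.Load("http://favstar.fm/all-time-most-favorited-tweets");

        // XPath with the closing quote restored.
        List<string> tweets = ExtractTexts(doc, "//p[@class='fs-tweet-text']", 5);
        List<string> users = ExtractTexts(doc, "//a[@class='fs-tweeter']", 5);

        foreach (string tweet in tweets)
        {
            Console.WriteLine(tweet);
        }
        foreach (string user in users)
        {
            Console.WriteLine(user);
        }
    }
}
```

If both lists still come back empty, dumping doc.DocumentNode.InnerHtml as suggested above is the quickest way to see whether the server sent the page you expected.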
