
Using HTMLAgility Pack to Extract Links

Consider this minimal piece of code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml("http://www.google.com");

            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
            }
        }
    }
}

This effectively doesn't do anything at all, and is copied from (or inspired by) various other StackOverflow questions. It compiles fine, but at runtime it throws an exception, "Object reference not set to an instance of an object.", pointing at the foreach line.

I can't understand why the environment has become irritable toward this humble, innocent, and useless piece of code.

I would also like to know: does HtmlAgilityPack support selecting nodes by their HTML class?

If you want to load HTML from the web, you need to use the HtmlWeb object:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
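Putting this together, a corrected version of the original program might look like the sketch below. One extra detail worth guarding against: SelectNodes returns null (not an empty collection) when nothing matches, which is exactly the NullReferenceException the question ran into.

```csharp
using System;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page in one step.
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.google.com");

            // SelectNodes returns null when no nodes match,
            // so check before iterating.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    Console.WriteLine(link.GetAttributeValue("href", ""));
                }
            }
        }
    }
}
```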

LoadHtml takes a string of actual HTML as an argument, so the original code parses the literal text "http://www.google.com" as a document; that document contains no anchor elements, SelectNodes returns null, and the foreach throws. If you already have an open connection, you can pass Load a Stream from WebResponse.GetResponseStream() instead:

// Requires System.Net and System.IO.
WebRequest req = WebRequest.Create("http://www.google.com");
using (Stream s = req.GetResponse().GetResponseStream())
{
    doc.Load(s);
}
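As for the second question: HtmlAgilityPack has no dedicated class selector, but class-based matching works through XPath on the class attribute. A small sketch, using invented markup purely for illustration:

```csharp
using System;
using HtmlAgilityPack;

class ClassSelectExample
{
    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        // LoadHtml takes raw HTML, not a URL.
        doc.LoadHtml("<div class='result'>one</div><div class='result wide'>two</div>");

        // contains() also matches elements carrying several classes.
        var nodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'result')]");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
                Console.WriteLine(node.InnerText);
        }
    }
}
```

Note that contains() is a substring test, so a class like "results" would also match; for exact matching, compare @class directly or split on whitespace.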
