
C#: viewing a page's HTML using Html Agility Pack

I made a C# console application that is supposed to display the HTML source of a page.

Instead, the console app prints HtmlAgilityPack.HtmlDocument.

Can anyone explain why that is?

class Program
{
    public HtmlDocument read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            return document;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;     
        }
    }     

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = Convert.ToString(dis.read());
        Console.WriteLine(text);
        Console.ReadLine();        
    }
}

Replace

 return document;

with:

 return document.DocumentNode.InnerHtml;

or, if you want to extract only the text (without HTML tags):

 return document.DocumentNode.InnerText;

Either way, the method's return type has to change from HtmlDocument to string. The whole code would be:

using System;
using HtmlAgilityPack;

class Program
{
    // Returns the page's markup as a string, or null if the download fails.
    public string read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            // InnerHtml of the root node is the HTML of the whole document.
            return document.DocumentNode.InnerHtml;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;
        }
    }

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = dis.read();
        Console.WriteLine(text);
        Console.ReadLine(); // keep the console window open
    }
}
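
To see how InnerHtml and InnerText differ without making a network request, here is a minimal sketch that parses a string instead of loading a URL. The sample markup and the class name are made up for illustration, and it assumes the HtmlAgilityPack package is referenced:

```csharp
using System;
using HtmlAgilityPack;

static class InnerHtmlVersusInnerText
{
    static void Main()
    {
        var document = new HtmlDocument();
        // Hypothetical sample markup, parsed from a string instead of a URL.
        document.LoadHtml("<p>Hello <b>world</b></p>");

        // Markup of everything under the root node, tags included.
        Console.WriteLine(document.DocumentNode.InnerHtml);

        // Text content only, with the tags stripped.
        Console.WriteLine(document.DocumentNode.InnerText);
    }
}
```

Both properties return a string, which is why read() has to be declared as returning string rather than HtmlDocument.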

The default implementation of .ToString() just returns the fully qualified name of the type, which is what you're seeing. HtmlDocument from the HtmlAgilityPack evidently doesn't override it, so Convert.ToString() falls back to that default.
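
The same behaviour can be reproduced without HtmlAgilityPack. In this small sketch the Demo namespace and both classes are hypothetical stand-ins: one type relies on the inherited Object.ToString(), the other overrides it.

```csharp
using System;

namespace Demo
{
    // Hypothetical type with no ToString() override, like HtmlDocument.
    class PlainDocument { }

    // Hypothetical type that overrides ToString() to return its content.
    class DescribedDocument
    {
        public string Html = "<p>hi</p>";
        public override string ToString() => Html;
    }

    static class ToStringDemo
    {
        static void Main()
        {
            // No override: Object.ToString() returns the fully qualified type name.
            Console.WriteLine(Convert.ToString(new PlainDocument()));     // Demo.PlainDocument

            // With an override, the same call returns whatever the override produces.
            Console.WriteLine(Convert.ToString(new DescribedDocument())); // <p>hi</p>
        }
    }
}
```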

From glancing at the code over on CodePlex, it looks like you need to use the Save function to write the output to an XmlWriter and then get the string from that. I don't see another way to get at the whole contents of the page directly from that object (though admittedly I only scanned it).
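
A rough sketch of the Save route, under some assumptions: it uses the Save overload that takes a TextWriter together with a StringWriter, rather than an XmlWriter, and it parses a made-up sample string instead of downloading a page. The helper name ToHtmlString is hypothetical.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

static class SaveDemo
{
    // Hypothetical helper: serializes the document into an in-memory string.
    public static string ToHtmlString(HtmlDocument document)
    {
        using (var writer = new StringWriter())
        {
            document.Save(writer); // writes the document's markup to the writer
            return writer.ToString();
        }
    }

    static void Main()
    {
        var document = new HtmlDocument();
        // Sample markup for illustration only.
        document.LoadHtml("<html><body><p>Hello</p></body></html>");
        Console.WriteLine(ToHtmlString(document));
    }
}
```

For simply printing the page source, reading DocumentNode.InnerHtml is shorter. Saving to a writer is mainly useful when the output should go to a file or a stream.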

Edit: Amine Hajyoussef pointed you in the right direction with document.DocumentNode.InnerHtml, though note that you'll need to change the return type of the function as well.
