
I'm trying to get a list of images from a website and save them to my hard disk, but it doesn't work

I'm using HtmlAgilityPack.

In this function, imageNodes is empty (its Count is 0) by the time the foreach loop runs.

I don't understand why the list count is 0.

The website contains many images. I want to get a list of the images from the site, show that list in richTextBox1, and also save all of the images to my hard disk.

How can I fix it?

public void GetAllImages()
{
   // Bing Image Result for Cat, First Page
   string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";

   // For speed of dev, I use a WebClient
   WebClient client = new WebClient();
   string html = client.DownloadString(url);

   // Load the Html into the agility pack
   HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
   doc.LoadHtml(html);

   // Now, using LINQ to get all Images
   List<HtmlNode> imageNodes = null;
   imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
                 where node.Name == "img"
                    && node.Attributes["class"] != null
                    && node.Attributes["class"].Value.StartsWith("img_")
                 select node).ToList();

   foreach (HtmlNode node in imageNodes)
   {
      // Console.WriteLine(node.Attributes["src"].Value);
      richTextBox1.Text += node.Attributes["src"].Value + Environment.NewLine;
   }
}

As far as I can see, the correct class for the Bing images is sg_t. You can obtain those HtmlNodes with the following LINQ query:

List<HtmlNode> imageNodes = doc.DocumentNode.Descendants("img")
    .Where(n=> n.Attributes["class"] != null && n.Attributes["class"].Value == "sg_t")
    .ToList();

This list should then contain every img element whose class is exactly 'sg_t'.
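The question also asks how to save the images to disk, which the query above does not cover. The following is a minimal sketch of one way to do it, assuming the image URLs have already been collected as strings. The names ImageSaver, BuildFileName and SaveAll are hypothetical helpers introduced for illustration, and the ".jpg" fallback extension is an assumption, since thumbnail URLs such as thumbnail.aspx carry no image extension.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

public static class ImageSaver
{
    // Extensions we accept as-is; anything else falls back to ".jpg" (an assumption).
    private static readonly string[] KnownExtensions = { ".jpg", ".jpeg", ".png", ".gif" };

    // Hypothetical helper: builds a file name such as "image_003.jpg" from an index and a URL.
    public static string BuildFileName(int index, string imageUrl)
    {
        string extension = ".jpg";
        Uri uri;
        if (Uri.TryCreate(imageUrl, UriKind.Absolute, out uri))
        {
            // AbsolutePath excludes the query string, so "?q=..." does not confuse GetExtension.
            string candidate = Path.GetExtension(uri.AbsolutePath).ToLowerInvariant();
            if (Array.IndexOf(KnownExtensions, candidate) >= 0)
            {
                extension = candidate;
            }
        }
        return string.Format("image_{0:D3}{1}", index, extension);
    }

    // Hypothetical helper: downloads every URL into targetFolder, skipping downloads that fail.
    public static void SaveAll(IEnumerable<string> imageUrls, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);
        using (var client = new WebClient())
        {
            int index = 0;
            foreach (string url in imageUrls)
            {
                string path = Path.Combine(targetFolder, BuildFileName(index++, url));
                try
                {
                    client.DownloadFile(url, path);
                }
                catch (WebException)
                {
                    // One broken link should not abort the remaining downloads.
                }
            }
        }
    }
}
```

To use it with the imageNodes list, pass each node's src value through WebUtility.HtmlDecode first, because the raw attribute text still contains entities such as &amp;. WebClient is kept here only because the question already uses it.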

A quick look at the example page/URL in your code shows that the images you are after do not have a class starting with "img_".

<img class="sg_t" src="http://ts2.mm.bing.net/images/thumbnail.aspx?q=4588327016989297&amp;id=db87e23954c9a0360784c0546cd1919c&amp;url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg" style="height:133px;top:2px">

I notice your code is targeting the thumbnails only. You also want the full-size image URL, which is in the anchor surrounding each thumbnail. You will need to pull the final URL from an anchor that looks like this:

<a href="/images/search?q=cat&amp;view=detail&amp;id=89929E55C0136232A79DF760E3859B9952E22F69&amp;first=0&amp;FORM=IDFRIR" class="sg_tc" h="ID=API.images,18.1"><img class="sg_t" src="http://ts2.mm.bing.net/images/thumbnail.aspx?q=4588327016989297&amp;id=db87e23954c9a0360784c0546cd1919c&amp;url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg" style="height:133px;top:2px"></a>

and decode the bit that looks like: url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg

which decodes to: http://actnowtraining.files.wordpress.com/2012/02/cat.jpg
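The decoding step can be sketched as follows. This is only an illustration under stated assumptions: BingThumbnail and ExtractOriginalUrl are hypothetical names, and the sketch relies on the url= parameter being present in the attribute value you pass in (in the sample markup above it appears in the img src inside the anchor). It uses only System.Net.WebUtility and System.Uri, so it does not depend on HtmlAgilityPack.

```csharp
using System;
using System.Net;

public static class BingThumbnail
{
    // Hypothetical helper: returns the decoded value of the "url" query
    // parameter, or null when the attribute value has no such parameter.
    public static string ExtractOriginalUrl(string rawAttributeValue)
    {
        if (string.IsNullOrEmpty(rawAttributeValue)) return null;

        // Attribute text copied from HTML source still contains entities such as &amp;.
        string value = WebUtility.HtmlDecode(rawAttributeValue);

        int queryStart = value.IndexOf('?');
        if (queryStart < 0) return null;

        // Walk the key=value pairs of the query string.
        foreach (string pair in value.Substring(queryStart + 1).Split('&'))
        {
            int eq = pair.IndexOf('=');
            if (eq > 0 && pair.Substring(0, eq) == "url")
            {
                // Turns %3a%2f%2f... back into ://...
                return Uri.UnescapeDataString(pair.Substring(eq + 1));
            }
        }
        return null;
    }
}
```

Calling ExtractOriginalUrl on the src value shown above yields the actnowtraining.files.wordpress.com address, while a value without a url= parameter (such as the anchor's href in the sample) yields null.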
