简体   繁体   中英

How do i download all images from a List<string> with many links for images inside?

private List<string> retrieveImages(string address)
{

    System.Net.WebClient wc = new System.Net.WebClient();
    List<string> imgList = new List<string>();
    doc.Load(wc.OpenRead(address)); //or whatever HTML file you have 
    HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
    if (imgs == null) return new List<string>();

    foreach (HtmlNode img in imgs)
    {
        if (img.Attributes["src"] == null)
            continue;
        HtmlAttribute src = img.Attributes["src"];

        imgList.Add(src.Value);
        //Image imgDownload = GetImage(src.Value);
        //imgDownload.Save(@"d:\myImages");
    }
    return imgList;
}

In some case the List imgList contain 33 items and it looks like that:

In the first place [0] I see: /images/experiments/nav_logo78.png as a link for the image I don't see and http in the start or www just start with /images

Then in place [1] I see: //maps.gstatic.com/mapfiles/transparent.png

Then after some items in place [10] I see: http://mt1.google.com/vt/lyrs=m@186000000&hl=iw&src=app&x=75&y=51&z=7&s=Gali

Im not sure what gali is I don't see .bmp .gif or .png images just Gali.

What I want is to download all the this images from each link and save it to my hard disk. So I have this function for the download:

private Image GetImage(string url)
{
    System.Net.WebRequest request = System.Net.WebRequest.Create(url);

    System.Net.WebResponse response = request.GetResponse();
    System.IO.Stream responseStream = response.GetResponseStream();

    Bitmap bmp = new Bitmap(responseStream);

    responseStream.Dispose();

    return bmp;
} 

When im using this GetImage function in the retrieveImages() function it dosent do anything the program even dosent owrk I mean the List imgList is empty. If I mark this two lines as it is now:

//Image imgDownload = GetImage(src.Value);
//imgDownload.Save(@"d:\myImages");

If I mark them with // not ot use them then evrything is working but if im using them nothing is working and its not saving anything to my hard disk.

What should I do?

Edit:

I just changed my retrieveImages function to this:

private List<string> retrieveImages(string address)
        {

            System.Net.WebClient wc = new System.Net.WebClient();
            List<string> imgList = new List<string>();
            doc.Load(wc.OpenRead(address)); 
            HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
            if (imgs == null) return new List<string>();

            foreach (HtmlNode img in imgs)
            {
                if (img.Attributes["src"] == null)
                    continue;
                HtmlAttribute src = img.Attributes["src"];

                imgList.Add(src.Value);
                wc.DownloadFile(src.Value ,  @"d:\MyImages\my.gif");
            }
            return imgList;
        }

I used a breakpoint on the line wc.DownloadFile and it throw me an exception: Webexception Was Caught

Could not find a part of the path 'D:\\textinputassistant\\tia.png'.

In src.Value it waontin in this case: /textinputassistant/tia.png So you told me to avoid links that have not http or https or www in the start i will fix it. The question is if the exception is since this line start with / and it dosent have any http/s/www ?

The full exception:

System.Net.WebException was caught
  Message=Could not find a part of the path 'D:\textinputassistant\tia.png'.
  Source=System
  StackTrace:
       at System.Net.WebClient.DownloadFile(Uri address, String fileName)
       at System.Net.WebClient.DownloadFile(String address, String fileName)
       at GatherLinks.Form1.retrieveImages(String address) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 328
       at GatherLinks.Form1.webCrawler(String url, Int32 levels, DoWorkEventArgs eve) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 97
  InnerException: System.Net.WebException
       Message=Could not find a part of the path 'D:\textinputassistant\tia.png'.
       Source=System
       StackTrace:
            at System.Net.FileWebResponse..ctor(FileWebRequest request, Uri uri, FileAccess access, Boolean asyncHint)
            at System.Net.FileWebRequest.GetResponseCallback(Object state)
       InnerException: System.IO.DirectoryNotFoundException
            Message=Could not find a part of the path 'D:\textinputassistant\tia.png'.
            Source=mscorlib
            StackTrace:
                 at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
                 at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath)
                 at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, String msgPath, Boolean bFromProxy)
                 at System.Net.FileWebStream..ctor(FileWebRequest request, String path, FileMode mode, FileAccess access, FileShare sharing, Int32 length, Boolean async)
                 at System.Net.FileWebResponse..ctor(FileWebRequest request, Uri uri, FileAccess access, Boolean asyncHint)

InnerException:

Just added a filter so it will save only links that start with http:

private List<string> retrieveImages(string address)
        {

            System.Net.WebClient wc = new System.Net.WebClient();
            List<string> imgList = new List<string>();
            doc.Load(wc.OpenRead(address));
            HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
            if (imgs == null) return new List<string>();

            foreach (HtmlNode img in imgs)
            {
                if (img.Attributes["src"] == null)
                    continue;
                HtmlAttribute src = img.Attributes["src"];

                imgList.Add(src.Value);
                if (src.Value.Contains("http"))
                {
                    wc.DownloadFile(src.Value, @"d:\MyImages\my.gif");
                }
            }
            return imgList;
        }

Now src.Value contain: http://mt1.google.com/vt/lyrs=m@186000000&hl=iw&src=app&x=75&y=51&z=7&s=Gali

Then after it trying to download im getting exception: WebException Was Caught

The remote server returned an error: (403) Forbidden.

System.Net.WebException was caught
  Message=The remote server returned an error: (403) Forbidden.
  Source=System
  StackTrace:
       at System.Net.WebClient.DownloadFile(Uri address, String fileName)
       at System.Net.WebClient.DownloadFile(String address, String fileName)
       at GatherLinks.Form1.retrieveImages(String address) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 330
       at GatherLinks.Form1.webCrawler(String url, Int32 levels, DoWorkEventArgs eve) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 97
  InnerException: 

The question is if the exception throw up since the site in this case google is blocking downloads or since the link end with Gali wich is im not sure what type of file it is ?

I would first ignore images that don't have a valid link ie no http://

For saving a file right to disk you can download the binary for it and save like so:

string URL="http://www.yourdomain.com/file1.zip";
string DestinationPath="C:\file1.jpg";
System.Net.WebClient Client = new WebClient();
Client.DownloadFile(URL,DestinationPath);

You don't have to convert an image to a .net Image to save it. I have some similiar code in some import apps I wrote recently

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM