Download images from a website by URL and sort them by description

I am trying to download images from a website and then sort them into folders based on their descriptions. So far, my script parses the HTML and extracts the information I need (the URL of each image and its description). I also added two more columns: the name of each file, and the full path (folder plus file name) where the file should be saved.

I am stuck on the next part. For each link, I want to check whether the folder already exists and, in that same if statement, whether the file already exists. If both exist, the script should move on to the next link. If the file does not exist, it should download the file at that point. Otherwise (an elif), if the folder does not exist, it should create the folder and download the file. I outlined this section below.

The problem is that I do not know how to download the files or how to check whether they exist. I also do not understand how to pull information from multiple lists at once: for each link, the full path and file name have to come from another column of the CSV, which is a separate list. Can someone please help?

The outline of what I want the next part of the script to do comes first, followed by my code up to the point where I am stuck.

for elem in full_links
        if full_path  exists
                run test for if file name exists
                if file name exists = true
                        move onto the next file
                        if last file in list
                                break
                elif  file name exists = false
                        download image to location with name in list

        elif full_path does not exist
                download image with file path and name

The code I have so far:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import requests
import csv
import time
import urllib.request
import pandas as pd 
import wget



URL = 'https://www.baps.org/Vicharan'
content = requests.get(URL)

soup = BeautifulSoup(content.text, 'html.parser')

#create a csv
f=csv.writer(open('crawl3.csv' , 'w'))
f.writerow(['description' , 'full_link', 'name','full_path' , 'full_path_with_jpg_name'])



# Use the 'fullview' class 
panelrow = soup.find('div' , {'id' : 'fullview'})

main_class =  panelrow.find_all('div' , {'class' : 'col-xl-3 col-lg-3 col-md-3 col-sm-12 col-xs-12 padding5'})

# Look for 'highslide-- img-flag' links
individual_classes = panelrow.find_all('a' , {'class' : 'highslide-- img-flag'})

# Get the img tags, each <a> tag contains one
images = [i.img for i in individual_classes]

for image in images:
    src=image.get('src')
    full_link = 'https://www.baps.org' + src
    description = image.get('alt')
    name = full_link.split('/')[-1]
    full_path = '/home/pi/image_downloader_test/' + description + '/'
    full_path_with_jpg_name = full_path + name 
    f.writerow([description , full_link , name, full_path , full_path_with_jpg_name])

print('-----------------------------------------------------------------------')
print('-----------------------------------------------------------------------')
print('finished with search  and csv created. Now moving onto download portion')
print('-----------------------------------------------------------------------')
print('-----------------------------------------------------------------------')



f = open('crawl3.csv')
csv_f = csv.reader(f)

descriptions = []
full_links = []
names = []
full_path = []
full_path_with_jpg_name = []

next(csv_f)  # skip the header row so the lists hold only data
for row in csv_f:
    descriptions.append(row[0])
    full_links.append(row[1])
    names.append(row[2])
    full_path.append(row[3])
    full_path_with_jpg_name.append(row[4])

To answer the various parts of your question:

  1. To check whether a folder or file exists, use the os module:

    import os

    if not os.path.exists(path_to_folder):
        os.makedirs(path_to_folder)
    if not os.path.exists(path_to_file):
        pass  # do something, e.g. download the file
  2. Downloading files

    If you have the src of an image and the file name you want to save it under, you can download the file with the urllib.request module like so:

    urllib.request.urlretrieve(image_src, path_to_file)
  3. Iterating through multiple lists at the same time

    Finally, if you want to pull information from multiple lists, you can use the built-in zip function. For example, to iterate through full_links and full_path at the same time:

    for link, path in zip(full_links, full_path):
        print(link, path)  # do something with link and path
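
Putting the three pieces together, the loop from the question's outline could look roughly like the sketch below. It is only a sketch under a few assumptions: download_missing and its fetch parameter are hypothetical names that do not appear in the original script, fetch defaults to urllib.request.urlretrieve and exists only so the loop can be tried without network access, and the path list is assumed to hold full file paths (folder plus file name) with no CSV header entry.

```python
import os
import urllib.request


def download_missing(links, file_paths, fetch=urllib.request.urlretrieve):
    """Download each link to its matching path, skipping files that exist.

    download_missing and fetch are hypothetical names for illustration;
    fetch(url, path) defaults to urllib.request.urlretrieve.
    """
    downloaded = []
    for link, file_path in zip(links, file_paths):
        folder = os.path.dirname(file_path)
        if folder:
            # Create the folder if needed; exist_ok=True means an
            # already existing folder is not an error.
            os.makedirs(folder, exist_ok=True)
        if os.path.exists(file_path):
            continue  # file is already there, move on to the next link
        fetch(link, file_path)
        downloaded.append(file_path)
    return downloaded
```

With the lists from the question, this would be called as download_missing(full_links, full_path_with_jpg_name). Because os.makedirs is called with exist_ok=True, the separate "folder exists" and "folder does not exist" branches of the outline collapse into one path, and no explicit break is needed since the for loop ends on its own after the last link.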

Hope this helps!
