
Efficient download of images from website with Python and selenium

Disclaimer: I do not have any background in web scraping/HTML/JavaScript/CSS and the like, but I know a bit of Python.

My end goal is to download the 4th image view of each of the 3515 car models on the ShapeNet website, together with its associated tag. (screenshot) For instance, the first of the 3515 pairs would be the image found in the collapse menu on the right of this picture (it can be loaded by clicking on the first item of the first page and then on Images): (screenshot) with the associated tag "sport utility", as can be seen in the first picture (first car, top left).

To do that, with the help of @DebanjanB, I wrote a snippet of code that clicks on the sport utility in the first picture, opens the iframe, clicks on Images and then downloads the 4th picture (link to my question). The full working code is this one:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()

browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
element.click()
#Page is also long to display iframe
element = wait.until(EC.element_to_be_clickable((By.ID, "model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")))
element.click()
#iframe slow to be displayed
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
#iframe = browser.find_elements_by_id('viewerIframe')
#browser.switch_to_frame(iframe[0])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
time.sleep(10)
element.click()



img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.bcf0b18a19bce6d91ad107790a9e2d51.3']/img[@class='enlarge']")
src = img.get_attribute('src')


os.system("wget %s --no-check-certificate"%src)
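Shelling out to wget works, but the same download can be done in-process with Python's standard library. A minimal sketch (the function name is mine; for HTTPS hosts with certificate problems you would additionally need an unverified SSL context, the analogue of `--no-check-certificate`):

```python
import os
import urllib.request

def download_image(src, dest):
    """Fetch src (any URL urllib understands) and write the bytes to dest."""
    with urllib.request.urlopen(src) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return os.path.getsize(dest)  # bytes written, handy for sanity checks
```

The `os.system` call above would then become `download_image(src, "out.png")`.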

There are several issues with this. First, I need to know by hand the id model_3dw_bcf0b18a19bce6d91ad107790a9e2d51 of each model, and I also need to extract the tag; both can be found here: (screenshot). So I would have to extract them by inspecting every image displayed. Then I need to switch pages (there are 22 pages), and maybe even scroll down on each page to be sure I have everything. Secondly, I had to use time.sleep twice, because the other method, based on waiting for the element to be clickable, does not seem to work as intended.

I have two questions. The first one is obvious: is this the right way of proceeding? Even if this could be quite fast without the time.sleep calls, it feels very much like what a human would do, and therefore must be terribly inefficient. Secondly, if it is indeed the way to go: how could I write a double for loop over pages and items to extract the tag and model id efficiently?

EDIT 1: It seems that:

l=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")

might be the first step towards completion
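Each div matched that way carries an id of the form model_3dw_&lt;hash&gt;, and the same hash reappears later in the image id image.3dw.&lt;hash&gt;.3, so a plain string split is enough to go from one to the other (helper names here are mine, for illustration):

```python
def model_hash(element_id):
    """'model_3dw_bcf0b18a...' -> 'bcf0b18a...' (everything after the 2nd '_')."""
    return element_id.split("_", 2)[2]

def image_id(element_id, view=3):
    """Build the id of the n-th image thumbnail for a given model div id."""
    return "image.3dw.%s.%d" % (model_hash(element_id), view)
```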

EDIT 2: Almost there, but the code is filled with time.sleep calls. I still need to get the tag name and to loop through the pages.

EDIT 3: Got the tag name; I still need to loop through the pages, and will then post a first draft of a solution.

So let me try to understand correctly what you mean and then see if I can help you solve the problem. I do not know Python, so excuse my syntax errors.

You want to click on each and every one of the 183533 cars, and then download the 4th image within the iframe that pops up. Correct?

Now, if this is the case, let's look at the first thing you need: the elements on the page with all the cars on it.

So to get all 160 cars of page 1, you are going to need:

elements = browser.find_elements_by_xpath("//img[@class='resultImg lazy']")

This is going to return 160 image elements for you, which is exactly the number of images displayed (on page 1).

Then you can say:

for el in elements:
    # here you place the code you need to download the 4th image,
    # so: switch to the iframe, click on the 4th image, etc.

Now, for the first page, you have made a loop which will download the 4th image for every vehicle on it.

This doesn't entirely solve your problem, as you have multiple pages. Thankfully, the page-navigation links, previous and next, are greyed out on the first and/or last page.

So you can just say:

browser.find_element_by_xpath("//a[@class='next']").click()

Just make sure you catch the case where the element is not clickable, as it will be greyed out on the last page.
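One way to make that loop terminate cleanly is to treat "clicking next failed" as the stop signal. This is sketched against a plain callable so it runs without a browser; with Selenium, `click_next` would wrap the `find_element(...).click()` call, and `not_found` would be `selenium.common.exceptions.NoSuchElementException` (or `ElementClickInterceptedException` for the greyed-out case):

```python
def visit_all_pages(process_page, click_next, not_found=(LookupError,)):
    """Call process_page(page_index) on every page, clicking 'next' until it fails."""
    page = 0
    while True:
        process_page(page)
        page += 1
        try:
            click_next()
        except not_found:
            return page  # total number of pages visited
```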

Rather than scraping the site, you might consider examining the URLs that the webpage uses to query the data, and then use the Python 'requests' package to simply make API requests directly to the server. I'm not a registered user on the site, so I can't provide you with any examples, but the paper that describes the shapenet.org site specifically mentions:

"To provide convenient access to all of the model and annotation data contained within ShapeNet, we construct an index over all the 3D models and their associated annotations using the Apache Solr framework. Each stored annotation for a given 3D model is contained within the index as a separate attribute that can be easily queried and filtered through a simple web-based UI. In addition, to make the dataset conveniently accessible to researchers, we provide a batched download capability."

This suggests that it might be easier to do what you want via API, as long as you can learn what their query language provides. A search in their QA/Forum may be productive too.
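For what it's worth, a Solr index is normally queried with plain HTTP GET parameters, so the request-building side is trivial. The endpoint path and field name below are guesses for illustration only, not a documented ShapeNet API; the real query interface would have to be found on their site:

```python
from urllib.parse import urlencode

def solr_query_url(base, query, rows=100, start=0):
    """Build a Solr-style select URL (endpoint and field names are hypothetical)."""
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    return base + "?" + urlencode(params)

# Hypothetical usage, once the real endpoint is known:
# url = solr_query_url("https://www.shapenet.org/solr/select", "wnlemmas:car")
# models = requests.get(url).json()
```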

I came up with this answer, which kind of works, but I don't know how to remove the several calls to time.sleep. I will not accept my own answer until someone finds something more elegant (it also fails when it arrives at the end of the last page):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()

browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
element.click()

tag_names=[]
page_count=0
while True:

    if page_count>0:
        browser.find_element_by_xpath("//a[@class='next']").click()
    time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id,'model_3dw')]")))  
    list_of_items_on_page=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
    list_of_ids=[e.get_attribute("id") for e in list_of_items_on_page]

    for i,item in enumerate(list_of_items_on_page):
    #Page is also long to display iframe
        current_id=list_of_ids[i]
        car_image = wait.until(EC.element_to_be_clickable((By.ID, current_id)))
        original_tag_name=car_image.find_element_by_xpath("./div[@style='text-align: center']").get_attribute("innerHTML")

        count=0
        tag_name=original_tag_name
        while tag_name in tag_names:            
            tag_name=original_tag_name+"_"+str(count)
            count+=1

        tag_names.append(tag_name)



        car_image.click()


        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))

        element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
        time.sleep(10)
        element.click()

        img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.%s.3']/img[@class='enlarge']"%current_id.split("_")[2])
        src = img.get_attribute('src')
        os.system("wget %s --no-check-certificate -O %s.png"%(src,tag_name))
        browser.switch_to.default_content()
        browser.find_element_by_css_selector(".btn-danger").click()
        time.sleep(1)

    page_count+=1

One can also import NoSuchElementException from selenium and use a while True loop with try/except to get rid of the arbitrary time.sleep calls.
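As an aside, the tag-deduplication while loop in my answer rescans the growing tag_names list on every collision. A small counter-based helper (my own sketch, producing the same name, name_0, name_1, ... sequence) avoids that:

```python
from collections import defaultdict

class TagNamer:
    """Return 'name' on first use, then 'name_0', 'name_1', ... on repeats."""
    def __init__(self):
        self.counts = defaultdict(int)

    def unique(self, name):
        n = self.counts[name]
        self.counts[name] += 1
        return name if n == 0 else "%s_%d" % (name, n - 1)
```

In the scraping loop, `tag_name = namer.unique(original_tag_name)` would then replace the inner while loop and the tag_names list.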
