Efficiently downloading images from a website with Python and Selenium

Question

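Disclaimer: I have no background in web scraping, HTML, JavaScript or CSS, but I know a bit of Python.

My end goal is to download the 4th image view of each of the 3515 car models on the ShapeNet website, together with its associated tag. For instance, the first of the 3515 pairs would be the picture found in the collapsible menu on the right of this picture: (it can be loaded by clicking on the first item of the first page and then on the images), with the associated tag "sport utility", as can be seen in the first picture (first car, top left).

To do that, with some help from @DebanjanB, I wrote a snippet that clicks on the sport utility in the first picture, opens the iframe, clicks on the images, and then downloads the 4th picture (this was the subject of my earlier question). The full working code is this: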

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()

browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
#Page is also long to display iframe
element = wait.until(EC.element_to_be_clickable((By.ID, "model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")))
linkElem = browser.find_element_by_id("model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")
linkElem.click()
#iframe slow to be displayed
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
#iframe = browser.find_elements_by_id('viewerIframe')
#browser.switch_to_frame(iframe[0])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
time.sleep(10)
linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
linkElem.click()



img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.bcf0b18a19bce6d91ad107790a9e2d51.3']/img[@class='enlarge']")
src = img.get_attribute('src')


os.system("wget %s --no-check-certificate"%src)

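There are several problems with this. First, I need to know by hand the identifier of each model (model_3dw_bcf0b18a19bce6d91ad107790a9e2d51), and I also need to extract the tag of each model, so I have to inspect every displayed image to get it. Then I need to switch pages (there are 22 pages) and perhaps even scroll down on each page to make sure I have everything. Second, I had to use time.sleep twice, because the other approach, waiting for the element to be clickable, does not seem to work as intended.

I have two questions. The first is the obvious one: is this the right way to proceed? I feel that even without the time.sleep calls this could be much faster; it looks a lot like what a human would do and must therefore be very inefficient. Second, if this is indeed the way to go: how could I write a double for loop over pages and items so that I can extract the tags and model ids efficiently?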

Edit 1: It seems that:

l=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")

could be a first step towards a solution.
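
Building on that selector, each element id can be turned into the id of the 4th view image without inspecting the page by hand. This is only a minimal sketch: it assumes every id follows the `model_3dw_<hash>` and `image.3dw.<hash>.3` patterns seen in the code above, and `fourth_view_image_id` is a hypothetical helper name, not something the site or Selenium provides.

```python
def fourth_view_image_id(model_element_id, view_index=3):
    """Map 'model_3dw_<hash>' to 'image.3dw.<hash>.<view_index>'.

    Assumes the id patterns seen on the page; view index 3 is the 4th view
    because the views appear to be numbered from 0.
    """
    prefix = "model_3dw_"
    if not model_element_id.startswith(prefix):
        raise ValueError("unexpected element id: %r" % model_element_id)
    model_hash = model_element_id[len(prefix):]
    return "image.3dw.%s.%d" % (model_hash, view_index)


# Ids as they would come from e.get_attribute("id") on the elements found above.
ids = ["model_3dw_bcf0b18a19bce6d91ad107790a9e2d51"]
image_ids = [fourth_view_image_id(i) for i in ids]
```

Raising on an unexpected id makes a changed page layout visible immediately instead of silently producing a wrong XPath.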

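Edit 2: Almost there, but the code is full of time.sleep calls. I still need to get the tag name and loop through the pages.

Edit 3: Got the tag name; I still need to loop through the pages. I will post a first draft of a solution.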

Answer 1

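Let me first try to understand properly what you mean, and then see whether I can help you solve the problem. I do not know Python, so please excuse my syntax errors.

You want to click on each of the 183533 cars and then download the 4th picture in the iframe that pops up. Correct?

If that is the case, let's look at the first element you need, which is the element on the page that holds all the cars.

To get all 160 cars of page 1, you will need: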

elements = browser.find_elements_by_xpath("//img[@class='resultImg lazy']")

This returns 160 image elements, which is exactly the number of images displayed (on page 1).

Then you can say:

for el in elements:
    # Place here the code you need to download the 4th image,
    # e.g. switch to the iframe, click on the 4th image, etc.
    pass

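Now, for the first page, you have a loop that downloads the 4th image for every car on it.

Because you have multiple pages, this does not fully solve your problem. Thankfully, the previous and next page navigation links are greyed out on the first and/or last page.

So you can say:

browser.find_element_by_xpath("//a[@class='next']").click()

Just make sure you catch the case where the element is not clickable, since it will be greyed out on the last page.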
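
The page-and-item loop described above could be sketched as follows. This is an illustration under assumptions, not a tested scraper: `crawl_pages` and its three callables are hypothetical names, and the commented Selenium part assumes that the greyed-out link either is missing or raises when clicked, which should be verified on the real page.

```python
def crawl_pages(get_items, handle_item, go_to_next_page, max_pages=1000):
    """Visit every page and call handle_item on each item.

    get_items():        returns the items of the current page
    handle_item(item):  downloads or processes one item
    go_to_next_page():  returns True if it moved on, False on the last page
    Returns the number of pages visited.
    """
    pages = 0
    while pages < max_pages:  # safety cap against an endless loop
        for item in get_items():
            handle_item(item)
        pages += 1
        if not go_to_next_page():
            break
    return pages


# With Selenium, go_to_next_page might look like this (both exception
# classes live in selenium.common.exceptions):
#
#     def go_to_next_page():
#         try:
#             browser.find_element_by_xpath("//a[@class='next']").click()
#             return True
#         except (NoSuchElementException, ElementNotInteractableException):
#             return False
```

If the greyed-out link turns out to remain clickable, the stop test would instead have to look at whatever attribute marks it as disabled.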

Answer 2

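Rather than scraping the site, you might consider examining the URLs the web page uses to query its data, and then using Python's "requests" package to make API requests directly to the server. I am not a registered user of the site, so I cannot give you any examples, but the paper describing the shapenet.org site specifically mentions:

"To provide convenient access to all of the model and annotation data contained within ShapeNet, we construct an index over all the 3D models and their associated annotations using the Apache Solr framework. Each stored annotation for a given 3D model is contained within the index as a separate attribute that can be easily queried and filtered through a simple web-based UI. In addition, to make the dataset conveniently accessible to researchers, we provide a batch download capability."

This suggests that it might be easier to do what you want through the API, provided you can work out what their query language offers. Searching their Q&A/forum may also be useful.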
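
As a rough illustration of what such a request could look like, the sketch below builds a query URL using the standard Solr select parameters (`q`, `start`, `rows`, `wt`, `fl`). Everything site-specific is an assumption: the base URL `https://example.org/solr/models/select`, the field names `category`, `id` and `name`, and the helper name `build_solr_query_url` are placeholders, and the real endpoint and schema would have to be read from the browser's network tab.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, start=0, rows=100, fields=None):
    """Build a Solr select URL that asks for a JSON response.

    base_url and the field names used by callers are placeholders;
    q, start, rows, wt and fl are standard Solr query parameters.
    """
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    if fields:
        params["fl"] = ",".join(fields)  # restrict the returned fields
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names, for illustration only.
url = build_solr_query_url(
    "https://example.org/solr/models/select",
    "category:car",
    rows=50,
    fields=["id", "name"],
)
```

Once the real endpoint is known, the response could be fetched with `requests.get(url).json()` and paged by increasing `start` in steps of `rows`.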

Answer 3

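I came up with this answer, which works, but I do not know how to remove the several time.sleep calls. I will not accept my own answer until someone finds something more elegant (also, it fails when it reaches the end of the last page):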

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()

browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))

linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()

tag_names=[]
page_count=0
while True:

    if page_count>0:
        browser.find_element_by_xpath("//a[@class='next']").click()
    time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id,'model_3dw')]")))  
    list_of_items_on_page=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
    list_of_ids=[e.get_attribute("id") for e in list_of_items_on_page]

    for i,item in enumerate(list_of_items_on_page):
    #Page is also long to display iframe
        current_id=list_of_ids[i]
        element = wait.until(EC.element_to_be_clickable((By.ID, current_id)))
        car_image=browser.find_element_by_id(current_id)
        original_tag_name=car_image.find_element_by_xpath("./div[@style='text-align: center']").get_attribute("innerHTML")

        count=0
        tag_name=original_tag_name
        while tag_name in tag_names:            
            tag_name=original_tag_name+"_"+str(count)
            count+=1

        tag_names.append(tag_name)



        car_image.click()


        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))

        element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
        time.sleep(10)
        linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
        linkElem.click()

        img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.%s.3']/img[@class='enlarge']"%current_id.split("_")[2])
        src = img.get_attribute('src')
        # Quote the URL and the file name: tags such as "sport utility" contain spaces
        os.system('wget "%s" --no-check-certificate -O "%s.png"' % (src, tag_name))
        browser.switch_to.default_content()
        browser.find_element_by_css_selector(".btn-danger").click()
        time.sleep(1)

    page_count+=1

One can also import NoSuchElementException from selenium and wrap the while True loop in a try/except to get rid of the arbitrary time.sleep calls.
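
Another option for the fixed sleeps is a custom wait condition, since WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when ready. The sketch below is an assumption-laden illustration: `image_src_available` is a hypothetical helper, and whether it makes the 10-second sleep unnecessary depends on what that sleep was actually waiting for, which I have not established.

```python
def image_src_available(xpath):
    """Build a condition usable with WebDriverWait(...).until(...).

    The condition returns the image's src once the attribute is non-empty,
    and False otherwise so that the wait keeps polling.
    """
    def condition(driver):
        # find_element(by, value) is the generic locator call; "xpath" is the
        # value of By.XPATH. WebDriverWait ignores NoSuchElementException by
        # default, so a missing element simply leads to another poll.
        img = driver.find_element("xpath", xpath)
        return img.get_attribute("src") or False
    return condition


# Intended use inside the loop, after linkElem.click():
#     model_hash = current_id.split("_")[2]
#     xpath = "//div[@id='image.3dw.%s.3']/img[@class='enlarge']" % model_hash
#     src = wait.until(image_src_available(xpath))
```

Because until returns whatever the condition returns, the src comes back directly and the separate find_element and get_attribute calls are no longer needed.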

Answer 1
1 2018-01-22 14:32:18

Answer 2
1 2018-01-22 17:37:58

Answer 3
0 2018-01-22 16:14:52
