Platform:
Python version: 3.7.3
Selenium Version: 3.141.0
OS: Win7
Issue:
I have a URL list in a text file, with each URL on a separate line. The URLs are download links. I want to iterate through all the URLs and download the file linked to each one into a specific folder.
The code I have tried uses a nested for/while loop. The first iteration goes through without any issue, but the second iteration gets stuck in one of the while loops.
There is obviously a better way of doing what I am trying to do. I am just a beginner in Python and learning the language as best as I can.
My Url List:
https://mega.nz/#!bOgBWKiB!AWs3JSksW0mpZ8Eob0-Qpr5ZAG0N1zhoFBFVstNJfXs
https://mega.nz/#!qPxGAAYJ!BX-hv7jgE4qvBs_uhHPVpsLRm1Yl4JkZ17nI1-U6hvk
https://mega.nz/#!GPoiHaaT!TAKT4sOhIiMUSFFSmlvPOidMcscXzHH_8HgK27LyTRM
Code that I have tried:
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from pathlib import Path
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

binary = FirefoxBinary('C:\\Program Files\\Mozilla Firefox\\firefox.exe')
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "H:\\downloads")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip")
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp,
                           executable_path=r'C:\Program Files\Python\Python37\Lib\site-packages\selenium\webdriver\firefox\geckodriver.exe')
driver.set_window_size(1600, 1050)

with open("H:\\downloads\\my_url_list.txt", "r") as f:
    for url in f:
        driver.get(url.strip())
        sleep(5)
        while True:
            # checks whether the element is available on the page; used 'while' instead of 'wait' as I couldn't figure out the wait time
            try:
                content = driver.find_element_by_css_selector('div.buttons-block:nth-child(1) > div:nth-child(2)')
                break
            except NoSuchElementException:
                continue
        # used 'execute_script' instead of 'click()' due to "scroll into view error"
        driver.execute_script("arguments[0].click();", content)
        sleep(5)
        while True:
            # checks whether the 'filename' element is available on the page; the page shows different elements depending on interaction
            if driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"):
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]").text
                break
            elif driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"):
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]").text
                break
            else:
                sleep(5)
        print(filename)
        dirname = 'H:\\downloads'
        suffix = '.zip'
        file_path = Path(dirname, filename).with_suffix(suffix)
        while True:
            # checks whether the file has downloaded into the folder
            if os.path.isfile(file_path):
                break
What's happening:
The first iteration goes through: the file linked to the URL gets downloaded into the H:\downloads folder and its filename gets printed.
In the second iteration, the file gets downloaded into the folder, but the filename doesn't get printed; the second while loop runs indefinitely.
No for-loop iteration happens after the second run, because the filename can't be retrieved in the second iteration and the loop never terminates.
Second while loop in the code above:

while True:
    # checks whether the 'filename' element is available on the page; the page shows different elements depending on interaction
    if driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"):
        filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]").text
        break
    elif driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"):
        filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]").text
        break
    else:
        sleep(5)
Attached are images of the two filename elements (the reason two different xpaths were used for the filename).
What you are searching for is an explicit wait. I advise you to visit this page of the Selenium-Python documentation. I quote from the page:
An explicit wait is a code you define to wait for a certain condition to occur before proceeding further in the code. The extreme case of this is time.sleep(), which sets the condition to an exact time period to wait. There are some convenience methods provided that help you write code that will wait only as long as required. WebDriverWait in combination with ExpectedCondition is one way this can be accomplished.
If you want to know more about ExpectedCondition, you can visit this link in the documentation.
I suggest the following code for your case, using a lambda function because you are waiting for at least one of two elements.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    xpath1 = "/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"
    xpath2 = "/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"
    timeLimit = 15  # seconds; you really need to set a timeout
    elements = WebDriverWait(driver, timeLimit).until(
        lambda driver: driver.find_elements(By.XPATH, xpath1) or driver.find_elements(By.XPATH, xpath2)
    )
finally:
    pass
This waits up to 15 seconds before throwing a TimeoutException, unless it finds one of the elements you are waiting for by xpath first. Note that find_elements returns a list (rather than raising NoSuchElementException), so the wait succeeds as soon as either list is non-empty. By default, WebDriverWait calls the ExpectedCondition every 500 milliseconds until it returns successfully, so you don't need to hand-roll the polling loops as you were trying to do.
To handle the TimeoutException you can, for example, refresh the page and retry.
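The polling behaviour described above can be sketched in plain Python. This is a toy, hypothetical `wait_until` helper to illustrate the idea, not Selenium's actual implementation:

```python
import time

def wait_until(condition, timeout=15.0, poll_frequency=0.5):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse; mirrors in spirit what WebDriverWait.until does."""
    end = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # truthy result ends the wait
        if time.monotonic() >= end:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_frequency)

# Toy usage: the condition becomes truthy on the fourth poll.
calls = []
result = wait_until(lambda: len(calls) >= 3 or calls.append(None),
                    timeout=5, poll_frequency=0.01)
```

In the Selenium version, `condition` is the lambda above (called with the driver), `timeout` is the WebDriverWait limit, and `poll_frequency` defaults to 0.5 seconds.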
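The refresh-and-retry idea can be sketched generically. Here `TimeoutError` stands in for `selenium.common.exceptions.TimeoutException`; with Selenium, `action` would be the `WebDriverWait(...).until(...)` call and `recover` would be `driver.refresh`. The helper and its names are illustrative, not a Selenium API:

```python
def retry_with_recovery(action, recover, attempts=3):
    """Run `action`; on timeout, run `recover` (e.g. a page refresh)
    and try again, up to `attempts` times in total."""
    for attempt in range(attempts):
        try:
            return action()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            recover()

# Toy usage: the action times out twice, then succeeds.
state = {"calls": 0}
recoveries = []

def flaky_action():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("element not found in time")
    return "element"

result = retry_with_recovery(flaky_action, lambda: recoveries.append("refreshed"))
```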