I'm trying to make a script to download specific PDF files from BlackRock websites (ishares.com or blackrock.com), but the click() function usually doesn't work. Sometimes it does though - once every 3-5 executions or so, it manages to download one file.
(When I used a similar script to download all PDFs from those websites, it also worked only once every few executions, and whenever it did work it always downloaded the same files and skipped the rest.)
So, let's say I attempt to download KIID/KID PDF files from those sites:
https://www.ishares.com/uk/individual/en/products/251857/ishares-msci-emerging-markets-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true
https://www.ishares.com/ch/individual/en/products/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true
https://www.blackrock.com/uk/individual/products/251565/ishares-euro-corporate-bond-large-cap-ucits-etf?switchLocale=y&siteEntryPassthrough=true
with this code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyvirtualdisplay import Display
import time

def blackrock_getter(url):
    with Display():
        mime_types = "application/pdf,application/vnd.adobe.xfdf,application/vnd.fdf,application/x-pdf,application/vnd.adobe.xdp+xml"
        profile = webdriver.FirefoxProfile()
        profile.set_preference('browser.download.folderList', 2)
        profile.set_preference('browser.download.manager.showWhenStarting', False)
        profile.set_preference('browser.download.dir', '/home/user/kiid_temp')
        profile.set_preference('browser.helperApps.neverAsk.saveToDisk', mime_types)
        profile.set_preference("plugin.disable_full_page_plugin_for_types", mime_types)
        profile.set_preference('pdfjs.disabled', True)
        driver = webdriver.Firefox(firefox_profile=profile)
        driver.get(url)
        try:
            element = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH, ("//header[@class='main-header']//a[@class='icon-pdf'][1]"))))
            driver.execute_script("arguments[0].click();", element)
        finally:
            driver.quit()
        time.sleep(3)  # very precise mechanism to wait until the download is complete

def main():
    urls_file = open('urls_list.txt', 'r')  # the URLs I pasted above
    for url in urls_file.readlines():
        if url[-1:] == "\n":
            url = url[:-1]
        if url[0:4] == "http":
            filename = url.split('?')[0]
            filename = filename.split('/')[-1]
            if 'blackrock.com/' in url or 'ishares.com/' in url:
                print(f"Processing {filename}...")
                blackrock_getter(url)

main()
The result is (every once in a while) one file: kiid-ishares-msci-emerging-markets-ucits-etf-dist-gb-ie00b0m63177-en.pdf.
Any ideas how to fix this?
You could try using the pyautogui module, but you wouldn't be able to use your computer while your program is running.
It seems the script is finishing before the file download completes; in other words, the download is not completing within 3 seconds. Here is a method that waits until the PDF download completes (it relies on Chrome's chrome://downloads page).
# method to get the downloaded file name
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time()+waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
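
Since the code in the question drives Firefox, where chrome://downloads is not available, and calls driver.quit() right after the click, a browser-independent alternative may be to poll the download directory until the browser's temporary file is gone. The following is only a minimal sketch under stated assumptions: the name wait_for_download and its parameters are hypothetical, and it assumes in-progress downloads appear as ".part" (Firefox) or ".crdownload" (Chrome) files in the target folder.

```python
import os
import time

# Suffixes browsers give to in-progress downloads (assumption:
# ".part" for Firefox, ".crdownload" for Chrome).
TEMP_SUFFIXES = (".part", ".crdownload")


def wait_for_download(download_dir, known_files, timeout=30.0, poll_interval=0.5):
    """Wait for a new, finished file to appear in download_dir.

    known_files is the set of file names present before the click.
    Returns the path of the new file, or None if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = set(os.listdir(download_dir))
        # Any temporary file means a download is still being written.
        in_progress = [n for n in names if n.endswith(TEMP_SUFFIXES)]
        # New files that are not temporary files themselves.
        new_files = sorted(
            n for n in names - set(known_files) if not n.endswith(TEMP_SUFFIXES)
        )
        if new_files and not in_progress:
            return os.path.join(download_dir, new_files[0])
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

With the question's code, this would presumably be used by taking before = set(os.listdir('/home/user/kiid_temp')) ahead of the click, then calling wait_for_download('/home/user/kiid_temp', before) after the click and before driver.quit(), so that the browser is not closed while the file is still being written.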