Python: Unable to download with selenium in webpage

My goal is to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa , which is a link on this page: https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa . I then want to save it to the directory "/home/vinvin/shKLSE/" (I am using PythonAnywhere), unzip it, and extract the CSV file into that directory.

The code runs to the end with no error, but nothing is downloaded. The zip file downloads automatically when I click https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa manually.

My code is below, with a working username and password. The real credentials are included so that the problem is easier to reproduce.

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
    zip_ref.extractall(/home/vinvin/sh/KLSE)
    zip_ref.close()
    os.remove(zip_ref)

HTML snippet:

<li><a href="/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa">All Historical Data</a> <span>About 220 MB</span></li>

Note that &amp; is shown when I copy the snippet. The link is not visible in view source, so I guess it is generated by JavaScript.
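On the &amp; point: in HTML source, an ampersand inside an attribute value is written as the entity &amp;, and the browser decodes it to a plain & when it builds the DOM. An XPath test on @href is therefore matched against the decoded value, which is what the script already uses. A minimal sketch with Python 3's standard html module to show the decoding (the variable names are only illustrative):

```python
import html

# The href exactly as it appears in the copied HTML snippet.
raw_href = "/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa"

# html.unescape mimics the entity decoding a browser does when parsing HTML.
dom_href = html.unescape(raw_href)

# XPath attribute tests compare against the decoded value.
xpath = "//*[@href='{0}']".format(dom_href)
print(xpath)
```

So the entity in the copied snippet is probably not the reason the click does nothing.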

Observations:

  1. The directory /home/vinvin/shKLSE is not created, even though the code runs with no error.

  2. I tried downloading a much smaller zip file, which should finish in about a second, but it still was not downloaded after a 30 s wait: dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()


I don't see any major problem in your code as such, but here are a few recommendations, together with a revised version of your script:

  1. This code works perfectly outside market hours. During market hours a lot of JavaScript and Ajax calls are in play, and handling those is beyond the scope of this question.
  2. Check whether the intended download directory exists first and, if it does not, create it. The code for this is written Windows-style and works perfectly on Windows.
  3. After clicking "Login", wait for the HTML DOM to render properly.
  4. To see the download through without prompts, you need to set some more preferences in the FirefoxProfile, as in my code below.
  5. Always consider maximizing the browser window with browser.maximize_window().
  6. Once the download starts, wait long enough for the file to download completely.
  7. If you call browser.quit() at the end, you don't need browser.close().
  8. Consider replacing all the time.sleep() calls with ImplicitlyWait, ExplicitWait, or FluentWait.
  9. Here is your own code with some simple tweaks:

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os

    display = Display(visible=0, size=(800, 600))
    display.start()

    newpath = 'C:\\\\home\\\\vivvin\\\\shKLSE'
    if not os.path.exists(newpath):
        os.makedirs(newpath)

    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.dir",newpath);
    profile.set_preference("browser.download.folderList",2);
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip");
    profile.set_preference("browser.download.manager.showWhenStarting",False);
    profile.set_preference("browser.helperApps.neverAsk.openFile","application/zip");
    profile.set_preference("browser.helperApps.alwaysAsk.force", False);
    profile.set_preference("browser.download.manager.useWindow", False);
    profile.set_preference("browser.download.manager.focusWhenStarting", False);
    profile.set_preference("browser.helperApps.neverAsk.openFile", "");
    profile.set_preference("browser.download.manager.alertOnEXEOpen", False);
    profile.set_preference("browser.download.manager.showAlertOnComplete", False);
    profile.set_preference("browser.download.manager.closeWhenDone", True);
    profile.set_preference("pdfjs.disabled", True);

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.maximize_window()
    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    time.sleep(10)
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(900)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
    zip_ref.extractall(/home/vinvin/sh/KLSE)
    zip_ref.close()
    os.remove(zip_ref)
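Recommendation 2 can also be written without a Windows-style path, which matters on a Linux host such as PythonAnywhere. A minimal sketch, assuming Python 3 (for exist_ok); ensure_download_dir is a hypothetical helper name, and the temporary directory in the demo only stands in for /home/vinvin/shKLSE:

```python
import os
import tempfile


def ensure_download_dir(path):
    """Create `path` (and any missing parents) if needed; return its absolute path."""
    path = os.path.abspath(path)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


# Demo in a throwaway location; on PythonAnywhere you would pass
# "/home/vinvin/shKLSE" instead.
base = tempfile.mkdtemp()
download_dir = ensure_download_dir(os.path.join(base, "shKLSE"))
print(os.path.isdir(download_dir))
```

The returned path can then be passed to the browser.download.dir preference.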

Let me know if this answers your question.

I rewrote your script, with comments explaining why I made each change. I think your main problem might have been a bad MIME type; however, your script also had a lot of systemic issues that would have made it unreliable at best. This rewrite uses explicit waits, which completely removes the need for time.sleep(), allowing it to run as fast as possible while also eliminating errors that arise from network congestion.

You will need to run the following to make sure all modules are installed:

pip install requests explicit selenium retry pyvirtualdisplay

The script:

#!/usr/bin/python

from __future__ import print_function  # Makes your code portable

import os
import glob
import zipfile
from contextlib import contextmanager

import requests
from retry import retry
from explicit import waiter, XPATH, ID
from selenium import webdriver
from pyvirtualdisplay import Display
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait

DOWNLOAD_DIR = "/tmp/shKLSE/"


def build_profile():
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
    # I think your `/zip` mime type was incorrect. This works for me
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/vnd.ms-excel,application/zip')

    return profile


# Retry is an elegant way to retry the browser creation
# Though you should narrow the scope to whatever the actual exception is you are
# retrying on
@retry(Exception, tries=5, delay=3)
@contextmanager  # This turns get_browser into a context manager
def get_browser():
    # Use a context manager with Display, so it will be closed even if an
    # exception is thrown
    profile = build_profile()
    with Display(visible=0, size=(800, 600)):
        browser = webdriver.Firefox(profile)
        print("firefox")
        try:
            yield browser
        finally:
            # Let a try/finally block manage closing the browser, even if an
            # exception is called
            browser.quit()


def main():
    print("hello from python 2")
    with get_browser() as browser:
        browser.get("https://www.shareinvestor.com/my")

        # Click the login button
        # waiter is a helper function that makes it easy to use explicit waits
        # with it you dont need to use time.sleep() calls at all
        login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
        waiter.find_element(browser, login_xpath, XPATH).click()
        print(browser.current_url)

        # Log in
        username = "bkcollection"
        username_id = "sic_login_header_username"
        password = "123456"
        password_id = "sic_login_header_password"
        waiter.find_write(browser, username_id, username, by=ID)
        waiter.find_write(browser, password_id, password, by=ID, send_enter=True)

        # Wait for login process to finish by locating an element only found
        # after logging in, like the Logged In Nav
        nav_id = 'sic_loggedInNav'
        waiter.find_element(browser, nav_id, ID)

        print("log in done")

        # Load the target page
        target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                      "type=price_download_all_stocks_bursa")
        browser.get(target_url)
        print(browser.current_url)

        # Click download button
        all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                          "type=history_all&market=bursa']")
        waiter.find_element(browser, all_data_xpath, XPATH).click()

        # This is a bit challenging: You need to wait until the download is complete
        # This file is 220 MB, it takes a while to complete. This method waits until
        # there is at least one file in the dir, then waits until there are no
        # filenames that end in `.part`
        # Note that this is problematic if there is already a file in the target dir. I
        # suggest looking into using the tempfile module to create a unique, temporary
        # directory for downloading every time you run your script
        print("Waiting for download to complete")
        at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
        WebDriverWait(glob.glob, 300).until(at_least_1)

        no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
        WebDriverWait(glob.glob, 300).until(no_parts)

        print("Download Done")

        # Now do whatever it is you need to do with the zip file
        # zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
        # zip_ref.extractall(DOWNLOAD_DIR)
        # zip_ref.close()
        # os.remove(zip_ref)

        print("Done!")


if __name__ == "__main__":
    main()

Full disclosure: I maintain the explicit module. It is designed to make explicit waits much easier to use in exactly this kind of situation, where a website slowly loads dynamic content in response to user interactions. You could replace all of the waiter.XXX calls above with direct explicit waits.
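The download wait itself does not need Selenium. A rough sketch in plain Python 3, assuming Firefox keeps an in-progress download as a `.part` file next to the final name (which the script above also relies on); wait_for_download and its parameters are hypothetical names, not part of explicit or Selenium:

```python
import glob
import os
import tempfile
import time


def wait_for_download(directory, pattern="*.zip", timeout=300, poll=0.5):
    """Block until a file matching `pattern` exists and no `.part` files remain.

    Returns the path of the first matching file (sorted by name), or raises
    RuntimeError once `timeout` seconds have passed.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        finished = sorted(glob.glob(os.path.join(directory, pattern)))
        partial = glob.glob(os.path.join(directory, "*.part"))
        if finished and not partial:
            return finished[0]
        time.sleep(poll)
    raise RuntimeError("download did not finish within {0}s".format(timeout))


# A fresh directory per run avoids mistaking an old file for the new download.
download_dir = tempfile.mkdtemp(prefix="shKLSE-")
```

In the script, download_dir would be the value given to browser.download.dir, and wait_for_download(download_dir) would be called right after the click.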

The cause was that the webpage loads slowly. I added a 20-second wait after opening the page link:

login_attempt.submit()
browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
print browser.current_url
time.sleep(20)
dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()

It returns no error.它没有返回错误。

Additionally, /zip is an incorrect MIME type. Change it to profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')
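As a quick sanity check on the MIME type, Python's standard mimetypes module maps a file name to its registered type. This is only a local lookup based on the extension, and the result can vary by platform; the value Firefox compares against is the Content-Type the server actually sends, which may differ:

```python
import mimetypes

# guess_type returns a (type, encoding) pair; only the type matters here.
mime, _encoding = mimetypes.guess_type("price_download_zip_file.zip")
print(mime)

# A valid MIME type always has the form "type/subtype"; "/zip" lacks the type.
top_level, _, subtype = (mime or "").partition("/")
print(bool(top_level), bool(subtype))
```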

The final correction:

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    # application/zip not /zip
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile('/home/vinvin/shKLSE/file.zip', 'r')
    zip_ref.extractall('/home/vinvin/shKLSE')
    zip_ref.close()
    # remove with correct path
    os.remove('/home/vinvin/shKLSE/file.zip')
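The hard-coded file.zip may not match the name the server gives the download. A minimal sketch, assuming Python 3, that finds archives by pattern instead, extracts them, and deletes them afterwards (extract_and_remove is a hypothetical helper; the demo builds its own small archive in a temporary directory so it can run anywhere):

```python
import glob
import os
import tempfile
import zipfile


def extract_and_remove(directory):
    """Extract every *.zip in `directory` into it, then delete the archives.

    Returns the sorted list of member names that were extracted.
    """
    extracted = []
    for zip_path in glob.glob(os.path.join(directory, "*.zip")):
        with zipfile.ZipFile(zip_path, "r") as archive:  # closed automatically
            archive.extractall(directory)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # os.remove takes a path, not a ZipFile object
    return sorted(extracted)


# Demo: create a tiny archive, then extract it.
demo_dir = tempfile.mkdtemp()
with zipfile.ZipFile(os.path.join(demo_dir, "demo.zip"), "w") as archive:
    archive.writestr("prices.csv", "date,close\n")
names = extract_and_remove(demo_dir)
print(names)
```

In the real script the call would be extract_and_remove('/home/vinvin/shKLSE'), made only after the download has finished.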

Take it outside the scope of Selenium. Change the preference settings so that clicking the link (first check that the link is valid) brings up a pop-up asking to save the file, then use Sikuli ( http://www.sikuli.org/ ) to click on the pop-up. MIME types do not always work, and there is no black-and-white answer as to why.

I haven't tried the site you mentioned, but the following code works perfectly and downloads the ZIP. If you are not able to download the zip, the MIME type could be different. You can use Chrome's network inspector to check the MIME type of the file you are trying to download.


from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

browser = webdriver.Firefox(profile)
browser.get("http://www.colorado.edu/conflict/peace/download/peace.zip")
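If the type is copied from a response's Content-Type header, it may carry parameters such as "; charset=..." that are not part of the MIME type itself. A small sketch (both function names are hypothetical, and the header values in the demo are made-up examples) that strips those parameters and builds the comma-separated string that browser.helperApps.neverAsk.saveToDisk takes:

```python
def mime_from_header(content_type):
    """Return the bare, lower-cased MIME type from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def never_ask_value(header_values):
    """Join the unique MIME types, in first-seen order, with commas."""
    seen = []
    for value in header_values:
        mime = mime_from_header(value)
        if mime and mime not in seen:
            seen.append(mime)
    return ",".join(seen)


# Example header values, as they might be copied from a network inspector.
pref = never_ask_value([
    "application/zip",
    "Application/Zip; charset=binary",
    "application/x-zip-compressed",
])
print(pref)
```

The result can be passed directly as the second argument of profile.set_preference('browser.helperApps.neverAsk.saveToDisk', pref).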
