How to download a file using Python, Selenium and PhantomJS

Question

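Here is my situation: I have to log in to a website and download a CSV from there, headless, from a Linux server. The page uses JS and does not work without it.

After some research I went with Selenium and PhantomJS. Logging in, setting the parameters for the CSV and finding the download button with Selenium / PhantomJS / Py3 was no problem, and actually surprisingly pleasant.

But clicking the download button did nothing. After some more research I found out that PhantomJS does not seem to support download dialogs or downloads, although the feature is on the list of upcoming features.

So, once I found out that the download button just calls a REST API URL, I thought I would use a workaround with urllib. The problem is that the URL only works while you are logged in to the site. My first attempt therefore failed, returning: b'{"success":false,"session":"expired"}' which makes sense, since I would expect Selenium and urllib to use different sessions. So I tried to reuse the cookies from the Selenium driver as headers in urllib: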

...
url = 'http://www.foo.com/api/index'
data = urllib.parse.urlencode({
        'foopara': 'cadbrabar',
    }).encode('utf-8')
headers = {}
for cookie in driver.get_cookies():
    headers[cookie['name']] = cookie['value']
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    page = response.read()
driver.close()

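Unfortunately this yields the same expired-session result. Am I doing something wrong, is there another way around this, or am I at a dead end? Thanks in advance.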
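
One possible reason the attempt above still returns an expired session, offered as an assumption rather than a confirmed diagnosis: the loop sends every cookie as its own HTTP header, whereas servers generally read cookies from a single Cookie header. The following is a minimal sketch of building that header from the list of dicts that driver.get_cookies() returns. The helper names build_cookie_header and build_request are hypothetical, the cookie values are made up, and the URL and form field are the placeholders from the question.

```python
import urllib.parse
import urllib.request


def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


def build_request(url, form_fields, cookies):
    """Build a POST request that carries the browser's cookies (hypothetical helper)."""
    data = urllib.parse.urlencode(form_fields).encode('utf-8')
    headers = {'Cookie': build_cookie_header(cookies)}
    return urllib.request.Request(url, data, headers)


# Stand-in for driver.get_cookies(); the names and values are made up.
example_cookies = [
    {'name': 'sessionid', 'value': 'abc123'},
    {'name': 'csrftoken', 'value': 'xyz'},
]
req = build_request('http://www.foo.com/api/index',
                    {'foopara': 'cadbrabar'}, example_cookies)
print(req.get_header('Cookie'))
```

With a live session, driver.get_cookies() would take the place of example_cookies and the request would be opened with urllib.request.urlopen(req). The server may also check other things, such as the User-Agent, so this alone is not guaranteed to be enough.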

Answer 1

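I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but chromedriver running inside a virtual framebuffer. The result is the same and it gets the job done.

What you need: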

pip install selenium pyvirtualdisplay

apt-get install xvfb

Download ChromeDriver

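I use Py3.5 and a test file from ovh.net that is linked with an <a> tag instead of a button. The script waits for the <a> to be present on the page and then clicks it. If you don't wait for the element and you are on an asynchronous site, the element you try to click might not exist yet. The download location is a folder relative to the script's location. The script checks that directory once per second to see whether the file has been downloaded. If I am not mistaken, the file should be a .part file while the download is in progress (Chrome itself names unfinished downloads .crdownload), and as soon as it becomes the .dat specified in filename the script finishes. If you close the virtual framebuffer and the driver before that, the download will not complete. The complete script looks like this: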

#!/usr/bin/env python3
# coding: utf-8

import glob
import os
import sys
import time

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def main(argv):
    url = 'http://ovh.net/files'
    dl_dir = 'downloads'
    filename = '1Mio.dat'

    # Start the virtual framebuffer (Xvfb) so Chrome has a display to render to.
    display = Display(visible=0, size=(800, 600))
    display.start()

    # Tell Chrome to save downloads into a folder relative to the working directory.
    chrome_options = webdriver.ChromeOptions()
    dl_location = os.path.join(os.getcwd(), dl_dir)
    prefs = {"download.default_directory": dl_location}
    chrome_options.add_experimental_option("prefs", prefs)

    # Note: this is the constructor of the Selenium version used at the time;
    # Selenium 4 takes a Service object and options= instead.
    chromedriver = "./chromedriver"
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chrome_options)

    driver.set_window_size(800, 600)
    driver.get(url)

    # Wait until the link exists; until() returns the located element.
    xpath = '//a[@href="' + filename + '"]'
    hyperlink = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, xpath)))
    hyperlink.click()

    # Poll once per second until the finished file shows up in the download folder.
    while not glob.glob(os.path.join(dl_location, filename)):
        time.sleep(1)

    # Only shut down the browser and the display after the download has completed.
    driver.close()
    display.stop()


if __name__ == '__main__':
    main(sys.argv)
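
The polling loop in the script runs forever if the download never finishes. As a hedged variation, not part of the original answer, the wait could be bounded with a timeout. The helper name wait_for_download and its default values are assumptions chosen for illustration.

```python
import os
import time


def wait_for_download(directory, filename, timeout=60.0, poll_interval=1.0):
    """Return the path of the finished file, or raise TimeoutError.

    Only the final file name is checked, so a temporary file that the
    browser writes while the download is in progress does not count.
    """
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(target):
            return target
        if time.monotonic() >= deadline:
            raise TimeoutError('download did not finish: ' + target)
        time.sleep(poll_interval)
```

In the script above, the while loop could then be replaced by a call such as wait_for_download(dl_location, filename, timeout=120), placed before driver.close() and display.stop().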

I hope this helps someone in the future.

Answer 2

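If the button you want to download from exposes a link to the file, you can test the download with Python code, because PhantomJS does not support downloads by itself. If your download button does not provide a file link, you cannot test it this way.

To test with the file link and Python (to assert that the file exists), you can follow the topic below. As I am a C# developer and tester, I don't know the best way to write the Python code without errors, but I'm sure you can:

Basic http file downloading and saving to disk in python?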
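
A minimal sketch of what such a direct download could look like with the standard library only. The function name download_file is hypothetical, and the sketch assumes the file URL is reachable without a login, which is not the case for the site in the question.

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream the resource at url into dest_path and return dest_path."""
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as out_file:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out_file)
    return dest_path
```

A test could then call download_file with the link taken from the button's href and assert that the destination file exists and is not empty.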

Answer 3

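I recently used Selenium with ChromeDriver to download files from the web. This works because Chrome automatically downloads the file and stores it in the Downloads folder for you. It was easier than using PhantomJS.

I recommend looking into using ChromeDriver with Selenium and going that route: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

EDIT: As pointed out below, I neglected to point to how to set up ChromeDriver to run in headless mode. Here is more information: http://www.chrisle.me/2013/08/running-headless-selenium-with-chrome/

Or: https://gist.github.com/chuckbutler/8030755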

Answer 4

You can try something like this:

from requests.auth import HTTPBasicAuth
import requests

url = "http://some_site/files?file=file.csv"  # URL used to download the file
# GET request that fetches the file content, authenticating with your site credentials
r = requests.get(url, auth=HTTPBasicAuth("your_username", "your_password"))
r.raise_for_status()  # stop here if the server answered with an error status
# Save the response content to a file on your computer;
# r.content is bytes, so the file must be opened in binary mode
with open("path/to/folder/to/save/file/filename.csv", 'wb') as my_file:
    my_file.write(r.content)

Answer 1
3 accepted 2016-07-29 09:39:12

Answer 2
1 2016-07-27 14:39:43

Answer 3
1 2016-07-27 14:58:16

Answer 4
0 2016-07-27 14:51:09
