How to download a file using Python, Selenium and PhantomJS

Question

Here is my situation: I have to log in to a website and download a CSV from it, headless, on a Linux server. The page uses JS and does not work without it.

After some research I went with Selenium and PhantomJS. Logging in, setting the parameters for the CSV and finding the download button with Selenium/PhantomJS/Py3 was no problem, and actually surprisingly enjoyable.

But clicking the download button did not do anything. After some research I found out that PhantomJS does not seem to support download dialogs or downloads, although this is on its list of upcoming features.

So I thought I would use a workaround with urllib, after I found out that the download button just calls a REST API URL. The problem is that it only works if you're logged in to the site. So the first attempt failed, as it returned b'{"success":false,"session":"expired"}', which makes sense, as I expect Selenium and urllib to use different sessions. So I thought I would use the headers from Selenium's driver in urllib, trying this:

...
url = 'http://www.foo.com/api/index'
data = urllib.parse.urlencode({
        'foopara': 'cadbrabar',
    }).encode('utf-8')
headers = {}
for cookie in driver.get_cookies():
    headers[cookie['name']] = cookie['value']
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    page = response.read()
driver.close()

Unfortunately this yielded the same result, an expired session. Am I doing something wrong? Is there a way around this, are there other suggestions, or am I at a dead end? Thanks in advance.
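One possible cause of the expired session is that the snippet above sends each cookie as its own HTTP header, whereas HTTP carries cookies in a single Cookie header made of name=value pairs joined by "; ". The following is a minimal sketch of that idea, assuming the cookies come from driver.get_cookies() as a list of dicts with 'name' and 'value' keys. The helper names and the example cookie names are hypothetical, and the site may check other things as well (for example the User-Agent), so this is not guaranteed to be sufficient.

```python
import urllib.parse
import urllib.request


def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def build_request(url, params, cookies):
    """Build a POST request that carries the browser's cookies (hypothetical helper)."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    headers = {"Cookie": build_cookie_header(cookies)}
    return urllib.request.Request(url, data, headers)


# Stand-in for driver.get_cookies(); these cookie names are made up.
example_cookies = [
    {"name": "sessionid", "value": "abc123"},
    {"name": "lang", "value": "en"},
]
req = build_request(
    "http://www.foo.com/api/index",
    {"foopara": "cadbrabar"},
    example_cookies,
)
print(req.get_header("Cookie"))
```

In the real script, example_cookies would be replaced by driver.get_cookies(), and the request would be passed to urllib.request.urlopen() before the driver is closed.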

Answer 1

I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but chromedriver, which runs headlessly with a virtual framebuffer. Same result, and it gets the job done.

What you need is:

pip install selenium pyvirtualdisplay

apt-get install xvfb

Download ChromeDriver

I use Py3.5 and a test file from ovh.net with an <a> tag instead of a button. The script waits for the <a> element to be present on the page, then clicks it. If you don't wait for the element and the site loads its content asynchronously, the element you try to click might not be there yet. The download location is a folder relative to the script's location (more precisely, the current working directory, since the script uses os.getcwd()). The script checks that directory once per second to see whether the file has been downloaded. If I am not wrong, files should be .part during download, and as soon as the file becomes the .dat specified in filename, the script finishes. If you close the virtual framebuffer and the driver before that, the download will not complete. The complete script looks like this:

#!/usr/bin/python
# coding: utf-8

import os
import sys
import time
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import glob


def main(argv):
    url = 'http://ovh.net/files'
    dl_dir = 'downloads'
    filename = '1Mio.dat'

    display = Display(visible=0, size=(800, 600))
    display.start()

    chrome_options = webdriver.ChromeOptions()
    dl_location = os.path.join(os.getcwd(), dl_dir)

    prefs = {"download.default_directory": dl_location}
    chrome_options.add_experimental_option("prefs", prefs)
    chromedriver = "./chromedriver"
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chrome_options)

    driver.set_window_size(800, 600)
    driver.get(url)
    # Wait until the link is present; until() returns the located element.
    xpath = '//a[@href="' + filename + '"]'
    hyperlink = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, xpath)))
    hyperlink.click()

    while not(glob.glob(os.path.join(dl_location, filename))):
        time.sleep(1)

    driver.close()
    display.stop()

if __name__ == '__main__':
    main(sys.argv)
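The polling loop in the script above waits forever if the download never starts. A minimal sketch of a bounded wait is shown here, using only the standard library. The function name wait_for_download and its parameters are hypothetical. It also assumes that Chrome marks in-progress downloads with a .crdownload suffix, so the file is treated as complete only when the target exists and no such partial file remains in the directory.

```python
import os
import time


def wait_for_download(directory, filename, timeout=60.0, poll=1.0):
    """Return the finished file's path, or raise TimeoutError after `timeout` seconds."""
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(target):
            # A leftover *.crdownload file means the browser is still writing.
            still_writing = any(
                name.endswith(".crdownload") for name in os.listdir(directory)
            )
            if not still_writing:
                return target
        time.sleep(poll)
    raise TimeoutError(
        "download of %s did not finish within %s s" % (filename, timeout)
    )
```

In the script, the while loop could be replaced by a call such as wait_for_download(dl_location, filename), placed before driver.close() and display.stop() so that the browser stays alive until the file is complete.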

I hope this helps someone in the future.

Answer 2

If the button that you want to click exposes the file's link, you can test the download using Python code, because PhantomJS does not support downloads by itself. So, if your download button does not provide the file link, you're not able to test it.

To test using the file link and Python (to assert that the file exists), you can follow the topic below. As I'm a C# developer and tester, I don't know the best way to write the code in Python without errors, but I'm sure you can:

Basic http file downloading and saving to disk in python?
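For readers who want a starting point, here is a minimal sketch of downloading a file from a direct link and saving it to disk with the standard library only. The function name download_file is hypothetical, and the sketch assumes the link is reachable without logging in; for a site that needs a session, the request would also have to carry the session cookies.

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream the response body at `url` into `dest_path` and return the path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so large files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

A test can then assert that the file exists after the call, for example with os.path.isfile(dest_path).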

Answer 3

I recently used Selenium with ChromeDriver to download a file from the web. This works because Chrome automatically downloads the file and stores it in the Downloads folder for you. This was easier than using PhantomJS.

I recommend looking into using ChromeDriver with Selenium and going that route: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

EDIT - As pointed out below, I neglected to explain how to set up ChromeDriver to run in headless mode. Here's more info: http://www.chrisle.me/2013/08/running-headless-selenium-with-chrome/

Or: https://gist.github.com/chuckbutler/8030755

Answer 4

You can try something like:

from requests.auth import HTTPBasicAuth
import requests

url = "http://some_site/files?file=file.csv"  # URL used to download the file
# GET request for the file content, using your website credentials
r = requests.get(url, auth=HTTPBasicAuth("your_username", "your_password"))
# Save the response content to a file; r.content is bytes, so open in binary mode
with open("path/to/folder/to/save/file/filename.csv", 'wb') as my_file:
    my_file.write(r.content)

4 answers

Answer 1
3 accepted 2016-07-29 09:39:12

Answer 2
1 2016-07-27 14:39:43

Answer 3
1 2016-07-27 14:58:16

Answer 4
0 2016-07-27 14:51:09
