How to download a file using Python, Selenium and PhantomJS

Question

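Here is my situation: I have to log in to a website and download a CSV from it, headless, from a Linux server. The page uses JS and does not work without it.

After some research I went with Selenium and PhantomJS. Logging in, setting the parameters for the CSV and finding the download button with Selenium/PhantomJS/Py3 was no problem, and actually surprisingly pleasant.

But clicking the download button did nothing. After some research I found that PhantomJS does not seem to support download dialogs or downloads, although the feature is on the list of upcoming features.

So, after finding out that the download button just calls a REST API URL, I thought I would work around it with urllib. The problem is that the URL only works while you are logged in to the site. So my first attempt failed, returning: b'{"success":false,"session":"expired"}', which makes sense because I would expect Selenium and urllib to use different sessions. So I thought I would try using the headers from Selenium's driver in urllib: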

...
url = 'http://www.foo.com/api/index'
data = urllib.parse.urlencode({
        'foopara': 'cadbrabar',
    }).encode('utf-8')
headers = {}
for cookie in driver.get_cookies():
    headers[cookie['name']] = cookie['value']
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    page = response.read()
driver.close()

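Unfortunately this produced the same expired-session result. Am I doing something wrong, is there another way to solve this, or have I hit a dead end? Thanks in advance.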
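
One thing worth noting about the snippet above: it sends every cookie as its own HTTP header, whereas a server reads cookies from a single Cookie header of the form name=value; name=value. Below is a minimal sketch of that, assuming the session cookie alone is enough for the server to recognise the login (some sites also check the User-Agent or other headers). The functions build_cookie_header and build_request and the example cookie values are hypothetical; in the real script the list would come from driver.get_cookies().

```python
import urllib.parse
import urllib.request


def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


def build_request(url, params, cookies):
    """Build a POST request that carries the browser's cookies."""
    data = urllib.parse.urlencode(params).encode('utf-8')
    headers = {'Cookie': build_cookie_header(cookies)}
    return urllib.request.Request(url, data, headers)


# Stand-in for driver.get_cookies(); the cookie names and values are made up.
example_cookies = [
    {'name': 'sessionid', 'value': 'abc123'},
    {'name': 'lang', 'value': 'en'},
]
req = build_request('http://www.foo.com/api/index',
                    {'foopara': 'cadbrabar'}, example_cookies)
print(req.get_header('Cookie'))
```

The request object can then be passed to urllib.request.urlopen(req) exactly as in the original snippet.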

Answer 1

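I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but chromedriver with a virtual framebuffer. The result is the same, and it gets the job done.

What you need is: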

pip install selenium pyvirtualdisplay

apt-get install xvfb

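Download ChromeDriver

I used Py3.5 and a test file from ovh.net, which is linked with an <a> tag instead of a button. The script waits for the <a> tag to be present on the page and then clicks it. If you don't wait for the element and you are on an async site, the element you are trying to click may not exist yet. The download location is a folder relative to the directory the script is run from. The script checks that directory once per second to see whether the file has been downloaded. If I am not mistaken, the file carries a temporary extension while the download is in progress (.crdownload in Chrome; .part is what Firefox uses), and as soon as it becomes the .dat named in filename the script finishes. If you close the virtual framebuffer and the driver before that, the download will not complete. The complete script looks like this: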

#!/usr/bin/env python3
# coding: utf-8

import glob
import os
import sys
import time

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def main(argv):
    url = 'http://ovh.net/files'
    dl_dir = 'downloads'
    filename = '1Mio.dat'

    # Start the virtual framebuffer (Xvfb) so Chrome has a display to render to.
    display = Display(visible=0, size=(800, 600))
    display.start()

    # Tell Chrome to save downloads into ./downloads under the working directory.
    chrome_options = webdriver.ChromeOptions()
    dl_location = os.path.join(os.getcwd(), dl_dir)
    prefs = {"download.default_directory": dl_location}
    chrome_options.add_experimental_option("prefs", prefs)

    # Selenium 4 style; older releases took executable_path= and chrome_options=.
    chromedriver = "./chromedriver"
    driver = webdriver.Chrome(service=Service(chromedriver), options=chrome_options)

    driver.set_window_size(800, 600)
    driver.get(url)

    # Wait up to 30 s for the link to exist; until() returns the located element.
    xpath = '//a[@href="' + filename + '"]'
    hyperlink = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, xpath)))
    hyperlink.click()

    # Poll once per second until the finished file shows up in the download folder.
    while not glob.glob(os.path.join(dl_location, filename)):
        time.sleep(1)

    # Shut down only after the download has completed.
    driver.quit()
    display.stop()


if __name__ == '__main__':
    main(sys.argv)
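
The polling loop in the script above never gives up, so a failed download would hang the script forever. Here is a rough sketch of a bounded wait, not part of the original script. The helper name wait_for_download and its parameters are hypothetical, and it assumes the download folder is used only by this script, so any leftover .crdownload (Chrome) or .part (Firefox) file means a download is still running.

```python
import os
import tempfile
import time


def wait_for_download(directory, filename, timeout=60.0, poll_interval=1.0,
                      temp_suffixes=('.crdownload', '.part')):
    """Return True once the file exists and no temp download file remains.

    Returns False if that has not happened within `timeout` seconds.
    """
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        # str.endswith accepts a tuple, so this checks every suffix at once.
        in_progress = any(name.endswith(temp_suffixes) for name in names)
        if os.path.isfile(target) and not in_progress:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Demo against a scratch directory instead of a real browser download.
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, '1Mio.dat'), 'wb').close()
print(wait_for_download(scratch, '1Mio.dat', timeout=2.0, poll_interval=0.1))
```

In the script, a call such as wait_for_download(dl_location, filename) could stand in for the while loop, with the driver and display shut down afterwards whether it returns True or False.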

I hope this helps someone in the future.

Answer 2

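If the button you want to download from has a link to the file, you can test the download with Python code, because PhantomJS itself does not support downloads. So if your download button does not provide a file link, you cannot test it this way.

To test using the file link and Python (to assert that the file exists), you can follow the topic below. Since I am a C# developer and tester, I don't know the best way to write the code in Python without errors, but I am sure you can:

Basic http file downloading and saving to disk in python?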
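
For reference, a plain download to disk can be sketched with the standard library alone. This is only an illustration under assumptions: the helper name download_file is hypothetical, and it assumes the file link is reachable without a login (otherwise the session cookies would have to be sent along as well).

```python
import shutil
import urllib.request


def download_file(url, dest_path, timeout=30):
    """Stream the response body at `url` into `dest_path` and return the path."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, 'wb') as out_file:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out_file)
    return dest_path
```

After the call, os.path.isfile(dest_path) can serve as the assertion that the file exists.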

Answer 3

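I recently used Selenium with ChromeDriver to download files from the web. This works because Chrome automatically downloads the file and stores it in the Downloads folder for you. It was easier than using PhantomJS.

I recommend looking into using ChromeDriver with Selenium and going that route: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

EDIT - As pointed out below, I neglected to point to how to set up ChromeDriver to run in headless mode. Here is more info: http://www.chrisle.me/2013/08/running-headless-selenium-with-chrome/

Or: https://gist.github.com/chuckbutler/8030755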

Answer 4

You can try something like this:

from requests.auth import HTTPBasicAuth
import requests

url = "http://some_site/files?file=file.csv"  # URL used to download the file
# GET request for the file content, using your site credentials for access
r = requests.get(url, auth=HTTPBasicAuth("your_username", "your_password"))
# Stop here if the server answered with an error status
r.raise_for_status()
# Save the response to disk; r.content is bytes, so open the file in binary mode
with open("path/to/folder/to/save/file/filename.csv", 'wb') as my_file:
    my_file.write(r.content)


4 answers

Answer 1: score 3, accepted, 2016-07-29 09:39:12
Answer 2: score 1, 2016-07-27 14:39:43
Answer 3: score 1, 2016-07-27 14:58:16
Answer 4: score 0, 2016-07-27 14:51:09
