
Using Selenium and Requests module to get files from webpage in Python3

I'm hoping to get some help with a problem I've run into. I'm still fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python" to simplify some very tedious work.

Here's an overview of the problem: I'm trying to visit a web page and use the Requests and BeautifulSoup modules to parse the site, grab the URLs of the files I need, and then download those files. The process works great except for one snag... the page has a ReportDropDown option that filters the displayed results. The problem I'm running into is that even though the page results update with new information, the page URL doesn't change, so my requests.get() only pulls the information from the default filter.

So to work around that, I tried using Selenium to change the report selection... which also works great, except that I can't make Requests module calls from the open Selenium browser instance.

So it looks like I can use Requests and BeautifulSoup to get the information for the "default" page dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.


Part 1:

#! python3
import os, requests, bs4
os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)      # BAL-001-2.pdf, etc...
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    with open(j, 'wb') as f:        # 'with' guarantees the file gets closed
        for chunk in ires.iter_content(1000000):    # 1MB chunks
            f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to pull the new page information. I've tried variations of requests.get(), requests.post(url, data={}), selenium-requests, etc...
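A requests-only route that sometimes works for ASP.NET pages like this is to replicate the postback the dropdown fires: re-POST the page's hidden form fields (__VIEWSTATE and friends) from the initial GET, together with the dropdown's value. Below is a sketch of just the payload-building step; the field name 'ReportDropDown' and value '5' come from the element inspection mentioned above, but the exact form field names would need to be confirmed in the page source:

```python
import bs4

def build_postback(html, dropdown_name, value):
    """Collect the ASP.NET hidden fields (__VIEWSTATE, __EVENTVALIDATION, ...)
    from a rendered page and add the dropdown selection, mimicking what the
    browser would send back on a postback."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    data = {inp['name']: inp.get('value', '')
            for inp in soup.select('input[type=hidden]')
            if inp.has_attr('name')}
    data[dropdown_name] = value
    return data

# hypothetical usage:
# res = requests.get(standardURL)
# payload = build_postback(res.text, 'ReportDropDown', '5')
# res2 = requests.post(standardURL, data=payload)
```

Whether this works depends on how the page wires its postback (some controls also require __EVENTTARGET to be set), so treat it as a starting point rather than a guaranteed fix.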


Part 2:

Using Selenium seemed straightforward enough, but I couldn't get requests.get() to pull from the correct browser instance. Also, I had to create a Firefox profile (seleniumDefault) with some about:config changes... (Windows+R, firefox.exe -p). Update: the about:config change was to temporarily set browser.tabs.remote.autostart = True

from selenium import webdriver

# I used 'fp' to use a specific firefox profile
fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
browser = webdriver.Firefox(fp)
browser.get('http://www.nerc.net/standardsreports/standardssummary.aspx')

# There are 5 possible ReportDropDown selections but I only wanted 3 of them (current, future, inactive).
# In the html code, after a selection is made, it reads as: option selected="selected" value="5" -- where 'value' is the selection number

currentElem = browser.find_elements_by_tag_name('option')[0]
futureElem = browser.find_elements_by_tag_name('option')[1]
inactiveElem = browser.find_elements_by_tag_name('option')[4]

# Using browser.get() as above and then currentElem.click(), futureElem.click(), or
# inactiveElem.click() correctly changes the page selection. Apparently browser.get()
# is needed to refresh the page data before making a new option selection.
# Note: changing the ReportDropDown option doesn't alter the page URL path

So, my final question is: how do I make the page selections and then pull the appropriate data for each page?

My preference would be to use only the Requests and bs4 modules for this, but if I'm going to use Selenium, how do I get requests.get() to pull from the already-open Selenium browser instance?
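One common bridge between the two libraries is to copy the cookies out of the open Selenium browser into a requests.Session, so that follow-up requests.get() calls share the browser's session state. This is a sketch under the assumption that the server ties the filtered results to session cookies; the helper name is mine:

```python
import requests

def session_from_driver(cookies):
    """Build a requests.Session carrying the cookies Selenium already holds.
    `cookies` is the list of dicts returned by driver.get_cookies()."""
    s = requests.Session()
    for c in cookies:
        # each Selenium cookie dict has at least 'name' and 'value'
        s.cookies.set(c['name'], c['value'],
                      domain=c.get('domain', ''), path=c.get('path', '/'))
    return s

# hypothetical usage, with an open browser:
# s = session_from_driver(browser.get_cookies())
# ires = s.get(pdf_url)
```

Note this only helps for session-based state; it won't replay an ASP.NET postback by itself.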

I've exhausted what I know, and I'm still fairly new to Python, so any help would be greatly appreciated. Also, since I'm still learning a lot, any beginner-to-intermediate-level explanations would rock, thanks!

========================================================

Thanks again for the help, it got me over the wall I was stuck behind. Here's the final product... I had to add some sleep statements so that everything would load before grabbing the information.

Final revised version:

#! python3

# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()
    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'

    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
    for option in dropdownList:
        standardName = []   # Capture all the standard names accurately
        standardLink = []   # Capture all the href links for each standard
        standardDict = {}   # combine the standardName and standardLink into a dictionary 
        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))

        time.sleep(3)   # Needed for the 'inactive' page to completely load consistently
        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                   standardName.append(currentElems[currentStandard].getText())
                   # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                   standardName.append(futureElems[futureStandard].getText())
                   # COM-001-3       
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                   standardName.append(inactiveElems[inactiveStandard].getText())
                   # BAL-001-0

        # if number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
                # only write the pdf once we know the request succeeded
                with open(k + '.pdf', 'wb') as f:
                    for chunk in ires.iter_content(1000000):
                        f.write(chunk)
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            logFile.close()
            print(k + ': \n\t' + v)
    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
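As a refinement to the download loop above, requests can also stream large files to disk in chunks instead of buffering the whole response in memory first. A minimal sketch; the helper names are my own, not from the original code:

```python
import os

def pdf_name(url):
    """Derive a local filename from a standard's URL, e.g. BAL-001-2.pdf."""
    return os.path.basename(url)

def download(url, dest_dir='.'):
    """Stream a file to disk in 1 MB chunks."""
    import requests  # imported here so pdf_name() stays dependency-free
    path = os.path.join(dest_dir, pdf_name(url))
    # stream=True defers downloading the body until iter_content() is called
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in res.iter_content(chunk_size=1000000):
                f.write(chunk)
    return path
```

For pdf-sized files this makes little difference, but it keeps memory flat if any of the documents turn out to be large.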

Once you've done your work with Selenium, clicking buttons etc., you need to tell BeautifulSoup to use it:

    page_source = browser.page_source
    link_soup = bs4.BeautifulSoup(page_source,'html.parser')

@HenryM is on the right track, except that before reading .page_source and passing it to BeautifulSoup for further parsing, you need to make sure the data you want is actually loaded. For that, use WebDriverWait.

For example, after selecting the "Standards Filed and Pending Regulatory Approval" option, you need to wait for the report header to be updated - that would indicate that the new results have loaded. Something along these lines:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

# ...

wait = WebDriverWait(browser, 10)

option_text = "Standards Filed and Pending Regulatory Approval" 

# select the dropdown value
dropdown = Select(browser.find_element_by_id("ReportDropDown"))
dropdown.select_by_visible_text(option_text)

# wait for results to be loaded
wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#panel5 > div > span"), option_text))

soup = BeautifulSoup(browser.page_source,'html.parser')
# TODO: parse the results

Also note the use of Select to manipulate the dropdown.
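For completeness, the "parse the results" TODO could reuse the selector the question already relies on; a minimal sketch:

```python
import bs4

def extract_links(page_source):
    """Pull the href of every standard link out of the rendered page.
    '.style97 a' is the selector used in the question's own code."""
    soup = bs4.BeautifulSoup(page_source, 'html.parser')
    return [a.get('href') for a in soup.select('.style97 a')]

# hypothetical usage, after the wait above:
# links = extract_links(browser.page_source)
```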
