Using Selenium and Requests module to get files from webpage in Python3
I'm hoping to get some help with a problem I've run into. I'm still fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python" to streamline some very tedious work.
Here's an overview of the problem: I'm trying to visit a webpage and use the Requests and BeautifulSoup modules to parse the site, grab the URLs of the files I need, and then download those files. The process works great except for one small snag... the page has a ReportDropDown option that filters the displayed results. The problem I'm running into is that even though the page results are updated with new information, the page URL doesn't change, so my requests.get() only ever pulls the information from the default filter.
So, to work around that, I tried using Selenium to change the report selection... which also works great, except that I can't issue Requests calls from the open Selenium browser instance.
So it looks like I can use Requests and BeautifulSoup to get the information behind the "default" dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.
Part 1:
#! python3
import os, requests, bs4

os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)  # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):  # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()

print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))
This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to pull the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, etc...
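One Requests-only route that may be worth trying: ASP.NET WebForms pages like this .aspx one typically change the dropdown by POSTing the whole form back, including hidden `__VIEWSTATE`/`__EVENTVALIDATION` fields. Below is a sketch of building that payload; the form field name `ReportDropDown` and the hidden-field layout are assumptions to verify against the actual page source:

```python
import bs4

def build_post_data(page_html, option_value):
    """Build a WebForms-style POST payload that re-submits the page's
    hidden state fields plus a new dropdown value.

    Assumes the dropdown's form field is named 'ReportDropDown';
    check the real <select name=...> in the page source.
    """
    soup = bs4.BeautifulSoup(page_html, 'html.parser')
    # Collect every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...)
    data = {inp.get('name'): inp.get('value', '')
            for inp in soup.select('input[type=hidden]')
            if inp.get('name')}
    data['ReportDropDown'] = option_value  # e.g. '5' for inactive
    return data

# Usage sketch:
# session = requests.Session()
# res = session.get(standardURL)
# post = session.post(standardURL, data=build_post_data(res.text, '5'))
# soup = bs4.BeautifulSoup(post.text, 'html.parser')
```

If the server accepts the replayed state fields, the POST response contains the re-filtered results even though the URL never changes.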
Part 2:
Using Selenium seems straightforward, but I can't issue requests.get() from the correct browser instance. Also, I had to create a Firefox profile (seleniumDefault) with some about:config changes... (Windows+R, firefox.exe -p). Update: the about:config change was to temporarily set browser.tabs.remote.autostart = True
from selenium import webdriver

# I used 'fp' to use a specific firefox profile
fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
browser = webdriver.Firefox(fp)
browser.get('http://www.nerc.net/standardsreports/standardssummary.aspx')

# There are 5 possible ReportDropDown selections but I only wanted
# 3 of them (current, future, inactive). In the html code, after a
# selection is made, it reads as: option selected="selected" value="5"
# -- where 'value' is the selection number
currentElem = browser.find_elements_by_tag_name('option')[0]
futureElem = browser.find_elements_by_tag_name('option')[1]
inactiveElem = browser.find_elements_by_tag_name('option')[4]

# Using the browser.get() line above and then currentElem.click(),
# futureElem.click(), or inactiveElem.click() correctly changes the
# page selection. Apparently browser.get() is needed to refresh the
# page data before making a new option selection.
# Note: changing the ReportDropDown option doesn't alter the page URL
So my ultimate question is: how do I select each page and pull the appropriate data from it?
My strong preference would be to use only the Requests and bs4 modules, but if I have to use Selenium, how do I get requests.get() to pull from the already-open Selenium browser instance?
I've researched this as much as I can, and I'm still pretty new to Python, so any help would be greatly appreciated. Also, since I'm still learning a lot, any beginner-to-intermediate-level explanations would rock. Thanks!
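On that last point: Requests cannot attach to an already-open browser, but you can copy the Selenium browser's cookies into a requests.Session so that subsequent downloads happen under the same server-side session. A sketch follows; whether this particular page actually keys the filter state to a session cookie is an assumption:

```python
import requests

def session_from_browser(browser):
    """Copy cookies from a live Selenium WebDriver into a requests.Session."""
    session = requests.Session()
    for cookie in browser.get_cookies():  # Selenium returns a list of dicts
        session.cookies.set(cookie['name'], cookie['value'],
                            domain=cookie.get('domain'),
                            path=cookie.get('path', '/'))
    return session

# Usage sketch:
# browser.get(standardURL)
# ... change the ReportDropDown option with Selenium ...
# s = session_from_browser(browser)
# pdf = s.get('http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf')
```

Note that the direct PDF links on this site appear to be plain static URLs, so for the downloads themselves a bare requests.get() may work without any cookie sharing; the session copy matters only for pages that track state server-side.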
========================================================
Thanks again for the help; it got me past the wall that was blocking me. Here's the final product... I had to add some sleep statements so everything would load before grabbing the information.
Final revised version:
#! python3
# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()

    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'
    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')

    for option in dropdownList:
        standardName = []  # Capture all the standard names accurately
        standardLink = []  # Capture all the href links for each standard
        standardDict = {}  # combine the standardName and standardLink into a dictionary

        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, 'div > span[class="style12"]'), option))
        time.sleep(3)  # Needed for the 'inactive' page to completely load consistently

        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            f = open(k + '.pdf', 'wb')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            for chunk in ires.iter_content(1000000):
                f.write(chunk)
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)

    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
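The chunked download loop used in both versions of the script can be factored into a small helper so each PDF is streamed to disk without holding the whole file in memory; this is only a refactoring sketch of the loop above:

```python
def save_stream(response, path, chunk_size=1000000):
    """Write a response body to 'path' in chunks (1MB by default)."""
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)
    return path

# Usage sketch, replacing the manual open/iter_content/close sequence:
# ires = requests.get(v)
# ires.raise_for_status()
# save_stream(ires, k + '.pdf')
```

The `with` block also guarantees the file handle is closed even when a download raises mid-loop, which the original open()/close() pairing does not.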
Once you have done your work with Selenium, clicking buttons etc., you need to tell BeautifulSoup to use the browser's copy of the page:
page_source = browser.page_source
link_soup = bs4.BeautifulSoup(page_source, 'html.parser')
@HenryM is on the right track, except that before reading the .page_source and passing it to BeautifulSoup for further parsing, you need to make sure the data you are after has loaded. For this, use the WebDriverWait class.
For example, after selecting the "Standards Filed and Pending Regulatory Approval" option, you need to wait for the report header to be updated: this indicates that the new results have loaded. Something along these lines:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
# ...
wait = WebDriverWait(browser, 10)
option_text = "Standards Filed and Pending Regulatory Approval"
# select the dropdown value
dropdown = Select(browser.find_element_by_id("ReportDropDown"))
dropdown.select_by_visible_text(option_text)
# wait for results to be loaded
wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#panel5 > div > span"), option_text))
soup = BeautifulSoup(browser.page_source, 'html.parser')
# TODO: parse the results
Also note the use of the Select class to manipulate the dropdown.