Scraping a specific website with a search box and JavaScript in Python
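On the website https://sray.arabesque.com/dashboard there is a search box ("input") in the HTML. I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (for example, "Anglo American plc"), go to the URL with the information about that company, let the JavaScript load to get the fully rendered version of the page, and then scrape the GC Score, ESG Score and Temperature Score at the bottom.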
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument('-headless')
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)
companies = ['Anglo American plc']
for company in companies:
    # dryscrape.start_xvfb()
    # session = dryscrape.Session()
    # session.visit("https://srayapi.arabesque.com/api/sray/company/history/004BTP-E")
    resp = wd.get('https://sray.arabesque.com/dashboard/')
    #print(driver.page_source)
    e = wd.find_element_by_id(id_='mat-input-0')
    e.send_keys(company)
    e.send_keys(Keys.ENTER)
    innerHTML = e.execute_script("return document.body.innerHTML")
    print(innerHTML)
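If we do not know the URL that results from entering the company name in the search box, I don't quite understand how to reach the page with the Anglo American plc information and scrape it.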
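You can use selenium to do that. You need to update a few things.
When interacting in headless mode, you need to provide the window size.
Introduce WebDriverWait() to avoid synchronization issues.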
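To illustrate what an explicit wait does conceptually, here is a minimal pure-Python sketch of polling a condition until it holds or a timeout expires. The helper `wait_until` and its parameters are hypothetical names used only for illustration; this is not Selenium's implementation, just the general idea behind waiting instead of reading the page immediately.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Hypothetical helper for illustration only; it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Example: a stand-in condition that only becomes true on the third call,
# much like an element that appears after the page's scripts have run.
calls = {"count": 0}


def ready_on_third_call():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(ready_on_third_call, timeout=5.0, poll_interval=0.01)
```

With a real browser, the condition would be a check such as "the element is present and clickable", which is the role the expected_conditions helpers play.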
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('-headless')
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')
options.add_argument('window-size=1920,1080')
wd = webdriver.Chrome(options=options)
companies = ['Anglo American plc']
for company in companies:
    wd.get('https://sray.arabesque.com/dashboard/')
    # Click the 'list' link, then type the company name into the search box.
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='list']"))).click()
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='mat-input-0']"))).send_keys(company)
    # Click the suggestion for the current company, then open its dashboard.
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, f"//span[contains(.,' {company} ')]"))).click()
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//span[normalize-space(.)='Open dashboard'])[1]"))).click()
    # Wait for the score tabs to become visible before reading them.
    WebDriverWait(wd, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.mat-tab-labels")))
    # find_element(By.XPATH, ...) is used because find_element_by_xpath was removed in Selenium 4.3.
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'GC Score')]/span").text)
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'ESG Score')]/span").text)
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'Temp')]/span").text)
Output:
57.03
53.78
2.7°C
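The values printed above are strings, and the temperature carries a unit suffix. If numbers are needed for further processing, a small helper along these lines could convert them. This is only a sketch: `parse_score` is a hypothetical name, and it assumes each scraped string contains a single plain decimal number.

```python
import re


def parse_score(text):
    """Return the first number found in `text` as a float, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# The three strings shown in the output above.
scores = [parse_score(s) for s in ["57.03", "53.78", "2.7°C"]]
print(scores)
```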
Without fully understanding why you want to use selenium to search and then load another page, I would do the following to get the data you are looking for:
import requests
session = requests.Session()
url = 'https://srayapi.arabesque.com/api/sray/q'
response = session.get(url).json()
rays = response['data']['rays']
# Keep the matches in `results`, which the next snippet reads.
results = [ray for ray in rays if ray['name'].startswith('Anglo American')]
Then do whatever you want with it; for esg, gc and temperature, perhaps:
myObj = [{result['name']: {'gc': result['gc'], 'esg': result['esg'], 'temp': result['score_near']}} for result in results]
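As a minimal sketch, the filtering and extraction steps can be folded into one function that works on the list of rays. The function name `scores_for` and the sample data are hypothetical; the field names `name`, `gc`, `esg` and `score_near` are taken from the snippet above, and the sample values are made up so the example runs without calling the API.

```python
def scores_for(rays, name_prefix):
    """Map each matching company name to its GC, ESG and temperature values."""
    return {
        ray["name"]: {"gc": ray["gc"], "esg": ray["esg"], "temp": ray["score_near"]}
        for ray in rays
        if ray["name"].startswith(name_prefix)
    }


# Hypothetical sample shaped like response['data']['rays']; values are made up.
sample_rays = [
    {"name": "Anglo American plc", "gc": 1.0, "esg": 2.0, "score_near": 3.0},
    {"name": "Example Corp", "gc": 4.0, "esg": 5.0, "score_near": 6.0},
]

print(scores_for(sample_rays, "Anglo American"))
```

Against the live API, the `rays` list from the response would be passed in place of `sample_rays`, assuming the response still has that shape.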