
How do I scrape data from a JavaScript website?

I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how I can scrape data from the dynamic table. Here is my code.

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests

# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)

# run Chrome (or Firefox) webdriver from an executable path of your choice
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')

# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()


# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id = 'frame_historicos']")
print(results)
print(len(results))


# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head" : headingfind, "name" : heading})
# close driver 
driver.quit()
###################################################################



# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')

I want to extract the data from 2005 to the present with Average/Total set to 10 minutes, which only gives me one month of data at a time.

  1. Induce WebDriverWait and element_to_be_clickable().
  2. Install the Beautiful Soup library.
  3. Use pandas read_html().
  4. I have not created the lists; you should build startdate and enddate lists for every month since 1/1/2005 (see the month-loop sketch after the code below).

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time

urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(urlpage)

# the historical-data form lives inside an iframe, so switch into it first
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))

# fill in the start and end dates
inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
inputstartdate.clear()
inputstartdate.send_keys("1/1/2005")
inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
inputenddate.clear()
inputenddate.send_keys("1/31/2005")

# click REFRESH and wait for the table to render
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
time.sleep(3)

# parse the rendered table with BeautifulSoup and pandas
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", id="taboa")
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
df.to_csv('testsot.csv')
print(df)
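The code above only fetches January 2005; item 4 in the list is left to the reader. Below is a minimal sketch of that missing month loop, reusing the answer's locators and wrapping them in a scrape_month() helper. The helper name and the testsot_all.csv output file are my own for illustration, and the sketch assumes the form accepts any m/d/YYYY range the same way it accepts 1/1/2005.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(urlpage)

def scrape_month(start, end):
    """Fill the date inputs, click REFRESH and return that month's table as a DataFrame."""
    # Re-enter the iframe on every call in case the previous REFRESH reloaded it.
    driver.switch_to.default_content()
    WebDriverWait(driver, 20).until(
        EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))
    inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
    inputstartdate.clear()
    inputstartdate.send_keys(start)
    inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
    inputenddate.clear()
    inputenddate.send_keys(end)
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
    WebDriverWait(driver, 20).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", id="taboa")
    return pd.read_html(str(table))[0]

# Month starts from January 2005 up to the current month, plus matching month ends.
starts = pd.date_range("2005-01-01", pd.Timestamp.today(), freq="MS")
ends = starts + pd.offsets.MonthEnd(0)

frames = []
for s, e in zip(starts, ends):
    # The form expects m/d/YYYY dates, e.g. "1/1/2005".
    frames.append(scrape_month(f"{s.month}/{s.day}/{s.year}", f"{e.month}/{e.day}/{e.year}"))

driver.quit()

# Stack all the monthly tables into one CSV covering 2005 to now.
pd.concat(frames, ignore_index=True).to_csv('testsot_all.csv', index=False)

Since the site only returns one month of 10-minute averages per request, concatenating the per-month DataFrames with pd.concat() is what produces the full 2005-to-present history in a single file.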
