
How do I scrape data from a JavaScript website?

I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how to scrape the data from the dynamic table. Here is my code:

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests

# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)

# run the Chrome webdriver from an executable path of your choice (Firefox alternative commented out below)
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')

# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()


# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id = 'frame_historicos']")
print(results)
print(len(results))


# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head" : headingfind, "name" : heading})
# close driver 
driver.quit()
###################################################################



# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')

I want to extract data from 2005 till the present with Averages/Totals of 10 min, but the form only gives me data for one month at a time.

  1. Induce WebDriverWait and element_to_be_clickable().
  2. Install the Beautiful Soup library.
  3. Use pandas read_html().
  4. I haven't created the date lists here; you should build startdate and enddate lists and iterate over every month since 1/1/2005 (see the sketch after the code below).

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC
     import pandas as pd
     from bs4 import BeautifulSoup
     import time

     urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
     driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
     driver.get(urlpage)

     # the historical table lives inside an iframe; wait for it and switch into it
     WebDriverWait(driver, 20).until(
         EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))

     # fill in the start date
     inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
     inputstartdate.clear()
     inputstartdate.send_keys("1/1/2005")

     # fill in the end date
     inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
     inputenddate.clear()
     inputenddate.send_keys("1/31/2005")

     # submit the form and wait until the table has rendered
     WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
     WebDriverWait(driver, 20).until(
         EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
     time.sleep(3)

     # parse the rendered table with BeautifulSoup and pandas
     soup = BeautifulSoup(driver.page_source, "html.parser")
     table = soup.find("table", id="taboa")
     df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
     df.to_csv('testsot.csv')
     print(df)
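A minimal sketch of the month iteration suggested in point 4, assuming the date pickers accept M/D/YYYY strings as in the code above; month_ranges is a hypothetical helper, not part of the original answer:

     import calendar
     from datetime import date

     def month_ranges(start_year=2005, start_month=1, end=None):
         # yield (first_day, last_day) strings in M/D/YYYY for every month
         # from start_year/start_month up to and including the current month
         end = end or date.today()
         y, m = start_year, start_month
         while (y, m) <= (end.year, end.month):
             last_day = calendar.monthrange(y, m)[1]  # number of days in this month
             yield f"{m}/1/{y}", f"{m}/{last_day}/{y}"
             y, m = (y + 1, 1) if m == 12 else (y, m + 1)

     frames = []
     for startdate, enddate in month_ranges():
         # repeat the clear/send_keys/REFRESH/read steps above with this
         # month's startdate and enddate, then collect the parsed table:
         # frames.append(pd.read_html(str(table))[0])
         pass
     # df = pd.concat(frames, ignore_index=True)
     # df.to_csv('testsot.csv')

Concatenating the per-month frames once at the end avoids rewriting a partial CSV on every iteration.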
