
How do I scrape data from a JavaScript website?

I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how to scrape the data from the dynamic table. Here is my code:

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests

# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)

# run the Chrome webdriver from an executable path of your choice (Firefox alternative commented out below)
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')

# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()


# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id = 'frame_historicos']")
print(results)
print(len(results))


# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head" : headingfind, "name" : heading})
# close driver 
driver.quit()
###################################################################



# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')

I want to extract data from 2005 till the present with Averages/Totals of 10 min, but the form only gives me data for one month at a time.

  1. Induce WebDriverWait and element_to_be_clickable().
  2. Install the Beautiful Soup library.
  3. Use pandas read_html().
  4. I haven't created the date lists here; you should build startdate and enddate lists and iterate over every month since 1/1/2005 (see the sketch after the code below).

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC
     import pandas as pd
     from bs4 import BeautifulSoup
     import time

     urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
     driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
     driver.get(urlpage)

     # the historical table lives inside an iframe; wait for it and switch into it
     WebDriverWait(driver, 20).until(
         EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))

     # fill in the start date
     inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
     inputstartdate.clear()
     inputstartdate.send_keys("1/1/2005")

     # fill in the end date
     inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
     inputenddate.clear()
     inputenddate.send_keys("1/31/2005")

     # submit the form and wait until the table has rendered
     WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
         (By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
     WebDriverWait(driver, 20).until(
         EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
     time.sleep(3)

     # parse the rendered table with BeautifulSoup and pandas
     soup = BeautifulSoup(driver.page_source, "html.parser")
     table = soup.find("table", id="taboa")
     df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
     df.to_csv('testsot.csv')
     print(df)
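A minimal sketch of the month iteration suggested in point 4, assuming the date pickers accept M/D/YYYY strings as in the code above; month_ranges is a hypothetical helper, not part of the original answer:

     import calendar
     from datetime import date

     def month_ranges(start_year=2005, start_month=1, end=None):
         # yield (first_day, last_day) strings in M/D/YYYY for every month
         # from start_year/start_month up to and including the current month
         end = end or date.today()
         y, m = start_year, start_month
         while (y, m) <= (end.year, end.month):
             last_day = calendar.monthrange(y, m)[1]  # number of days in this month
             yield f"{m}/1/{y}", f"{m}/{last_day}/{y}"
             y, m = (y + 1, 1) if m == 12 else (y, m + 1)

     frames = []
     for startdate, enddate in month_ranges():
         # repeat the clear/send_keys/REFRESH/read steps above with this
         # month's startdate and enddate, then collect the parsed table:
         # frames.append(pd.read_html(str(table))[0])
         pass
     # df = pd.concat(frames, ignore_index=True)
     # df.to_csv('testsot.csv')

Concatenating the per-month frames once at the end avoids rewriting a partial CSV on every iteration.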
