
How do I scrape data from a JavaScript website?

I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how to scrape the data from the dynamic table. Here is my code.

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests

# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)

# run Chrome webdriver from executable path of your choice (Firefox alternative commented out below)
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')

# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()


# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id = 'frame_historicos']")
print(results)
print(len(results))


# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head" : headingfind, "name" : heading})
# close driver 
driver.quit()
###################################################################



# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')

I want to extract data from 2005 to the present with Averages/Totals of 10 min, but this gives me data for only one month at a time.

  1. Induce WebDriverWait and element_to_be_clickable().
  2. Install the Beautiful Soup library.
  3. Use pandas read_html().
  4. I haven't created the date lists; you should create startdate and enddate lists and iterate over every month since 1/1/2005 (see the sketch after the code below).

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time

urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(urlpage)

# switch into the iframe that holds the historical-data form
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))

# fill in the start and end dates
inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
inputstartdate.clear()
inputstartdate.send_keys("1/1/2005")
inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
inputenddate.clear()
inputenddate.send_keys("1/31/2005")

# submit the form and wait for the table to render
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
time.sleep(3)

# parse the rendered table with BeautifulSoup and pandas
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", id="taboa")
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
df.to_csv('testsot.csv')
print(df)
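
To cover point 4, here is a minimal sketch of the month-by-month loop, assuming the same frame ID, input XPaths, and M/D/YYYY date format used in the answer's code above; the month_ranges() helper and the all_months.csv output name are made up for illustration, not verified against the live site. It reloads the page for each month (which also resets the iframe context) and concatenates every month's table into a single CSV.

import calendar
from datetime import date

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def month_ranges(start_year=2005, end=None):
    # Yield (first_day, last_day) strings in M/D/YYYY for every month from start_year to today.
    end = end or date.today()
    year, month = start_year, 1
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield f"{month}/1/{year}", f"{month}/{last_day}/{year}"
        month += 1
        if month > 12:
            month, year = 1, year + 1

urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
frames = []
for start, stop in month_ranges():
    # reload the page so the iframe and form are in a known state
    driver.get(urlpage)
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))
    # fill in this month's start and end dates
    startbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
    startbox.clear()
    startbox.send_keys(start)
    endbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
    endbox.clear()
    endbox.send_keys(stop)
    # refresh, wait for the table, then parse it into a DataFrame
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", id="taboa")
    frames.append(pd.read_html(str(table))[0])
driver.quit()
pd.concat(frames, ignore_index=True).to_csv('all_months.csv', index=False)

Reloading the page on every iteration is slower but keeps each month independent; if the form allows it, changing the dates without a reload would also work.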
