
Scraping a page with a JavaScript-filled table

I'm trying to scrape this page to get generation data to pass to a parser later on.

My problem is that the table is populated by multiple scripts that make requests to another server. Beautiful Soup scrapes the page but returns the JavaScript unexecuted, so I'm trying to use Selenium to open the page in a browser and then scrape the populated table.

When I run my code, Firefox loads the page and then closes, but Beautiful Soup still returns the page without the table populated. Inspecting the fully loaded page in the web console, I can see the data I need; for example, one data point is contained in a div tag with class="r11". A search for this tag returns None.

My thoughts are that either a) I'm using Selenium wrong, or b) the page's structure is throwing things off, since it looks to be quite deeply nested with several "sub documents" (not sure of the correct term).

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup

arg_therm = ('http://portalweb.cammesa.com/MEMNet1/Pages/Informes%20por%20'
        'Categor%C3%ADa/Operativos/VisorReporteSinComDesp_minimal.asp'
        'x?hora=0&titulo=Despacho%20Generacion%20Termica&reportPath='
        'http://lauzet:5000/MemNet1/ReportingServices/Despacho'
        'GeneracionTermica.rdl--0--Despacho+Generaci%c3%b3n+T%c3%a9rmica')


browser = webdriver.Firefox()  
browser.get(arg_therm)  
html_source = browser.page_source  

browser.quit()

soup=BeautifulSoup(html_source,'lxml')

print(soup.prettify())

print(soup.find('div', {"class": "r11"}))
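
The "sub documents" mentioned above are most likely frames, and a quick way to test hypothesis (b) is to list the frame tags in the HTML that Selenium returned. The following is a minimal sketch using only the standard library; FrameLister and list_frames are hypothetical names introduced here, and the sample markup is made up to stand in for browser.page_source.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect the attributes of every <iframe> and <frame> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("iframe", "frame"):
            self.frames.append(dict(attrs))


def list_frames(html_source):
    """Return one attribute dict per frame tag found in html_source."""
    parser = FrameLister()
    parser.feed(html_source)
    parser.close()
    return parser.frames


if __name__ == "__main__":
    # Hypothetical markup; in practice pass browser.page_source instead.
    sample = '<html><body><iframe name="outer" src="viewer.aspx"></iframe></body></html>'
    for frame in list_frames(sample):
        print(frame.get("name"), frame.get("src"))
```

If this reports any frames, the data probably lives in a separate document that a search of the top-level source will not reach.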

Try the code below to get the required table:

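from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

# arg_therm is the URL defined in the question
browser = webdriver.Firefox()
browser.get(arg_therm)

# The table sits inside two nested frames, so switch into each in turn.
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3.
browser.switch_to.frame(browser.find_element(By.XPATH, '//iframe[starts-with(@name, "RportFramectl00")]'))
browser.switch_to.frame('report')

# Wait up to 10 seconds for the populated cells to appear
table_cells = wait(browser, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "r11")))
for cell in table_cells:
    print(cell.text)

browser.quit()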

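This waits up to 10 seconds for the required elements to appear and returns a list of those divs. The switch_to.frame calls are needed because the table sits inside nested iframes, and browser.page_source taken on the top-level document does not include their contents.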
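
If you would rather keep parsing with Beautiful Soup, as in the question, one option is to read browser.page_source after switching into the inner frame, since page_source then reflects the current frame's document. The sketch below polls until a marker string shows up. wait_for_markup is a hypothetical helper, a hand-rolled stand-in for WebDriverWait, and the marker 'class="r11"' is an assumption about how the attribute is serialized in the source.

```python
import time


def wait_for_markup(browser, marker, timeout=10.0, poll=0.5):
    """Poll browser.page_source until marker appears, then return that source.

    browser is any object exposing a page_source attribute, such as a
    Selenium WebDriver that has already been switched into the inner frame.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = browser.page_source
        if marker in source:  # plain substring check, not a DOM query
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} seconds")
        time.sleep(poll)


# Usage, after the two switch_to.frame(...) calls and before quitting:
#   html_source = wait_for_markup(browser, 'class="r11"')
#   browser.quit()
#   soup = BeautifulSoup(html_source, 'lxml')
#   print(soup.find_all('div', {'class': 'r11'}))
```

The same check can be written with Selenium's own wait as wait(browser, 10).until(lambda d: 'class="r11"' in d.page_source); the standalone version above just avoids tying the helper to a specific driver object.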


 