
Issues designing a simple web scraper in Python

I followed an online tutorial and, working through it step by step, successfully created a web scraper identical to the tutorial's.

However, when I try to run the same code on my desired website, nothing but blank data is returned to my console. I was hoping someone could look at the short code I have written to gather the data and tell me whether I have done this correctly, or whether I am unaware of some protocol on the website that prevents data from being scraped from it.

# import libraries
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 

myurl = "http://smartgriddashboard.eirgrid.com/#all/generation"

# opening up connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# find the data of interest
key_stats = page_soup.findAll("div",{"class":"key-stats-container"})

When I then print key_stats, all that appears is []. As I said before, when doing this on the sample web page in the online tutorial, all of the data within that class was stored.

I am not a programmer by profession and all of this is very new to me so any and all assistance would be hugely appreciated.

The issue is that the div you are trying to scrape is generated dynamically with JavaScript. It's not in the HTML source code, which means urllib.request doesn't have access to that information. When you load the page in your browser, you'll notice the statistics aren't immediately on the screen; they appear a few seconds after the page loads.
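To illustrate (with made-up HTML, not the real page markup): BeautifulSoup can only match elements that are present in the HTML string it is given, so a container that JavaScript fills in later yields an empty list:

```python
from bs4 import BeautifulSoup

# What the server actually sends: an empty container (illustrative markup)
served_html = '<div class="key-stats-container"></div>'
# What the browser shows after JavaScript has run (illustrative markup)
rendered_html = ('<div class="key-stats-container">'
                 '<div class="stat-box"><p>4,885 MW</p></div>'
                 '</div>')

# Searching the served HTML finds nothing; searching the rendered HTML succeeds
print(BeautifulSoup(served_html, "html.parser").find_all("div", {"class": "stat-box"}))    # []
print(len(BeautifulSoup(rendered_html, "html.parser").find_all("div", {"class": "stat-box"})))  # 1
```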

You can either look through the JavaScript or the network requests the site makes and find where the information is coming from (usually JSON or XML responses), or use something like Selenium (an automated browser) to parse the page after the relevant elements have appeared:
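If you do find a JSON endpoint in your browser's DevTools Network tab, you can skip the browser entirely. The endpoint URL and payload shape below are hypothetical placeholders, not the site's real API; adapt them to whatever the site actually returns:

```python
import json

def extract_stats(payload):
    # The "stats"/"label"/"value" keys are an assumed payload shape --
    # inspect the real response in DevTools and adjust accordingly.
    return [(item["label"], item["value"]) for item in payload["stats"]]

# In practice you would fetch the endpoint you found, e.g.:
#   from urllib.request import urlopen
#   payload = json.loads(urlopen("https://example.com/api/stats").read())
# Here we parse a small sample payload instead:
sample = json.loads('{"stats": [{"label": "RENEWABLE GENERATION", "value": "43.03 %"}]}')
print(extract_stats(sample))  # [('RENEWABLE GENERATION', '43.03 %')]
```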

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

try:
    driver.get("http://smartgriddashboard.eirgrid.com/#all/generation")  # load the page
    # wait until the relevant elements are on the page (up to 15 seconds)
    WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.key-stats-container > .stat-box')))
except Exception:
    # quit if there was an error getting the page, or we've waited
    # 15 seconds and the stats still haven't appeared
    driver.quit()
    raise

stat_elements = driver.find_elements(By.CSS_SELECTOR, '.key-stats-container > .stat-box')
for el in stat_elements:
    print(el.find_element(By.CSS_SELECTOR, 'label').text)
    print(el.find_element(By.CSS_SELECTOR, 'p').text)
driver.quit()

WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.key-stats-container > .stat-box'))) waits up to 15 seconds for an element matching the CSS selector to appear before timing out; you can change the 15 seconds as you want.

Instead of just waiting for .key-stats-container, I waited for .key-stats-container > .stat-box (an element with the class stat-box that is a direct child of .key-stats-container), since there was a point at which .key-stats-container had loaded but the statistics hadn't:

<div class="key-stats-container">
    <span class="load"></span>
    <div class="error-msg">
        <p>We had some trouble gathering the data.</p>
        <p>Refresh to try again.</p>
    </div>
</div>

Here is the output:

LATEST SYSTEM
GENERATION
4,885 MW
THERMAL GENERATION
(COAL, GAS, OTHER)
56.81 %
RENEWABLE
GENERATION
43.03 %
NET
IMPORT
0.16 %

It doesn't look like the data you want is in the page that gets downloaded. You can check this with print(page_soup.prettify()).

A way around this is to use Selenium to open up a web browser and then download the page:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

driver = webdriver.Firefox()
driver.get('http://smartgriddashboard.eirgrid.com/#all/generation')
page_soup = soup(driver.page_source, 'html.parser')
driver.quit()
  • Note that Selenium needs geckodriver to be installed.
  • I'm sure there's a better way, using Requests or something else.
  • A super simple alternative is to grab the rendered page source from your browser (e.g. right-click and inspect, then copy the element's HTML) and feed that string to Beautiful Soup.

On a side note: while findAll works, it is the old camelCase name. The newer find_all method, or CSS selectors via select, are probably better.
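For example, on a snippet shaped like this page's markup (illustrative HTML, not copied from the site), find_all and select locate the same elements:

```python
from bs4 import BeautifulSoup

html = ('<div class="key-stats-container">'
        '<div class="stat-box"><label>NET IMPORT</label><p>0.16 %</p></div>'
        '</div>')
page_soup = BeautifulSoup(html, "html.parser")

# snake_case name, same behaviour as the old findAll
boxes = page_soup.find_all("div", class_="stat-box")
# or a CSS selector, mirroring the Selenium code above
boxes_css = page_soup.select(".key-stats-container > .stat-box")

print(boxes == boxes_css)  # True
print(boxes[0].find("label").text)  # NET IMPORT
```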
