Issues designing a simple web scraper in Python

I followed an online tutorial step by step and successfully built a web scraper identical to the one it describes.

However, when I run this code against the website I actually want to scrape, the console returns only blank data. I was hoping someone could look at the short code I have written to gather the data and tell me whether I have done this correctly, or whether the website uses some protocol I am unaware of that prevents data from being scraped from it.

# import libraries
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 

myurl = "http://smartgriddashboard.eirgrid.com/#all/generation"

# opening up connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# find the data of interest
key_stats = page_soup.findAll("div",{"class":"key-stats-container"})

When I then inspect key_stats, all that appears is []. As I said before, when I did this on the sample web page from the online tutorial, all of the data within that class was stored.

I am not a programmer by profession and all of this is very new to me, so any and all assistance would be hugely appreciated.

The issue is that the div you are trying to scrape from the page is generated dynamically using JavaScript. It's not in the HTML source code, which means urllib.request doesn't have access to that information. When you load the page in your browser, you should notice that the information isn't on the screen immediately: the statistics appear a few seconds after the page loads.
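One way to confirm this is to count how often a class name occurs in the HTML the server actually sends. The sketch below uses only the standard library; `ClassCounter`, `count_class`, and the `static_html` string are made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up stand-in for a page whose content is filled in later by JavaScript.
static_html = '<body><div id="app"></div><script src="app.js"></script></body>'

print(count_class(static_html, "key-stats-container"))  # prints 0
```

Running `count_class(page_html.decode("utf-8"), "key-stats-container")` on the bytes returned by `uClient.read()` would show whether the container is present before any JavaScript runs.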

You can either look through the website's JavaScript or source files to find where the information is coming from (usually JSON or XML files), or use something like Selenium (an automated browser) to parse the page once the relevant elements are on it:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

try:
    driver.get("http://smartgriddashboard.eirgrid.com/#all/generation")  # load the page
    # wait up to 15 seconds for the statistics to be added to the page
    WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.key-stats-container > .stat-box')))
    # collect every stat box and print its label and its value
    stat_elements = driver.find_elements(By.CSS_SELECTOR, '.key-stats-container > .stat-box')
    for el in stat_elements:
        print(el.find_element(By.CSS_SELECTOR, 'label').text)
        print(el.find_element(By.CSS_SELECTOR, 'p').text)
except TimeoutException:
    # we've waited 15 seconds and the stats haven't appeared
    print("The statistics did not appear within 15 seconds.")
finally:
    driver.quit()  # always close the browser, even if there was an error

WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.key-stats-container > .stat-box'))) waits until an element matching the CSS selector is found, or times out after 15 seconds. You can change the 15 seconds to whatever you want.
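Conceptually, an explicit wait keeps re-checking a condition until it returns something truthy or the timeout runs out. The following is a simplified sketch of that idea in plain Python, not Selenium's actual implementation; `wait_until` and its parameter names are hypothetical.

```python
import time


def wait_until(condition, timeout=15.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Example: a condition that only becomes true after about 0.2 seconds.
start = time.monotonic()
found = wait_until(lambda: time.monotonic() - start > 0.2,
                   timeout=2, poll_interval=0.05)
print(found)  # prints True
```

With Selenium, the condition is the expected-condition object, which is called with the driver and returns the element once it exists.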

Instead of just waiting for .key-stats-container, I waited for .key-stats-container > .stat-box (an element with the class stat-box that is a direct child of .key-stats-container), since there was a point at which .key-stats-container had loaded but the statistics hadn't:

<div class="key-stats-container">
    <span class="load"></span>
    <div class="error-msg">
        <p>We had some trouble gathering the data.</p>
        <p>Refresh to try again.</p>
    </div>
</div>

Here is the output:

LATEST SYSTEM
GENERATION
4,885 MW
THERMAL GENERATION
(COAL, GAS, OTHER)
56.81 %
RENEWABLE
GENERATION
43.03 %
NET
IMPORT
0.16 %
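The first option mentioned in this answer, reading the data file that the page's JavaScript requests, might look roughly like the sketch below. The real request URL and the shape of its JSON are not given here, so everything named in the code is an assumption: `fetch_json` and `to_pairs` are hypothetical helpers, and the `name`/`value` keys and the sample record are placeholders. The actual URL would have to be found in the browser's developer tools (Network tab) while the dashboard loads.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_pairs(records, name_key="name", value_key="value"):
    """Turn a list of JSON objects into (name, value) tuples.

    The key names are placeholders; adjust them to match the real data.
    """
    return [(record[name_key], record[value_key]) for record in records]


# Made-up sample standing in for a response body, so no network is needed here.
sample = json.loads('[{"name": "example stat", "value": 1.5}]')
for name, value in to_pairs(sample):
    print(name, value)  # prints: example stat 1.5
```

Once the real URL is known, `to_pairs(fetch_json(url))` would replace the sample, and no browser automation is needed.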

It doesn't look like the whole page is being downloaded. You can check this with print(page_soup.prettify()).

A way around this is to use Selenium to open up a web browser and then download the page:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

driver = webdriver.Firefox()  # start Firefox with the default profile
driver.get('http://smartgriddashboard.eirgrid.com/#all/generation')
page_soup = soup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the page source has been captured
  • Note that Selenium needs geckodriver to be installed to drive Firefox.
  • I'm sure there's a better way, using Requests or something else.
  • A super simple way is to get the page source by right-clicking in your web browser and then having Beautiful Soup parse that.

On a side note, while it works, your findAll seems to be the old method name. The newer find_all method or CSS selectors are probably better.
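For comparison, here is a small sketch of both newer styles in Beautiful Soup 4. The HTML string is a hand-written sample shaped after the selectors used in the Selenium answer, not the site's real markup.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the real page structure may differ.
html = """
<div class="key-stats-container">
  <div class="stat-box"><label>NET IMPORT</label><p>0.16 %</p></div>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find_all is the current name for findAll; class_ avoids the reserved word "class"
by_method = page_soup.find_all("div", class_="stat-box")

# select takes a CSS selector, here the direct children of the container
by_selector = page_soup.select(".key-stats-container > .stat-box")

for box in by_selector:
    print(box.find("label").get_text(), box.find("p").get_text())
```

Both calls return the same elements for this sample; the selector form is convenient when the match depends on the parent element.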
