
BeautifulSoup Number Extraction

I'm trying to get the degrees from this weather site, but it keeps returning a blank answer. This is my code:

import requests
from bs4 import BeautifulSoup

# -----------------------------get site info------------------------------- #


URL = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
request = requests.get(URL)
# print(request.content)

# ----------------------parse site info---------------- #

soup = BeautifulSoup(request.content, 'html5lib')

#print(soup.prettify().encode("utf-8"))

weatherdata = soup.find('span', class_='temp')

print(weatherdata)

Those values may be rendered dynamically, i.e. they may be populated by JavaScript in the page.

requests.get() simply returns the markup received from the server, without any client-side changes applied, so this is not just a matter of waiting.
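As a rough illustration of that point, here is a minimal sketch that parses two made-up HTML fragments: one as a server might send it before any script runs, and one after the value has been filled in. Both fragments and the helper name `extract_temp` are assumptions for illustration, not markup copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: what a server might send before JavaScript fills in the value.
STATIC_HTML = '<div class="hourly"><span class="temp"></span></div>'
# Hypothetical markup: the same fragment after client-side rendering.
RENDERED_HTML = '<div class="hourly"><span class="temp">10</span></div>'


def extract_temp(html):
    """Return the text of the first <span class="temp">, or None if it is missing or empty."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_="temp")
    if span is None:
        return None
    text = span.get_text(strip=True)
    return text or None


print(extract_temp(STATIC_HTML))    # None
print(extract_temp(RENDERED_HTML))  # 10
```

If the same check on `request.content` gives `None`, the value is most likely being added in the browser rather than sent by the server.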

You could use Selenium with the Chrome WebDriver to load the page URL and get the page source. (The Firefox driver works too.)

Go to chrome://settings/help to check your current Chrome version, then download the driver for that version. Make sure to keep the driver file either in your PATH or in the same folder as your Python script.
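To check that the driver is where Selenium can find it, something like the following sketch may help. It only uses the standard library; the helper name `find_chromedriver` and the assumption that the executable is called `chromedriver` (or `chromedriver.exe` on Windows) are mine.

```python
import os
import shutil


def find_chromedriver(script_dir="."):
    """Return a path to chromedriver from PATH or script_dir, or None if not found."""
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    # Fall back to the given folder; the file name differs on Windows.
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = os.path.join(script_dir, name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


print(find_chromedriver() or "chromedriver not found on PATH or in the current folder")
```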

Try this:

from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome  # pip install selenium
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"

# Make it headless, i.e. run in the background without opening a Chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium 4 takes the driver path through a Service object
# (older releases accepted executable_path directly on Chrome())
service = Service(executable_path="./chromedriver")

# Use Chrome to get the page with JavaScript-generated content
with Chrome(service=service, options=chrome_options) as browser:
    browser.get(url)
    page_source = browser.page_source

#Parse the final page source
soup = bs(page_source, 'html.parser')

weatherdata = soup.find('span', class_='temp')

print(weatherdata.text)
10

References:

Get page generated with Javascript in Python

selenium - chromedriver executable needs to be in PATH

The problem seems to be that the data is loaded via JavaScript, so the value for that specific span takes a while to appear. When you make your request the span is still empty; the value only loads a bit later. One possible solution is to use Selenium to wait for the page to load and then extract the HTML afterwards.

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source

soup = BeautifulSoup(html, 'html.parser')
elem = soup.find('span', class_='temp')

print(elem.text)
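The snippet above reads `page_source` as soon as `get()` returns, which may still be before the script has filled in the span. A hedged sketch of an explicit wait follows, using Selenium 4's `WebDriverWait`. The helper names `fetch_rendered_html` and `main`, the `span.temp` CSS selector, and the 15-second timeout are assumptions, and the sketch has not been run against the live site.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Chrome; return the page source once css_selector has text."""
    # Imported lazily so this module can be imported where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Poll until the first matching element has non-empty text.
        # A missing element is retried; TimeoutException is raised after `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, css_selector).text.strip() != ""
        )
        return browser.page_source
    finally:
        browser.quit()


def main():
    url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
    html = fetch_rendered_html(url, "span.temp")
    elem = BeautifulSoup(html, "html.parser").find("span", class_="temp")
    print(elem.text if elem is not None else "span.temp not found")
```

Calling `main()` runs the whole flow. If the first `span.temp` on the page is hidden, its visible text stays empty and the wait will time out, so the selector may need narrowing.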
