Scraping with Beautiful Soup does not update values properly
I'm trying to scrape a weather website, but the data does not update properly. Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'

while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)
I'm looking at the 'WIND & GUST' values in the 'CURRENT CONDITIONS' section. The first value prints correctly (e.g. 1.0 / 2.2 mph), but after that the values update very slowly (sometimes taking 5+ minutes), even though they change every 10-30 seconds on the website.
And even when the values do update in Python, they still differ from the current values on the website.
You could try this alternative approach: since the site actually retrieves its data from another URL, you can make that request directly, and only scrape the site every hour or so to refresh the request URL.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta

# def getReqUrl...

reqUrl = getReqUrl()
prevTime, prevAt = '', datetime.now()

while True:
    ures = json.loads(urlopen(reqUrl).read())
    if 'observations' not in ures:
        # the request URL expires after a while - refresh it
        reqUrl = getReqUrl()
        ures = json.loads(urlopen(reqUrl).read())

    # to see time since last update
    obvTime = ures['observations'][0]['obsTimeUtc']
    td = (datetime.now() - prevAt).seconds

    wSpeed = ures['observations'][0]['imperial']['windSpeed']
    wGust = ures['observations'][0]['imperial']['windGust']
    print('', end=f'\r[+{td}s -> {obvTime}]: {wSpeed} / {wGust} mph')

    if prevTime < obvTime:
        prevTime = obvTime
        prevAt = datetime.now()
        print('')
Even when requesting directly, the "observation time" in the retrieved data sometimes jumps back and forth, which is why I only print on a new line when obvTime increases - without that, it looks like this. (If you prefer, you can print normally without the '',end='\r... formatting, and then the second if block is no longer needed.)
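That "only commit a line when the observation time increases" logic can be sketched offline with made-up timestamps (the times below are invented for illustration; ISO-8601 UTC strings compare chronologically as plain strings, which is what the prevTime < obvTime check relies on):

```python
# Synthetic observation times mimicking the API sometimes returning
# the same observation between real updates.
obs_times = [
    '2023-01-01T10:00:00Z',
    '2023-01-01T10:00:00Z',  # same observation again -> overwrite the line
    '2023-01-01T10:00:30Z',  # strictly newer -> commit with a newline
    '2023-01-01T10:00:30Z',
    '2023-01-01T10:01:00Z',
]

prev_time = ''
committed = []  # observation times that got their own line
for obv_time in obs_times:
    print('', end=f'\r[{obv_time}]')  # \r overwrites the current line
    if prev_time < obv_time:          # strictly newer observation
        prev_time = obv_time
        committed.append(obv_time)
        print('')                     # keep this line, start a new one
print(committed)
```

Only the three distinct timestamps end up committed; the repeats just overwrite the in-progress line.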
The first if block is for refreshing reqUrl (since it expires after a while), and that is when I actually scrape the wunderground site, because the URL is inside one of their script tags:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    nxtSt = json.loads(appText.replace('&q;','"'))['wu-next-state-key']
    return [
        ns for ns in nxtSt.values()
        if 'observations' in ns['value'] and
        len(ns['value']['observations']) == 1
    ][0]['url'].replace('&a;','&')
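The decoding step can be demonstrated offline. The fragment below is a hypothetical stand-in for the #app-root-state contents (the apiKey and structure are invented), mimicking how the page escapes double quotes as &q; and ampersands as &a;:

```python
import json

# Hypothetical fragment mimicking the #app-root-state script contents.
app_text = ('{&q;wu-next-state-key&q;:{&q;k1&q;:{'
            '&q;value&q;:{&q;observations&q;:[{}]},'
            '&q;url&q;:&q;https://api.weather.com/v2/pws/observations/current'
            '?apiKey=XYZ&a;stationId=KORPISTO1&q;}}}')

# Restore the quotes, parse the JSON, pick the entry holding observations,
# then restore the ampersands in its URL.
nxt_st = json.loads(app_text.replace('&q;', '"'))['wu-next-state-key']
req_url = [
    ns for ns in nxt_st.values()
    if 'observations' in ns['value'] and len(ns['value']['observations']) == 1
][0]['url'].replace('&a;', '&')
print(req_url)
```

This prints the reconstructed request URL with a real & between the query parameters.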
Or, since I know how the URL starts, more simply:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    rUrl = 'https://api.weather.com/v2/pws/observations/current'
    rUrl = rUrl + appText.split(rUrl)[1].split('&q;')[0]
    return rUrl.replace('&a;','&')
Try the next example:
from bs4 import BeautifulSoup
import requests
url= 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
req = requests.get(url)
soup = BeautifulSoup(req.text,'lxml')
divs = soup.select('div[class="weather__data weather__wind-gust"] div')
d = {divs[0].text: divs[1].text.replace('\xa0°', '')}
print(d)
Output:
{' WIND & GUST ': '1.1 / 2.2mph'}
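The scraped value bundles both numbers into one string; a small follow-up sketch (assuming the same ' WIND & GUST ' output format shown above) splits it into floats:

```python
# Parse the scraped '1.1 / 2.2mph' string into numeric wind / gust values.
raw = {' WIND & GUST ': '1.1 / 2.2mph'}

value = next(iter(raw.values()))                     # '1.1 / 2.2mph'
wind_s, gust_s = value.replace('mph', '').split('/') # ['1.1 ', ' 2.2']
wind, gust = float(wind_s), float(gust_s)            # float() ignores whitespace
print(wind, gust)
```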
Scraping the dynamic values with Selenium and bs4:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
url= 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
driver.get(url)
driver.maximize_window()
time.sleep(3)
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
divs = soup.select('div[class="weather__data weather__wind-gust"] div')
d = {divs[0].text: divs[1].text.replace('\xa0°', '')}
print(d)
To get the updated values, you can use Selenium together with bs4: the values are loaded dynamically by JavaScript, and Selenium renders that dynamic content before bs4 parses it.
Output:
{' WIND & GUST ': '1.3 / 2.2mph'}
Try:
import requests
from bs4 import BeautifulSoup
url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers) # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
#'WIND & WIND GUST' in 'CURRENT CONDITIONS' section
wind_gust = [
    float(i.text)
    for i in soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
                 .find_next('div', class_='weather__text')
                 .select('span.wu-value-to')
]
print(wind_gust)
[1.8, 2.2]
wind = wind_gust[0]
gust = wind_gust[1]
print(wind)
1.8
print(gust)
2.2