
Web Scraping with Python BS

Trying to scrape some weather data off of Weather Underground. I haven't had any difficulty getting the data of interest until I came to the day/date, hi/lo temps, and forecast (e.g. "Partly Cloudy"). Each is in a div without a class. The parent of each is a div with class="obs-date" (see image below).

[WxUn HTML image][1]

My attempted code is below, with other options commented out. Each returns an empty list.

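import requests as req
from bs4 import BeautifulSoup

def get_wx(city, state):
    city = city.lower()
    state = state.lower()

    # get current conditions; 'weather' in url
    # (get_current is defined elsewhere; not shown here)
    current_dict = get_current(city, state)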

    # get forecast; 'forecast' in url
    f_url = f'https://www.wunderground.com/forecast/us/{state}/{city}'
    f_response = req.get(f_url)
    f_soup = BeautifulSoup(f_response.text, 'html.parser')
    cast_dates = f_soup.find_all('div', class_="obs-date")
    # cast_dates = f_soup.find_all('div', attrs={"class":"obs-date"})
    # cast_dates = f_soup.select('div.obs-date')
    print(cast_dates)
    
get_wx("Portland", "ME")

Any help with what I'm missing is appreciated.

As far as I can see, the whole block you're trying to parse is rendered by JavaScript, which is why you get empty results using BeautifulSoup: the raw HTML returned by the request does not contain those elements yet.
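One way to see this for yourself is to compare what the parser finds with what the raw text contains. The sketch below does not hit the site; `raw_html` is a made-up stand-in for a response body in which the markup exists only inside a script string, which is an assumption about how such pages commonly behave and not a copy of Weather Underground's actual source.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a raw (unrendered) response body: the
# "obs-date" markup only exists inside a JavaScript string.
raw_html = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML =
      '<div class="obs-date"><div>Thu 15</div></div>';
  </script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The parser treats script contents as plain text, so no element is found.
in_dom = soup.find_all("div", class_="obs-date")

# The class name still appears in the raw text, which hints that the
# block is built client-side.
in_text = "obs-date" in raw_html

print(in_dom)
print(in_text)
```

If the element list is empty for the real response too, the content is most likely added after page load and a plain request will never see it.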

The ADDITIONAL CONDITIONS part, as well as everything below it, could be parsed completely using bs4. The table at the end could be parsed using pandas.
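For the table, a minimal sketch with `pandas.read_html` might look like the following. The table markup and its values are invented for illustration (the real page's columns will differ), and `read_html` needs an HTML parser backend such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Invented table markup standing in for the table at the end of the page.
html = """
<table>
  <thead><tr><th>Day</th><th>High</th><th>Low</th></tr></thead>
  <tbody>
    <tr><td>Mon</td><td>55</td><td>41</td></tr>
    <tr><td>Tue</td><td>58</td><td>43</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
forecast = tables[0]

print(forecast)
```

With a real response you would pass the page's HTML text (wrapped in `StringIO`) instead of the invented string.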

To scrape JavaScript-rendered content, you can use the requests-html or selenium libraries.

from requests_html import HTMLSession
import json

session = HTMLSession()
url = "https://www.wunderground.com/weather/us/me/portland"
response = session.get(url)
# run the page's JavaScript in headless Chromium (downloaded on first use)
response.html.render(sleep=1)

data = []

current_date = response.html.find('.timestamp strong', first = True).text
weather_conditions = response.html.find('.condition-icon p', first = True).text
gusts = response.html.find('.medium-uncentered span', first = True).text
current_temp = response.html.find('.current-temp .is-degree-visible', first = True).text

data.append({
    "Last update": current_date,
    "Current weather": weather_conditions,
    "Temperature": current_temp,
    "Gusts": gusts,
})

print(json.dumps(data, indent = 2, ensure_ascii = False))

Output:

[
  {
    "Last update": "1:27 PM EDT on April 14, 2021",
    "Current weather": "Fair",
    "Temperature": "49 F",
    "Gusts": "13 mph"
  }
]
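Once you have the rendered HTML (with requests-html this should be available as `response.html.html` after `render()`), the class-less child divs can be reached through their parent. This is only a sketch: I can't see the linked image, so the markup in `rendered_html` is assumed from the description in the question (a `div.obs-date` parent with plain `div` children), and the values are made up.

```python
from bs4 import BeautifulSoup

# Assumed structure, based on the question's description; values are made up.
rendered_html = """
<div class="obs-date">
  <div>Thu 15</div>
  <div>52 | 40 F</div>
  <div>Partly Cloudy</div>
</div>
<div class="obs-date">
  <div>Fri 16</div>
  <div>48 | 39 F</div>
  <div>Rain</div>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

forecast = []
for block in soup.find_all("div", class_="obs-date"):
    # recursive=False keeps only the direct children of the parent div
    children = block.find_all("div", recursive=False)
    forecast.append([child.get_text(strip=True) for child in children])

for row in forecast:
    print(row)
```

Each row then holds the day, temperatures, and forecast text in document order, so they can be unpacked by position as long as the real page keeps that order.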
