
BeautifulSoup python module can't find text in a website

I am trying to get the temperature from weather.com using BeautifulSoup. If I open the URL and inspect the element, the text I am looking for, 8:00 pm, is on the page. However, the code outputs None (a NoneType object) and can't find an instance of the text. I also tried weather_entry=soup.find(text="8.00"), and that didn't yield any results either.

import requests
import re
from bs4 import BeautifulSoup
  
def weather():
    url='https://weather.com/weather/hourbyhour/l/823266028e3362e3a9578cfe64cb1c6ac654c492d22b41dbe3ac567cd31e1083'
      
    #open with GET method
    resp=requests.get(url)
      
    #HTTP response 200 means OK status
    if resp.status_code==200:
        
        soup=BeautifulSoup(resp.text,'html.parser')    

        # this line is the problem; .find("8:00") and .find(text=re.compile("8:00")) don't work either
        weather_entry=soup.find(text=re.compile("8:00 pm"))

        print(str(weather_entry)+"\n")
        print(weather_entry.get_text())
        
    else:
        print("Error")
          
weather()

I think the weather information you are trying to find is contained in JavaScript. If you switch to the Debugger tab in the developer console (I'm using Firefox), you will see a folder called hourly/assets which contains a lot of JS scripts.

I've tried to use Beautiful Soup to read weather websites before and came up against the exact same problem. The solution I found (which may not be available to you) was to ask the website for access to the raw data via JSON or an API.

Another solution I have used previously is to find the website of an amateur weather station, which is far more likely to be written in pure HTML.
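If a provider does offer raw data, reading it is usually simpler than scraping. The sketch below is only an illustration: the payload, its field names (hourly, time, temperature) and the helper temperature_at are hypothetical, since each API defines its own format. In a live program the text would typically come from requests.get(...) against whatever endpoint the provider documents.

```python
import json

# Hypothetical payload: a real API defines its own structure and field names.
SAMPLE_JSON = """
{"hourly": [
  {"time": "19:00", "temperature": 33},
  {"time": "20:00", "temperature": 31}
]}
"""

def temperature_at(payload, time_label):
    """Return the temperature recorded for time_label, or None if absent."""
    data = json.loads(payload)
    for entry in data.get("hourly", []):
        if entry.get("time") == time_label:
            return entry.get("temperature")
    return None

print(temperature_at(SAMPLE_JSON, "20:00"))
```

Returning None for a missing time keeps the "not found" case explicit instead of raising an error later.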

Your assertion that the HTML contains the text 8:00 is somewhat misleading. The HTML returned for the URL in your program has a huge chunk of JavaScript with JSON data which does indeed contain 8:00 (although not 8:00 pm, one of the strings you searched for). Fortunately the actual HTML tags contain the data too, but as 8 pm, which is also how the times are shown when the page is rendered in a browser. Since the data does exist in the HTML, it can be extracted with BeautifulSoup, but a bit more work is needed to home in on the data you're after. If you wanted to get the data from the JSON/JavaScript instead, you'd probably want to approach that differently, as per @simpleApp's comment.
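One way to check where a string actually lives is to list the parent tag of every matching text node. This is a minimal sketch run against a made-up, heavily simplified stand-in for the page (SAMPLE_HTML and the helper locate_text are assumptions, not the real weather.com markup):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, reduced to two elements.
SAMPLE_HTML = """
<html><body>
<script>window.__data = {"time": "8:00"};</script>
<div id="titleIndex3"><h2>8 pm</h2><div><span>31°</span></div></div>
</body></html>
"""

def locate_text(html, pattern):
    """Return the names of the tags whose text nodes match pattern."""
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(pattern)
    # string= is the current name of the older text= argument.
    return [node.parent.name for node in soup.find_all(string=regex)]

for pattern in (r"8:00 pm", r"8:00", r"8 pm"):
    print(pattern, "->", locate_text(SAMPLE_HTML, pattern))
```

On this sample, 8:00 pm matches nothing, 8:00 is found only inside the script tag, and 8 pm is found inside the h2 tag, which mirrors the situation described above.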

When scraping web pages I highly recommend downloading the HTML of the page concerned so that you can look closely at its structure, tags etc., which will help you work out how to handle it with BeautifulSoup. Downloaded HTML generally isn't formatted, which makes it very hard to read, but you can fix that easily with a pretty-printer/formatter. Tidy is one option (specify -indent on the command line).
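If Tidy isn't to hand, BeautifulSoup's own prettify() method gives a similar indented view. A small sketch, using a made-up fragment in place of the downloaded page (the helper name pretty_html is an assumption):

```python
from bs4 import BeautifulSoup

def pretty_html(html):
    """Return html re-indented with one tag or text node per line."""
    return BeautifulSoup(html, "html.parser").prettify()

# Hypothetical fragment; in the real program this would be resp.text.
raw = '<div id="titleIndex3"><h2>8 pm</h2><div><span>31°</span></div></div>'
print(pretty_html(raw))
```

Writing the result to a file and opening it in an editor makes it easier to search for the time and temperature you expect to see.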

Looking at the now nicely formatted HTML, it has the info you're looking for in tags like this (simplified for legibility):

  <div id="titleIndex3">    (the number after titleIndex increments)
    <h2>8 pm</h2>
    <div>
      <span>31°</span>

So you can use BeautifulSoup to go through the titleIndexN <div>s to find the temperature for the desired time:

    soup=BeautifulSoup(resp.text,'html.parser')

    for div_tag in soup.find_all('div'):
        if div_tag.has_attr('id') and div_tag['id'].startswith('titleIndex'):
            if div_tag.h2.text == '8 pm':
                print(div_tag.div.span.text)

In fact the site displays data for more than one day, so you'll need a further condition to narrow it down to the day you want, but the above sample code should get you on the right track.
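Because a time label can appear once per day, it may help to collect every match first and then decide which one you want. The following is a sketch under assumptions: SAMPLE_HTML is a made-up fragment modelled on the simplified structure above, and temperatures_at is a hypothetical helper. It also skips blocks that lack an h2 or span rather than raising AttributeError.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: two days' worth of entries share the "8 pm" label.
SAMPLE_HTML = """
<div id="titleIndex3"><h2>8 pm</h2><div><span>31°</span></div></div>
<div id="titleIndex4"><h2>9 pm</h2><div><span>30°</span></div></div>
<div id="titleIndex27"><h2>8 pm</h2><div><span>28°</span></div></div>
<div id="other"><p>no heading here</p></div>
"""

def temperatures_at(html, time_label):
    """Return (div id, temperature text) pairs for every matching time."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Let find_all filter on the id attribute instead of testing each div.
    for div_tag in soup.find_all("div", id=re.compile(r"^titleIndex\d+$")):
        heading = div_tag.find("h2")
        if heading is None or heading.get_text(strip=True) != time_label:
            continue
        span = div_tag.find("span")
        if span is not None:
            results.append((div_tag["id"], span.get_text(strip=True)))
    return results

print(temperatures_at(SAMPLE_HTML, "8 pm"))
```

The matches come back in document order, so on the real page the first one would presumably be the earliest day, but that is worth confirming against the downloaded HTML.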

