
Python BeautifulSoup missing tag that I believe to be clearly there

I have searched a few different questions and haven't found an exact match for this. My search criteria are quite simple, so I'm thinking maybe the problem is that I don't understand something about the HTML, or about how BeautifulSoup works.

I am searching for the Actual Precipitation amount in the Summary table on this site. In the link below, its value is 0.01. I will eventually iterate over this site, inserting a different day into the URL for each day of the year, giving me the daily rainfall for Houston for each day of 2019.

https://www.wunderground.com/history/daily/KHOU/date/2019-01-01

[Screenshot of the rendered page and its HTML, included to make it extra clear what I want to find.]

My code is below:

import requests
from bs4 import BeautifulSoup

result = requests.get('https://www.wunderground.com/history/daily/KHOU/date/2019-01-01')
content = result.content
soup = BeautifulSoup(content, 'html.parser')  # html.parser: the built-in parser described in the BS4 documentation


test1 = soup.find_all('div', {'class':'summary-table'})
len(test1)  #1
print(test1[0].prettify())
#<div _ngcontent-sc235="" class="summary-table">
# No data recorded
# <!-- -->
# <!-- -->
#</div>

#I also tried just finding the tbody tags directly
test2 = soup.find_all('tbody')
len(test2)  #1
print(test2[0].prettify())
#<tbody _ngcontent-sc225="">
# <!-- -->
#</tbody>

Expected Output:

Test1: I expected to get a list of the div tags that had class = 'summary-table' (I'm pretty sure there is only one of these). I was expecting this to also contain all of the div tags and tbody/thead tags that were within it.

Test2: I was expecting this to be a list of all of the tbody tags, so I could iterate over them to find what I wanted.

Since I can very clearly see the tags that I want to grab, I feel like there is something obvious I'm missing here. Any help would be greatly appreciated!

You can't expect to use BeautifulSoup for scraping absolutely everything - not all webpages are the same. In the case of the page you are trying to scrape, the data in the table is generated asynchronously via JavaScript. Here's a coarse sequence of events that take place when you visit your URL in a browser:

  1. Your browser makes an initial HTTP GET request to the URL of the webpage.
  2. The server responds and serves you that HTML document.
  3. Your browser parses the document and makes many more asynchronous requests to the server (and potentially different servers) to resources that it needs to completely render the page as it is meant to be seen by human eyes (fonts, images, etc.). At this point, the browser also makes requests to an API that serves JSON so that it may populate the table with data.

Your code basically only does step one. It makes a request to the bare-bones HTML document, which hasn't been populated yet. That's why BeautifulSoup can't see the data you're looking for. In general, you can only really use BeautifulSoup to scrape data from webpages if a given webpage has all the data baked into the HTML document. This used to be more common years ago, but I'd say nowadays most (modern) pages populate the DOM asynchronously using JavaScript.
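
You can confirm this for yourself with a quick sanity check on the raw response (a minimal sketch; 'No data recorded' is the placeholder text from your own prettify() output above):

import requests

response = requests.get('https://www.wunderground.com/history/daily/KHOU/date/2019-01-01')
# The empty placeholder div ships with the initial HTML, but the actual
# precipitation value only appears after JavaScript populates the page.
print('summary-table' in response.text)      # True -- the empty shell is there
print('No data recorded' in response.text)   # True -- the placeholder text too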

Usually, this is where people recommend you use Selenium or some other kind of headless browser to completely simulate a browsing session, but in your case that's overkill. To get the data you want, all you have to do is make requests to the same API (the one I mentioned earlier in step three) that your browser makes requests to. You don't even need BeautifulSoup for this.
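
For reference, the Selenium route would look roughly like this (a minimal sketch, assuming Selenium 4 and a local Chrome install; the CSS selector is a guess based on the markup shown in your question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # newer Chrome headless mode
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.wunderground.com/history/daily/KHOU/date/2019-01-01')
    # Wait until JavaScript has actually populated rows into the summary table.
    rows = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.summary-table tr'))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()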

If you log your network traffic, you will see that your browser makes several requests to an API serving JSON. Here are a couple links that show up. Go ahead and click on them to view the structure of the JSON response:

https://api.weather.com/v1/location/KHOU:9:US/almanac/daily.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&start=0101

https://api.weather.com/v1/location/KHOU:9:US/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20190101&endDate=20190101

There are more, but you get the idea. The query is pretty straightforward: you pass an API key and a date (or sometimes a start and end date) as query string parameters. I'm not sure what units=e means (presumably English/imperial units).

Also, this API doesn't seem to care about request headers, which is nice. That's not always the case - some APIs are very picky about all kinds of headers, like user-agent, etc. Headers wouldn't be very difficult to fake, either, but I appreciate simple APIs.
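
If you ever do run into an API that cares, faking headers with requests is straightforward (illustrative values only - this particular API needs none of this):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a regular browser
response = requests.get(
    'https://api.weather.com/v1/location/KHOU:9:US/almanac/daily.json',
    params={'apiKey': '<your key>', 'units': 'e', 'start': '0101'},
    headers=headers
)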

Here's some code I came up with:

def get_api_key():

    import requests
    import re

    url = "https://www.wunderground.com/history/daily/KHOU/date/2019-01-01"

    response = requests.get(url)
    response.raise_for_status()

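    # The page embeds its config as HTML-escaped JSON ("&q;" stands in for a
    # quote character), so a regex is the simplest way to fish the key out.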
    pattern = "SUN_API_KEY&q;:&q;(?P<api_key>[^&]+)"
    return re.search(pattern, response.text).group("api_key")

def get_average_precip(api_key, date):

    import requests

    url = "https://api.weather.com/v1/location/KHOU:9:US/almanac/daily.json"

    params = {
        "apiKey": api_key,
        "units": "e",
        "start": date.strftime("%m%d")
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    return response.json()["almanac_summaries"][0]["avg_precip"]

def get_total_precip(api_key, start_date, end_date):

    import requests

    url = "https://api.weather.com/v1/location/KHOU:9:US/observations/historical.json"

    params = {
        "apiKey": api_key,
        "units": "e",
        "startDate": start_date.strftime("%Y%m%d"),
        "endDate": end_date.strftime("%Y%m%d")
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

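    # Many observations report precip_total as null; take the first one that doesn't.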
    return next(obv["precip_total"] for obv in response.json()["observations"] if obv["precip_total"] is not None)

def get_hourly_precip(api_key, start_date, end_date):

    import requests

    url = "https://api.weather.com/v1/location/KHOU:9:US/observations/historical.json"

    params = {
        "apiKey": api_key,
        "units": "e",
        "startDate": start_date.strftime("%Y%m%d"),
        "endDate": end_date.strftime("%Y%m%d")
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    for observation in response.json()["observations"]:
        yield observation["precip_hrly"]

def main():

    import datetime

    api_key = get_api_key()

    # January 3rd, 2019
    date = datetime.date(2019, 1, 3)
    avg_precip = get_average_precip(api_key, date)

    start_date = date
    end_date = date
    total_precip = get_total_precip(api_key, start_date, end_date)

    print(f"The average precip. is {avg_precip}")
    print(f"The total precip between {start_date.isoformat()} and {end_date.isoformat()} was {total_precip:.2f} inches")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

The average precip. is 0.12
The total precip between 2019-01-03 and 2019-01-03 was 1.46 inches

I also defined a function get_hourly_precip, which I didn't actually use; I just implemented it for kicks. It pulls the precipitation data from the graph.
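
If you did want to call it, usage would look something like this (a sketch using the functions above; the printed values depend on what the API returns for that day):

import datetime

api_key = get_api_key()
date = datetime.date(2019, 1, 3)
# One precip_hrly value per observation over the day.
for hourly in get_hourly_precip(api_key, date, date):
    print(hourly)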
