
BeautifulSoup parser is not parsing full webpage

I am using BeautifulSoup to parse the CDC website for US COVID case counts for one of my summer projects. The problem is that the full HTML is not loaded when I use .find_all; I only get the content of the CDC site's header and footer. I also tried the lxml and html.parser parsers, and neither worked.

import requests, html5lib
from bs4 import BeautifulSoup

#website used for webscraping

result = requests.get('https://www.cdc.gov/covid-data-tracker/#cases')

source = result.content

soup = BeautifulSoup(source,'html5lib')

data = [words.text for words in soup.find_all('div')]

print(data)

The page is indeed rendered with JavaScript, which is why requests only receives the header and footer. However, that JavaScript fetches JSON files, and you can request those directly to get the data you want. Just fetch the JSON and parse it as you would any other JSON.

import requests
# import json

r = requests.get('https://www.cdc.gov/covid-data-tracker/Content/CoronaViewJson_01/US_MAP_DATA.json')
j = r.json()

# print(json.dumps(j, indent=4))  # Print the whole JSON object

# Print data for state
for state in j['US_MAP_DATA']:
    print(f"State: {state['name']}, Total cases: {state['tot_cases']}, Total death: {state['tot_death']}")

Outputs:

State: Alaska, Total cases: 2878, Total death: 23
State: Alabama, Total cases: 85906, Total death: 1567
State: Arkansas, Total cases: 41759, Total death: 442
State: American Samoa, Total cases: 0, Total death: 0
State: Arizona, Total cases: 170798, Total death: 3626
etc.

If you uncomment the import and print lines, the code prints the whole JSON object.
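Because the parsed result is ordinary Python lists and dictionaries, it can be sorted or filtered like any other data. Here is a minimal sketch, assuming each entry has the `name`, `tot_cases` and `tot_death` keys used above and that the counts are numbers rather than strings. The sample rows are copied from the output shown earlier so the snippet runs without a network connection, and `top_states` is a hypothetical helper name.

```python
# Sample shaped like the 'US_MAP_DATA' list in the answer; the figures
# are the ones printed in the output above, not a live download.
sample = {
    "US_MAP_DATA": [
        {"name": "Alaska", "tot_cases": 2878, "tot_death": 23},
        {"name": "Alabama", "tot_cases": 85906, "tot_death": 1567},
        {"name": "Arkansas", "tot_cases": 41759, "tot_death": 442},
        {"name": "American Samoa", "tot_cases": 0, "tot_death": 0},
        {"name": "Arizona", "tot_cases": 170798, "tot_death": 3626},
    ]
}


def top_states(payload, n=3):
    """Return the n entries with the most total cases (hypothetical helper)."""
    rows = payload["US_MAP_DATA"]
    return sorted(rows, key=lambda row: row["tot_cases"], reverse=True)[:n]


for row in top_states(sample):
    print(f"{row['name']}: {row['tot_cases']} cases, {row['tot_death']} deaths")
```

To run it against live data, pass the `j` object from the code above instead of `sample`.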

To find other JSON files that may be of interest, right-click the web page in Chrome or Firefox, select Inspect, open the Network tab, filter by XHR, and refresh the page.
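When a file found in the Network tab has an unfamiliar layout, it can help to look at its top-level shape before writing any parsing code. A rough sketch: `describe` is a hypothetical helper, and the dictionary passed to it here is a small stand-in, not real CDC data.

```python
def describe(obj):
    """Map each top-level key of a parsed JSON object to a short type summary."""
    summary = {}
    for key, value in obj.items():
        if isinstance(value, (list, dict)):
            # Containers: report the type and how many items they hold.
            summary[key] = f"{type(value).__name__} (len {len(value)})"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in payload for an offline demonstration.
print(describe({"US_MAP_DATA": [{"name": "Alaska"}], "note": "example"}))

# With a live response, assuming `requests` is imported and `url` is one of
# the JSON addresses seen in the Network tab:
# print(describe(requests.get(url).json()))
```

This assumes the top level of the file is a JSON object; a file whose top level is an array would need `describe` applied to one of its elements instead.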


 