
Python: I am trying to web scrape a page but I am not able to find the html

I am trying to scrape this page (https://www.polarislist.com/). I want to pull all of the data, such as class size, free/reduced lunch, student/teacher ratio, percentage of student demographics by race, and the respective counts of MIT, Harvard, and Princeton admits.

However, when I inspect the page source, I cannot find the tag that contains this information.

I am using Python 3.7 and bs4 (BeautifulSoup 4).

What I have so far:

# Import libraries
import requests
from bs4 import BeautifulSoup

page_link = 'https://www.polarislist.com'
page_response = requests.get(page_link, timeout=5)

# Parse the HTML returned by the server (no JavaScript is executed here)
page_content = BeautifulSoup(page_response.content, "html.parser")
result_name_of_hs = page_content.find_all('div', attrs={'data-test': 'name'})
print(result_name_of_hs)

Output: []

I expected BS4 to find the identified tag and pull it from the site, but I cannot find the tag anywhere in the page source.

I see the following when I inspect the element in the browser, but I could not get the div with data-test="name":

<div class="font-size-20 font-weight-semi-bold block-with-text" data-test="name">THOMAS JEFFERSON HIGH SCHOOL</div>

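The data you see is loaded asynchronously by the page. requests only receives the initial HTML and does not run JavaScript, which is why find_all returns an empty list. When you open the Firefox/Chrome developer tools, you will see that the data is pulled from a different URL (in this case https://www.polarislist.com/api/high_schools_orange_cake).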
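One way to confirm this for yourself, as a minimal sketch: check whether the attribute seen in the inspector occurs in the raw HTML the server sends. The helper names `appears_in_static_html` and `fetch_static_html` are hypothetical, and the `static_shell` string is an invented stand-in for a JavaScript-rendered page, not the site's actual markup.

```python
def appears_in_static_html(html: str, marker: str = 'data-test="name"') -> bool:
    """Return True if the marker text occurs in the given HTML string."""
    return marker in html


def fetch_static_html(url: str, timeout: float = 5.0) -> str:
    """Download the page as the server sends it, before any JavaScript runs."""
    import requests  # imported here so the check above works without network access

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Element as shown in the browser inspector (after JavaScript has run).
rendered = '<div class="font-size-20" data-test="name">THOMAS JEFFERSON HIGH SCHOOL</div>'
# Invented example of an empty shell that a script fills in later.
static_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(appears_in_static_html(rendered))      # True
print(appears_in_static_html(static_shell))  # False
```

Calling `appears_in_static_html(fetch_static_html('https://www.polarislist.com'))` should tell you whether the element is present in the server response; the empty list in the question suggests it is not.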

To load the data from the JSON endpoint, you can use this:

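import json
import requests

url = 'https://www.polarislist.com/api/high_schools_orange_cake'

# Request the JSON endpoint directly instead of the HTML page
response = requests.get(url, timeout=5)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()

# Pretty-print the decoded data
print(json.dumps(data, indent=4))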

Prints:

[
    {
        "id": 18450,
        "name": "THOMAS JEFFERSON HIGH SCHOOL",
        "city": "ALEXANDRIA",
        "state": "VA",
        "public": true,
        "num_senior": 423,
        "num_american_indian": 39,
        "num_asian": 1084,
        "num_hispanic": 34,
        "num_black": 24,
        "num_white": 530,
        "student_teacher_ratio": "16.93",
        "num_free_reduced_lunch": 33,
        "total_students": 1820,

    ... and so on.
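From there, the fields asked about in the question can be derived from each record. The following is a rough sketch that uses only the keys visible in the output above. The function name `summarize_school` and the `pct_*` output keys are made up for illustration, and the keys for MIT/Harvard/Princeton admits are left out because they do not appear in the truncated output.

```python
RACE_KEYS = ("num_american_indian", "num_asian", "num_hispanic", "num_black", "num_white")


def summarize_school(record: dict) -> dict:
    """Pick out a few fields and convert counts to percentages of total_students."""
    total = record.get("total_students") or 0

    def pct(count):
        # Percentage of total enrollment, or None if it cannot be computed.
        if not total or count is None:
            return None
        return round(100 * count / total, 1)

    ratio = record.get("student_teacher_ratio")  # delivered as a string, e.g. "16.93"
    summary = {
        "name": record.get("name"),
        "city": record.get("city"),
        "state": record.get("state"),
        "total_students": total,
        "student_teacher_ratio": float(ratio) if ratio else None,
        "pct_free_reduced_lunch": pct(record.get("num_free_reduced_lunch")),
    }
    for key in RACE_KEYS:
        # "num_asian" -> "pct_asian"
        summary["pct_" + key[len("num_"):]] = pct(record.get(key))
    return summary


# The first record from the output above, limited to the keys that are shown.
sample = {
    "id": 18450, "name": "THOMAS JEFFERSON HIGH SCHOOL", "city": "ALEXANDRIA",
    "state": "VA", "public": True, "num_senior": 423, "num_american_indian": 39,
    "num_asian": 1084, "num_hispanic": 34, "num_black": 24, "num_white": 530,
    "student_teacher_ratio": "16.93", "num_free_reduced_lunch": 33,
    "total_students": 1820,
}

summary = summarize_school(sample)
print(summary["name"], summary["pct_asian"], summary["student_teacher_ratio"])
```

With the real response you could apply it to every record, for example `[summarize_school(r) for r in data]`, assuming `data` is the list returned by the endpoint.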
