
Scrape webpage: no AJAX calls made, but data not in DOM

I'm doing an exercise in scraping data from a website. For example, ZocDoc. I'm trying to get a list of all insurance providers and their plans (you can access this information on the homepage in the insurance dropdown).

It appears that all data is loaded via a <script> tag when the page loads. Looking in the network tab, there don't appear to be any network calls that return JSON containing the plan names. I am able to get all the insurance plans with the following (it's messy, but it works):

  import json
  import requests
  from bs4 import BeautifulSoup as bs

  resp = requests.get('https://zocdoc.com')
  soup = bs(resp.text, 'html.parser')
  long_str = str(soup.findAll('script')[17].string)
  pop = long_str.split("Popular Insurances")[1]
  json.loads(pop[pop.find("[["):pop.find("]]") + 2])

In the HTML returned there are no insurance plans. I also don't see any requests in the network tab where the plans are sent back (there are a few backbone files). One URL looks encoded, but I'm not sure that's it; I may just be overthinking that URL.

I've also tried waiting for all the JS to load (so the data ends up in the DOM) using dryscrape, but there are still no plans in the HTML.

Is there a way to gather this information without having a crawler click on every insurance provider to get their plans?

Yes, the list of insurances is kept deep inside the script tag:

insuranceModel = new gs.CarrierGroupedSelect(gs.CarrierGroupedSelect.prototype.parse({
    ...
    primary_options: {
        name: "Popular Insurances",
        group: "primary",
        options: [[300,"Aetna",2,0,1,0],[304,"Blue Cross Blue Shield",2,1,1,0],[307,"Cigna",2,0,1,0],[369,"Coventry Health Care",2,0,1,0],[358,"Medicaid",2,0,1,0],[322,"UniCare",2,0,1,0],[323,"UnitedHealthcare",2,0,1,0]]
    },
    secondary_options: {
        name: "All Insurances",
        group: "secondary",
        options: [[440,"1199SEIU",2,0,1,0],[876,"20/20 Eyecare Plan",2,0,1,1],...]
    }
    ...

You can, of course, dive into the wonderful world of parsing JavaScript code in Python, either with regular expressions or with JavaScript parsers like slimit (example here), but this might result in less hair on your head. Plus, the resulting solution would be quite fragile.
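For illustration, here is roughly what that fragile approach could look like - a minimal regex sketch, assuming the "Popular Insurances" label and the [[...]] options layout shown above stay stable (the regex itself is my own guess, not anything the site guarantees):

import json
import re

import requests

# Pull the embedded options array straight out of the raw page source.
# This depends on the exact "Popular Insurances" label and the [[...]]
# layout shown above, which is why it breaks as soon as the page changes.
html = requests.get('https://www.zocdoc.com').text
match = re.search(r'name:\s*"Popular Insurances".*?options:\s*(\[\[.*?\]\])', html, re.DOTALL)
if match:
    for carrier in json.loads(match.group(1)):
        print(carrier[1])  # the second field appears to hold the carrier name, e.g. "Aetna"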

In this particular case, I think selenium is a much better fit. Complete working example - getting the popular insurances:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.maximize_window()
driver.get("https://www.zocdoc.com")

# wait for the insurance dropdown trigger to become clickable, then open it
wait = WebDriverWait(driver, 10)
insurance_dropdown = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "I'll choose my insurance later")))
insurance_dropdown.click()

# iterate over the options in the "primary" (Popular Insurances) group
for option in driver.find_elements_by_css_selector("[data-group=primary] + .ui-gs-option-set > .ui-gs-option"):
    print(option.get_attribute("data-value"))

driver.close()

Prints:

Aetna
Blue Cross Blue Shield
Cigna
Coventry Health Care
Medicaid
UniCare
UnitedHealthcare

Note that in this case the headless PhantomJS browser is used, but you can use Chrome, Firefox, or any other browser for which selenium has a driver.
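For instance, if you'd rather avoid the now-deprecated PhantomJS, here is a sketch of the same scrape with headless Chrome. It assumes a selenium 4 installation with Chrome available on the machine; the selectors are the same ones used above and may have changed on the live site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# run Chrome without a visible window instead of PhantomJS
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.zocdoc.com")
wait = WebDriverWait(driver, 10)
insurance_dropdown = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "I'll choose my insurance later")))
insurance_dropdown.click()

# selenium 4 replaces find_elements_by_css_selector() with find_elements()
for option in driver.find_elements(By.CSS_SELECTOR, "[data-group=primary] + .ui-gs-option-set > .ui-gs-option"):
    print(option.get_attribute("data-value"))

driver.quit()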
