
Scrape webpage no ajax calls made but data not in DOM

I'm doing an exercise in scraping data from a website, for example ZocDoc. I'm trying to get a list of all insurance providers and their plans (you can access this information in the insurance dropdown on their homepage).

It appears that all data is loaded via a <script> tag when the page loads. Looking in the network tab, there don't appear to be any network calls that return JSON containing the plan names. I am able to get all the insurance plans with the following (it's messy, but it works):

  import json
  import requests
  from bs4 import BeautifulSoup as bs

  resp = requests.get('https://zocdoc.com')
  soup = bs(resp.text, 'html.parser')

  # The insurance data is embedded in one of the inline <script> tags
  long_str = str(soup.findAll('script')[17].string)
  pop = long_str.split("Popular Insurances")[1]
  json.loads(pop[pop.find("[["):pop.find("]]") + 2])
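
This yields a nested list of option arrays along the lines of:

  [[300, "Aetna", 2, 0, 1, 0], [304, "Blue Cross Blue Shield", 2, 1, 1, 0], ...]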

In the HTML returned there are no insurance plans. I also don't see any requests in the network tab where the plans are sent back (there are a few backbone files). One URL looks encoded, but I'm not sure that's it; I may just be overthinking that URL.

I've also tried using dryscrape to wait for all the JS to load so the data would be in the DOM, but there are still no plans in the HTML.
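
For reference, the dryscrape attempt looked roughly like this (a minimal sketch; the exact waiting logic varied):

  import dryscrape
  from bs4 import BeautifulSoup as bs

  # Render the page in a headless WebKit session so the inline JS runs
  session = dryscrape.Session()
  session.visit('https://zocdoc.com')

  # Parse the rendered DOM -- the plan names still aren't in it
  soup = bs(session.body(), 'html.parser')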

Is there a way to gather this information without having a crawler click on every insurance provider to get their plans?

Yes, the list of insurances is kept deep inside the script tag:

insuranceModel = new gs.CarrierGroupedSelect(gs.CarrierGroupedSelect.prototype.parse({
    ...
    primary_options: {
        name: "Popular Insurances",
        group: "primary",
        options: [[300,"Aetna",2,0,1,0],[304,"Blue Cross Blue Shield",2,1,1,0],[307,"Cigna",2,0,1,0],[369,"Coventry Health Care",2,0,1,0],[358,"Medicaid",2,0,1,0],[322,"UniCare",2,0,1,0],[323,"UnitedHealthcare",2,0,1,0]]
    },
    secondary_options: {
        name: "All Insurances",
        group: "secondary",
        options: [[440,"1199SEIU",2,0,1,0],[876,"20/20 Eyecare Plan",2,0,1,1],...]
    }
    ...

You can, of course, dive into the wonderful world of parsing JavaScript code in Python, either with regular expressions or with JavaScript parsers like slimit (example here), but this might result in less hair on your head. Plus, the resulting solution would be quite fragile.
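
For completeness, a minimal sketch of the regex route, assuming the embedded data keeps the `name: "Popular Insurances" ... options: [[...]]` shape shown above:

import json
import re

import requests

resp = requests.get('https://zocdoc.com')

# Grab the options array that follows the "Popular Insurances" label;
# this breaks as soon as the page layout or key names change
match = re.search(r'name:\s*"Popular Insurances".*?options:\s*(\[\[.*?\]\])',
                  resp.text, re.DOTALL)
if match:
    for option in json.loads(match.group(1)):
        print(option[1])  # the carrier name is the second element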

In this particular case, I think selenium is a much better fit. Complete working example - getting the popular insurances:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.maximize_window()
driver.get("https://zocdoc.com")

# Wait for the insurance dropdown trigger to become clickable, then open it
wait = WebDriverWait(driver, 10)
insurance_dropdown = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "I'll choose my insurance later")))
insurance_dropdown.click()

# Print every option in the "primary" (popular insurances) group
for option in driver.find_elements_by_css_selector("[data-group=primary] + .ui-gs-option-set > .ui-gs-option"):
    print(option.get_attribute("data-value"))

driver.close()

Prints:

Aetna
Blue Cross Blue Shield
Cigna
Coventry Health Care
Medicaid
UniCare
UnitedHealthcare

Note that in this case the headless PhantomJS browser is used, but you can use Chrome, Firefox, or any other browser that selenium has an available driver for.
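
For instance, a sketch of swapping in headless Chrome (assuming chromedriver is available on your PATH), since PhantomJS is no longer maintained:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Drop-in replacement for the webdriver.PhantomJS() line above
driver = webdriver.Chrome(options=options)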
