
Scraping links from a webpage with JavaScript in R

I am trying to scrape the URLs of the individual providers from http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D

I looked at the page source and identified the URLs of interest. For example, I want to scrape "http://www.childrenshospital.org/doctors/mirna-aeschlimann" from the following node:

<a data-layer-event="searchClick" data-bind="attr: {href: model.Url}" href="http://www.childrenshospital.org/doctors/mirna-aeschlimann"><!--ko text: model.FirstName-->Mirna<!--/ko--><!--ko text: ' ' + model.LastName--> Aeschlimann<!--/ko--><!--ko if: model.Suffix-->, <!--ko text: model.Suffix-->MD<!--/ko--><!--/ko--></a>

I tried the following code. However, for some reason it does not return the node above.

library(XML)  # provides htmlTreeParse()

base_html <- "http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D"
# Parses only the HTML the server returns; no JavaScript is executed
doc <- htmlTreeParse(base_html, useInternalNodes = TRUE)

Any help would be greatly appreciated. Please let me know if I need to provide more information.

Have you tried targeting the XHR requests the site makes to fetch its data?
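For context on why parsing the page directly comes up empty: the node in the question carries a `data-bind="attr: {href: model.Url}"` attribute, which suggests the `href` is filled in by JavaScript after the page loads. The sketch below is only an illustration under that assumption. The `static_html` string is a hypothetical stand-in for what the server might send before any script runs, not the site's actual markup.

```r
library(xml2)

# Hypothetical stand-in for the pre-JavaScript markup: the anchor has only
# the binding expression, and no resolved href yet.
static_html <- '<html><body>
  <a data-layer-event="searchClick" data-bind="attr: {href: model.Url}"></a>
</body></html>'

doc <- read_html(static_html)
links <- xml_find_all(doc, "//a")

# The binding expression is there ...
xml_attr(links, "data-bind")

# ... but the href attribute is missing, so xml_attr() returns NA
xml_attr(links, "href")
```

A static parser never runs the script that would resolve the binding, which is why asking the data endpoint directly, as in the code that follows, is the simpler route.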

library(httr)
library(purrr)
library(xml2)
library(dplyr)
library(jsonlite)

# Request result pages 1 to 17 from the site's AJAX endpoint and row-bind them
map_df(1:17, function(i) {

  POST("http://www.childrenshospital.org/searchdirectory.ajax",
       body = list(search_query = "",
                   search_specialties = "",
                   search_languages = "",
                   search_gender = "", search_departments = "",
                   search_programs = "", search_userlocation = "",
                   search_radius = "10", search_pcp = "true",
                   search_specialist = "false",
                   search_type = "directorynurse|directoryphysician",
                   search_letter = "", search_querygroup = "dirnametext",
                   search_page = as.character(i)),  # page for this iteration
       encode = "form") -> res

  content(res, as="text") %>%
    fromJSON() %>%
    .$Records %>%
    # Decode HTML entities in Address one row at a time,
    # since read_html() expects a single string
    mutate(Address = map_chr(Address, ~ xml_text(read_html(paste0("<x>", .x, "</x>"))))) %>%
    as_tibble()

}) -> tmp_df

glimpse(tmp_df)
## Observations: 408
## Variables: 21
## $ ID             <chr> "{E8ECAF3B-B49C-4CD8-AB16-6CE63F0379C0}", "{1E1...
## $ FirstName      <chr> "Jonathan", "Barbara", "Mark", "Maura", "Sarah"...
## $ LastName       <chr> "Schwab", "Seagle", "Shapira", "Shea", "Sheldon...
## $ Image          <chr> "/~/media/directory/physicians/schwab_jonathan....
## $ Suffix         <chr> "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD",...
## $ Url            <chr> "http://www.childrenshospital.org/doctors/jonat...
## $ Gender         <chr> "male", "female", "male", "female", "female", "...
## $ Latitude       <chr> "42.3344382", "42.326435", "41.559642", "42.423...
## $ Longitude      <chr> "-72.6618324", "-71.149499", "-70.939315", "-71...
## $ Address        <chr> "{\"practice_name\":\"Northampton Area Pediatri...
## $ Distance       <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
## $ OtherLocations <chr> "", "", "", "", "", "", "", "", "", "", "Westwo...
## $ AcademicTitle  <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
## $ HospitalTitle  <chr> "Pediatrician", "Pediatrician", "Pediatrician",...
## $ Specialties    <chr> "Pediatrics", "General Pediatrics, Pediatrics, ...
## $ Departments    <chr> "", "General Pediatrics", "General Pediatrics",...
## $ Languages      <chr> "", "English", "", "English", "English", "", "E...
## $ PPOCLink       <chr> "http://www.childrenshospital.org/patient-resou...
## $ Gallery        <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
## $ Phone          <chr> "(413) 584-8700", "(617) 731-0200", "(508) 996-...
## $ Fax            <chr> "(413) 584-1714", "(617) 731-0289", "(508) 992-...
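Since the question asks for the provider links specifically, those are in the `Url` column of the result. Here is a minimal offline sketch of that last step. The `sample_json` string is made up to mirror the `Records` structure shown in the output above: the first URL comes from the question, and the second entry is a hypothetical placeholder.

```r
library(jsonlite)

# Made-up sample mimicking the shape of the endpoint's response;
# "Jane Doe" and her URL are placeholders, not real data.
sample_json <- '{
  "Records": [
    {"FirstName": "Mirna", "LastName": "Aeschlimann", "Suffix": "MD",
     "Url": "http://www.childrenshospital.org/doctors/mirna-aeschlimann"},
    {"FirstName": "Jane", "LastName": "Doe", "Suffix": "MD",
     "Url": "http://www.example.org/doctors/jane-doe"}
  ]
}'

# fromJSON() simplifies the array of objects into a data frame
records <- fromJSON(sample_json)$Records

# Keep the distinct provider URLs
provider_urls <- unique(records$Url)
provider_urls
```

With the real result, the equivalent would be `unique(tmp_df$Url)`.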
