
Use R to Scrape Information from Planned Parenthood Website

I am trying to use the rvest library to scrape certain information from a Planned Parenthood website. The page I am looking at is the Tennessee health-center listing whose URL appears in the code below. I want to pull the Services Offered list on the right side of a facility page, such as "abortion services", "birth control", etc. I have the following code, which I think is close:

library(rvest)

# Start at the Tennessee health-center listing
URL <- "https://www.plannedparenthood.org/health-center/tn"
Webpage <- read_html(URL)
# Collect the facility links and turn them into absolute URLs
all_links <- Webpage %>%
  html_nodes("p a") %>%
  html_attr('href') %>%
  paste0('https://www.plannedparenthood.org', .)
# Follow the first facility link
URL <- all_links[1]
Website <- URL
Webpage <- read_html(URL)
Services <- Webpage %>% html_nodes("ul li a") %>% html_attr("href")

I start at the main Planned Parenthood page for Tennessee and navigate to the first facility listed there. Can someone help me obtain the services offered?

This should do the trick:

library(rvest)

URL <- "https://www.plannedparenthood.org/health-center/tn"
Webpage <- read_html(URL)
all_links <- Webpage %>%
  html_nodes("p a") %>%
  html_attr('href') %>%
  paste0('https://www.plannedparenthood.org', .)
URL <- all_links[1]
Website <- URL
Webpage <- read_html(URL)
# Select only the links inside the "services" element and keep their text
Services <- Webpage %>% html_nodes(".services a") %>% html_text()

Which gives:

> Services
[1] "Abortion Services"                            "Birth Control"                                "HIV Testing"                                  "LGBTQ Services"                              
[5] "Men's Health Care"                            "Morning-After Pill (Emergency Contraception)" "Pregnancy Testing & Services"                 "STD Testing, Treatment & Vaccines"           
[9] "Women's Health Care" 

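I only changed the last line, which now ends in %>% html_nodes(".services a") %>% html_text()

The selector .services a is more specific than ul li a: it matches only the links inside the element with class services. html_text() then returns the visible text of each matched link instead of its href attribute.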
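
To make the difference concrete, here is a minimal offline sketch. The HTML snippet is hypothetical markup invented for illustration, not the actual structure of the Planned Parenthood page, so treat the class names and links in it as assumptions.

```r
library(rvest)

# Hypothetical markup standing in for a facility page; the real page's
# structure and class names may differ.
page <- read_html('
  <div class="services">
    <ul>
      <li><a href="/learn/birth-control">Birth Control</a></li>
      <li><a href="/learn/hiv-testing">HIV Testing</a></li>
    </ul>
  </div>
  <ul class="nav">
    <li><a href="/about">About Us</a></li>
  </ul>')

# Broad selector + html_attr(): every list link on the page, returned as URLs
broad <- page %>% html_nodes("ul li a") %>% html_attr("href")

# Narrow selector + html_text(): only the service links, returned as labels
services <- page %>% html_nodes(".services a") %>% html_text()
```

In this sketch, broad also picks up the navigation link and holds href values, while services holds only the two service labels.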

If you are not familiar with CSS, try this Google Chrome add-on, which makes it easier to find the right CSS selector.

