
Use R to loop through a list of URLs from a CSV, open each URL, and assess whether the site contains a certain text string

This might be an easy R question, but I'm still learning.

I have a long list of URLs from the EPA contained in a CSV that link to particular discharge permits/facilities. Each row of the CSV contains a single URL. Some URLs go to an active page with information about the facility available and others (the ones I'm ultimately interested in identifying) go to a page that reads "No program facility found for NPDES - [permit number]."

I want to use R to go through this CSV of URLs, open each URL, and return a TRUE or FALSE value indicating whether the URL is good or not. A "bad" URL is one whose page returns the "No program facility found" text. Ideally, the TRUE or FALSE values would be added into a column next to the site URL so I can easily go through and identify which links are good and which aren't.

I would appreciate any advice you might have for where to get started!

I was able to set this up to work with a single link at a time using library(httr).

library(httr)

# Bad URL
site1 <- GET("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VA0086738&pgm_sys_acrnm_in=NPDES")
contents1 <- content(site1, "text")
any(grepl("No program facility found", contents1))
# [1] TRUE

# Good URL
site2 <- GET("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VAG401896&pgm_sys_acrnm_in=NPDES")
contents2 <- content(site2, "text")
any(grepl("No program facility found", contents2))
# [1] FALSE
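
To apply this to the whole CSV, I imagine wrapping the check in a function and applying it over the URL column; here is a rough sketch of what I have in mind (the file name "urls.csv" and the "Links" column are just placeholders for my actual file):

library(httr)

# Returns TRUE if the page shows the "No program facility found" message
is_bad_url <- function(u) {
  contents <- content(GET(u), "text")
  any(grepl("No program facility found", contents))
}

urls <- read.csv("urls.csv", stringsAsFactors = FALSE)  # placeholder file name
urls$bad_link <- vapply(urls$Links, is_bad_url, logical(1))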

Here is a solution with only the two links you provided:

library(httr)

I wrote the following lines to create a dataset so other readers can reproduce the example (you can skip this and start from the next block of code):

#stackoverflow_question_links<- data.frame("Links"=c("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VA0086738&pgm_sys_acrnm_in=NPDES","https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VAG401896&pgm_sys_acrnm_in=NPDES"))
#write.csv(stackoverflow_question_links, "stackoverflow_question_links.csv")

Assuming that your dataset is called "stackoverflow_question_links.csv", we start by reading it into R:

fileName <- "stackoverflow_question_links.csv"
dat <- read.csv(fileName, stringsAsFactors = FALSE)  # one URL per row, in the Links column
results <- NULL  # will hold each link with its status (TRUE = "No program facility found")
for (i in seq_len(nrow(dat))) {
  site <- GET(dat$Links[i])
  contents <- content(site, "text")
  results <- rbind(results,
                   data.frame(Link = dat$Links[i],
                              Status = any(grepl("No program facility found", contents))))
}
View(results) # or write.csv(results, "links_status.csv")
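
One caveat with this loop: if a request fails outright (a timeout or an unreachable host), GET() throws an error and the loop stops partway through. A minimal defensive variant, sketched with base R's tryCatch() and httr's http_error() (the variable names are my own), leaves Status as NA for failed requests:

results <- data.frame(Link = dat$Links, Status = NA)
for (i in seq_len(nrow(dat))) {
  site <- tryCatch(GET(dat$Links[i]), error = function(e) NULL)
  if (is.null(site) || http_error(site)) next  # request failed; leave Status as NA
  contents <- content(site, "text")
  results$Status[i] <- any(grepl("No program facility found", contents))
}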


We can also use rvest to do this. Assuming your data frame is called df and all the links are in its url column, we can create a new column (text_found) that indicates whether the text ('No program facility found') was found at that URL. If the text is not found at the URL, it is a good URL, and vice versa.

library(rvest)
library(dplyr)

df %>%
  mutate(text_found = purrr::map_lgl(url, ~ .x %>%
                                       read_html() %>%
                                       html_text() %>%
                                       grepl("No program facility found", .)),
         Good_URL = !text_found)


                                  url       text_found    Good_URL
1 https://iaspub.epa.gov/enviro......             TRUE       FALSE
2 https://iaspub.epa.gov/enviro......            FALSE        TRUE

data

df <- data.frame(url = c("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VA0086738&pgm_sys_acrnm_in=NPDES", 
                         "https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=VAG401896&pgm_sys_acrnm_in=NPDES"), 
                 stringsAsFactors = FALSE)
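
As with the httr loop, read_html() will error on a dead link and abort the whole pipeline. A hedged sketch using purrr::possibly() to return NA instead (safe_text is my own helper name, not part of rvest):

library(rvest)
library(dplyr)
library(purrr)

# Returns the page text, or NA_character_ if the request or parse fails
safe_text <- possibly(function(u) u %>% read_html() %>% html_text(),
                      otherwise = NA_character_)

df %>%
  mutate(page_text  = map_chr(url, safe_text),
         text_found = grepl("No program facility found", page_text),
         Good_URL   = !text_found & !is.na(page_text))  # a failed fetch is not "good"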
