
Trying to webscrape an unchanging URL with data spread over pages

I am new to web scraping. The URL I am working with is https://tsmc.tripura.gov.in/doc_list. At present, I can extract the data table from the first page only. Because the URL does not change when I move between pages, I have no page identifier to loop over in order to extract the tables from the remaining pages. Here is my code:

library(RCurl)  # getURL()
library(XML)    # readHTMLTable()
library(rlist)  # list.clean()

url1 <- getURL("https://tsmc.tripura.gov.in/doc_list",
               .opts = list(ssl.verifypeer = FALSE))
table1 <- readHTMLTable(url1)
table1 <- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table11 <- table1[["NULL"]]

Please help. Thanks!

Perhaps try this solution:

library(rvest) # read_html(), html_table()
library(plyr)  # ldply()

url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appear to be 30 pages, so create the sequence 1:30

links <- paste0(url, sq) # Paste each page number after "page="

store <- list()
tbl <- list()

for (i in links) {
  store[[i]] <- read_html(i)         # download and parse the page
  tbl[[i]] <- html_table(store[[i]]) # extract the tables from it
}

df <- ldply(tbl, data.frame) # combine the list of data frames into one large data frame
# Strip the URL prefix so that `.id` holds only the page number
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", "", df$`.id`, fixed = TRUE)

This gives 846 observations of 8 variables.

EDIT: I found that the URL of the first page does not carry a page number. To add the first page and rbind it with the rest of the data, use the following:

firsturl <- "https://tsmc.tripura.gov.in/doc_list"

first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0

df2 <- rbind(first_df, df)
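
The two steps above (the numbered pages plus the un-numbered first page) could also be folded into one pass. The following is only a sketch under the same assumptions as the answer: that the first page lives at the bare URL, that later pages respond to "?page=<n>", and that the table of interest is the first one on each page. The helper names page_urls, scrape_page and combine_pages are made up for illustration and are not part of rvest or plyr.

```r
# Build the list of page URLs: the bare URL is labelled "0" (as in the EDIT above),
# followed by "?page=1" ... "?page=n_pages". The URL pattern is an assumption.
page_urls <- function(base_url, n_pages) {
  urls <- c(base_url, paste0(base_url, "?page=", seq_len(n_pages)))
  names(urls) <- as.character(0:n_pages)
  urls
}

# Hypothetical helper: fetch one page and return its first table as a data frame,
# or NULL if the request fails or the page contains no table.
scrape_page <- function(url) {
  tryCatch({
    tables <- rvest::html_table(rvest::read_html(url))
    if (length(tables) == 0) NULL else as.data.frame(tables[[1]])
  }, error = function(e) NULL)
}

# Combine a named list of data frames into one, dropping NULL entries and
# recording each list name (the page label) in a `.id` column.
combine_pages <- function(tables) {
  tables <- Filter(Negate(is.null), tables)
  labelled <- Map(function(df, id) cbind(.id = id, df), tables, names(tables))
  out <- do.call(rbind, labelled)
  rownames(out) <- NULL
  out
}

# Usage (needs network access and the rvest package, so it is left commented out):
# urls   <- page_urls("https://tsmc.tripura.gov.in/doc_list", 30)
# df_all <- combine_pages(lapply(urls, scrape_page))
```

Returning NULL for a failed page means a single bad request does not abort the whole loop; whether silently skipping pages is acceptable depends on how complete the final table needs to be.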

