How can I write a web scraping function in R?
I'm trying to scrape a site with ten pages. I don't know how to write a loop to scrape all the pages, so I tried to create a function so that I only have to change the link.
See the function:
link = "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"
scraper <- function(link){
  page = read_html(link)
  titulo = page %>% html_nodes("h4 a") %>% html_text()
  tipo = page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
  data = page %>% html_nodes("p.col-md-6") %>% html_text()
  protocolo = page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
  situacao = page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
  regime = page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
  quorum = page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
  autoria = page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
  assunto = page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()
  result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
}
But when I run the function, nothing happens.
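One likely reason nothing appears: the last expression in `scraper` is an assignment (`result <- data.frame(...)`), and R returns the value of an assignment invisibly. The function still returns the data frame, but calling it at the console prints nothing unless the result is stored or printed. The sketch below shows this offline; `silent_fn` and `visible_fn` are hypothetical stand-ins for `scraper`, not part of the original code.

```r
# Hypothetical stand-ins for scraper(); no network access is needed.
silent_fn <- function() {
  # The assignment is the last expression, so its value is returned invisibly.
  result <- data.frame(x = 1:3)
}

visible_fn <- function() {
  result <- data.frame(x = 1:3)
  # Ending with the object itself returns it visibly, so it prints at the console.
  result
}

silent_fn()          # prints nothing
out <- silent_fn()   # the value is still returned and can be stored
print(out)           # printing it explicitly shows the data frame
visible_fn()         # prints the data frame directly
```

Under this assumption, ending `scraper` with `result` (or dropping the `result <-` assignment) and calling it as `df <- scraper(link)` should make the output visible.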
Scraping the first 5 pages into a tibble:
rm(list = ls())
library(tidyverse)
library(rvest)

# Scrape one results page, identified by its page number
get_content <- function(page) {
  # Build the URL for this page and select the result list
  content <-
    str_c(
      "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
      page,
      "&Documento=117&Modulo=8&AnoInicial=2022"
    ) %>%
    read_html() %>%
    html_elements(".data-list-hover")

  # One column per field; the tibble is the last expression, so it is returned visibly
  tibble(
    titulo = content %>% html_nodes("h4 a") %>% html_text2(),
    tipo = content %>% html_nodes("h4+ .row .col-md-4") %>% html_text2(),
    data = content %>% html_nodes("p.col-md-6") %>% html_text2(),
    protocolo = content %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text2(),
    situacao = content %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text2(),
    regime = content %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text2(),
    quorum = content %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text2(),
    autoria = content %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text2(),
    assunto = content %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text2()
  ) %>%
    # Strip carriage returns and collapse repeated whitespace in every column
    mutate(across(everything(), ~ str_remove_all(.x, "\r") %>%
      str_squish()))
}

# Apply get_content() to pages 1 to 5 and row-bind the results
map_dfr(1:5, get_content)
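Since the question mentions ten pages, the same approach can be extended by changing the page range. The sketch below only builds the ten URLs, so it can be checked without touching the site; `build_url` is a hypothetical helper that reuses the URL pieces from `get_content`, and it uses base `paste0` in place of `str_c` to stay dependency-free.

```r
# Hypothetical helper: builds the search URL for a given page number,
# reusing the same URL pieces as get_content().
build_url <- function(page) {
  paste0(
    "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
    page,
    "&Documento=117&Modulo=8&AnoInicial=2022"
  )
}

# One URL per page; only the Pagina parameter changes between them.
urls <- vapply(1:10, build_url, character(1))
length(urls)
urls[10]

# With get_content() defined as above, all ten pages could then be fetched with:
# all_pages <- map_dfr(1:10, get_content)
```

Inspecting `urls` first is a cheap way to confirm that only the `Pagina` parameter varies before sending any requests.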