How to write a function for web scraping in R?

Question

I'm trying to scrape a site with ten pages. I don't know how to write a loop to scrape all the pages, so I tried to create a function that would let me just change the link.

See the function:

link = "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"

scraper <- function(link){
  page = read_html(link)
  titulo = page %>% html_nodes("h4 a") %>% html_text()
  tipo = page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
  data = page %>% html_nodes("p.col-md-6") %>% html_text()
  protocolo = page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
  situacao = page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
  regime = page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
  quorum = page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
  autoria = page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
  assunto = page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()

  result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
}

But when I run the function, nothing happens.

Answer 1

Scraping the first 5 pages into a tibble:

rm(list = ls())
library(tidyverse)
library(rvest)

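get_content <- function(page) {
  # Build the URL for the requested page number, download the page,
  # and keep only the result-list elements
  content <-
    str_c(
      "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
      page,
      "&Documento=117&Modulo=8&AnoInicial=2022"
    ) %>%
    read_html() %>%
    html_elements(".data-list-hover")

  # Extract one column per field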
  tibble(
    titulo = content %>% html_nodes("h4 a") %>% html_text2(),
    tipo = content %>% html_nodes("h4+ .row .col-md-4") %>% html_text2(),
    data = content %>% html_nodes("p.col-md-6") %>% html_text2(),
    protocolo = content %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text2(),
    situacao = content %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>%  html_text2(),
    regime = content %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text2(),
    quorum = content %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text2(),
    autoria = content %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text2(),
    assunto = content %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text2()
    
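  ) %>%
    # Strip carriage returns and collapse repeated whitespace in every column
    mutate(across(everything(), ~ str_remove_all(.x, "\r") %>%
                    str_squish()))
}

# Scrape pages 1 to 5 and row-bind the results into one tibble
# (use 1:10 to cover all ten pages mentioned in the question)
map_dfr(1:5, get_content)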
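
A likely reason the function in the question appears to do nothing: its last expression is an assignment (`result <- data.frame(...)`), and R returns the value of an assignment invisibly, so nothing is printed unless the result is stored or printed explicitly. The sketch below illustrates this with toy stand-ins and shows one base-R way to loop over ten page URLs. The names `scraper_silent`, `scraper_visible`, `base_url`, `links`, and `all_pages` are hypothetical and not from the original post, and the stand-in functions do no real scraping.

```r
# Toy stand-ins (hypothetical, not from the original post): they only
# wrap the link in a data frame instead of downloading anything.
scraper_silent <- function(link) {
  result <- data.frame(link = link)  # assignment is last: returned invisibly
}

scraper_visible <- function(link) {
  result <- data.frame(link = link)
  result                             # bare value is last: returned visibly
}

# Build the ten page URLs by changing only the Pagina parameter.
base_url <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=%d&Documento=117&Modulo=8&AnoInicial=2022"
links <- sprintf(base_url, 1:10)

# Apply the function to every link and stack the results row-wise.
all_pages <- do.call(rbind, lapply(links, scraper_visible))

withVisible(scraper_silent(links[1]))$visible   # FALSE
withVisible(scraper_visible(links[1]))$visible  # TRUE
nrow(all_pages)                                 # 10
```

With the real scraper, the same pattern would apply: end the function with `result` (or drop the assignment), then call it on each link and keep the combined output in a variable.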

Solution 1
0 2022-12-03 22:44:25