How can I write a web scraping function in R?

I'm trying to scrape a site with ten pages. I don't know how to write a loop that scrapes all the pages, so I tried to write a function instead, so that I only have to change the link.

Here is the function:

link = "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"

scraper <- function(link){
  page = read_html(link)
  titulo = page %>% html_nodes("h4 a") %>% html_text()
  tipo = page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
  data = page %>% html_nodes("p.col-md-6") %>% html_text()
  protocolo = page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
  situacao = page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
  regime = page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
  quorum = page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
  autoria = page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
  assunto = page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()

  result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
}

But when I run the function, nothing happens.
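One likely reason for the silent result (an assumption, since the question shows neither the call nor any error message): the last expression in `scraper()` is an assignment, and R returns the value of an assignment invisibly, so nothing is printed at the console even though a data frame is returned. The sketch below reproduces that behaviour offline with two hypothetical helpers, `make_silent()` and `make_visible()`, which are not part of the original code.

```r
# Same shape as scraper(): the last expression is an assignment,
# so the value is returned invisibly and not auto-printed.
make_silent <- function(x) {
  result <- data.frame(value = x)
}

# Same body, but the data frame itself is the last expression,
# so the value is returned visibly and printed at the console.
make_visible <- function(x) {
  result <- data.frame(value = x)
  result
}

# withVisible() reports the returned value together with its visibility flag.
silent  <- withVisible(make_silent(1:3))
visible <- withVisible(make_visible(1:3))

silent$visible                          # FALSE
visible$visible                         # TRUE
identical(silent$value, visible$value)  # TRUE: the data is the same either way
```

If this is the cause, either ending the function with `result` or storing the call's value (`df <- scraper(link)`) and then printing `df` should make the output appear.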

Scraping the first 5 pages into a tibble

rm(list = ls())
library(tidyverse)
library(rvest)

# Scrape one results page, identified by its page number, into a tibble.
get_content <- function(page) {
  # Build the page URL, download it, and keep only the result entries.
  content <-
    str_c(
      "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
      page,
      "&Documento=117&Modulo=8&AnoInicial=2022"
    ) %>%
    read_html() %>%
    html_elements(".data-list-hover")

  # tibble() needs columns of equal length, so each selector has to match
  # the same number of elements on the page.
  tibble(
    titulo = content %>% html_elements("h4 a") %>% html_text2(),
    tipo = content %>% html_elements("h4+ .row .col-md-4") %>% html_text2(),
    data = content %>% html_elements("p.col-md-6") %>% html_text2(),
    protocolo = content %>% html_elements(".row:nth-child(3) .col-md-4") %>% html_text2(),
    situacao = content %>% html_elements(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text2(),
    regime = content %>% html_elements("p.col-md-4:nth-child(2)") %>% html_text2(),
    quorum = content %>% html_elements(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text2(),
    autoria = content %>% html_elements(".row:nth-child(5) .col-md-12") %>% html_text2(),
    assunto = content %>% html_elements(".row:nth-child(6) .col-md-12") %>% html_text2()
  ) %>%
    # Drop carriage returns and collapse repeated whitespace in every column.
    mutate(across(everything(), ~ str_remove_all(.x, "\r") %>%
                    str_squish()))
}

map_dfr(1:5, get_content)
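
Since the question mentions ten pages, the same pattern can be extended past page 5. The following is only a sketch in base R: `build_url()` and `scrape_pages()` are hypothetical helpers introduced here, the one-second pause is an assumed courtesy delay rather than a known requirement of the site, and `fake_scraper()` is an offline stand-in so the example runs without network access.

```r
# Hypothetical helper: build the search URL for a given page number,
# using the same URL pieces as get_content().
build_url <- function(page) {
  paste0(
    "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
    page,
    "&Documento=117&Modulo=8&AnoInicial=2022"
  )
}

# Hypothetical wrapper: call `scrape_one` once per page number, pause between
# calls, and stack the per-page data frames into one.
scrape_pages <- function(pages, scrape_one, pause = 1) {
  results <- lapply(pages, function(p) {
    Sys.sleep(pause)  # assumed delay between requests
    scrape_one(p)
  })
  do.call(rbind, results)
}

# Offline stand-in for a real scraper: returns one row per page.
fake_scraper <- function(page) {
  data.frame(page = page, url = build_url(page), stringsAsFactors = FALSE)
}

all_pages <- scrape_pages(1:10, fake_scraper, pause = 0)
nrow(all_pages)  # 10
```

With the packages above loaded, `scrape_pages(1:10, get_content)` would apply the same loop to the real site; `map_dfr(1:10, get_content)` is the shorter equivalent without the pause.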

Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you reproduce them, please cite this site's URL or the address of the original post.

 