Scraping web table using R and rvest

I'm new to web scraping with R. I'm trying to scrape the table generated by this link: https://gd.eppo.int/search?k=saperda+tridentata . In this specific case there is just one record in the table, but there could be more (I am actually interested only in the first column, but the whole table is fine).

I tried to follow the suggestion by Allan Cameron given here ( rvest, table with thead and tbody tags ), as the issue seems to be exactly the same, but without success, perhaps because I know little about how web pages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page". Where can I get this link? In this specific case I used " https://gd.eppo.int/media/js/application/zzsearch.js?7 ". Is this the right one?

My code is below. Thank you in advance!

library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)

pest.name <- "saperda+tridentata"

url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text") 

json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8") 

table_contents <- JSON     %>%
  {gsub("\\\\n", "\n", .)}  %>%
  {gsub("\\\\/", "/", .)}   %>%
  {gsub("\\\\\"", "\"", .)} %>%
  strsplit("html\":\"")    %>%
  unlist                   %>%
  extract(2)               %>%
  substr(1, nchar(.) -2)   %>% 
  paste0("</tbody>")

new_page <- gsub("</tbody>", table_contents, resp)

read_html(new_page)   %>%
  html_nodes("table") %>%
  html_table()

The data comes from a different endpoint, which you can see in the network tab of your browser's developer tools when you refresh the page. You can send a request to that endpoint with your search phrase in the query parameters and then extract the JSON you need from the response.

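library(httr)
library(jsonlite)
library(rvest)  # provides read_html(), html_node(), html_text() and the %>% pipe

# Query parameters for the search endpoint; 'k' is the search phrase
params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)

# Read the response as HTML, take the text of the first <p> node, and parse that text as JSON
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>% html_text())
print(data[[1]]$e)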
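
Since a search can return more than one record, the same field can be collected from every element of the parsed list rather than only the first. The following is a minimal sketch, not part of the original answer: the helper extract_field() is a hypothetical name, and the sample JSON with its placeholder values is an assumption that only mimics the shape implied by data[[1]]$e (a list of records, each with an e field). The real response may contain other fields or a different layout.

```r
library(jsonlite)

# Hypothetical sample standing in for the parsed response: a list of
# records, each with an `e` field. The values are placeholders, not real data.
sample_json <- '[{"e": "value_1"}, {"e": "value_2"}]'
data <- parse_json(sample_json)

# Collect one field from every record; records lacking the field give NA
extract_field <- function(records, field = "e") {
  vapply(records, function(rec) {
    value <- rec[[field]]
    if (is.null(value)) NA_character_ else as.character(value)
  }, character(1))
}

codes <- extract_field(data)
print(codes)
```

With the real response, calling extract_field(data) on the list returned by parse_json() should give one value per search result, assuming every record is a list whose e field holds a single value.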
