
Web Scraping with R - {xml_nodeset (0)}

I'm new to R and I'm trying to get data from this website: https://spritacular.org/gallery .
I want to get the location, the date, and the hour. I am following this guide; using SelectorGadget, I clicked on the elements I wanted (.card-title, .card-subtitle, .mb-0).
However, it always outputs {xml_nodeset (0)} and I'm not sure why it isn't finding those elements.

This is the code I have:

library(rvest)

url <- "https://spritacular.org/gallery"
sprite_gallery <- read_html(url)

sprite_location <- html_nodes(sprite_gallery, ".card-title, .card-subtitle, .mb-0")

sprite_location

When I point the same code at a different website it works, so I'm not sure what I'm doing wrong or how to fix it. This is my first time doing something like this, and I appreciate any insight you may have!

As per the comment, this website renders its content with JavaScript, so the information only appears when the page is opened in a browser; read_html only sees the empty shell. If you open the developer tools Network tab, you can see the underlying JSON data. If you send a GET request to this API address, you will get a list back with all the results. From there, you can slice and dice your way to the information you need.

One way to do this: I keyed on the name of the user who submitted each image, and found that the same user has submitted multiple images. Hence there are duplicate names and locations in the output, but the image URL differs per row. Refer to this blog to learn how to drill down into JSON data and build useful data frames in R.
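The rectangling pattern is easier to see on a toy payload first (hypothetical data, not the real API schema, just mimicking its results -> user_data + images nesting):

```r
library(tidyverse)

# a tiny nested list standing in for the parsed JSON
toy <- list(results = list(
  list(user_data = list(first_name = "Ada", last_name = "Lovelace"),
       images    = list(list(image = "a.jpg"), list(image = "b.jpg")))
))

flat <- pluck(toy, "results") %>%
  enframe() %>%                 # one row per result, the nested list in `value`
  unnest_wider(value) %>%       # spread user_data / images into list columns
  unnest_wider(user_data) %>%   # first_name, last_name become plain columns
  unnest_longer(images) %>%     # one row per image
  unnest_wider(images)          # image URL becomes a plain column
```

The same unnest_wider/unnest_longer moves, applied to the real response, give the pipeline below.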

library(httr)
library(tidyverse)

getURL <- 'https://api.spritacular.org/api/observation/gallery/?category=&country=&cursor=cD0xMTI%3D&format=json&page=1&status='

# get the raw json into R
UOM_json <- httr::GET(getURL) %>% 
  httr::content()

exp_output <- pluck(UOM_json, 'results') %>%
  enframe() %>%
  unnest_longer(value) %>%
  unnest_wider(value) %>%                # spread each observation's fields into columns
  select(user_data, images) %>%
  unnest_wider(user_data) %>%            # expose first_name, last_name, location
  mutate(full_name = paste(first_name, last_name)) %>%
  select(full_name, location, images) %>%
  rename(location_user = location) %>%   # user's own location; avoids clashing with the per-image location
  unnest_longer(images) %>%              # one row per submitted image
  unnest_wider(images) %>%
  select(full_name, location, image)     # `location` here comes from the image record
  

Output of exp_output:

> head(exp_output)
# A tibble: 6 × 3
  full_name     location                          image                                                                                
  <chr>         <chr>                             <chr>                                                                                
1 Kevin Palivec Jones County,Texas,United States  https://d1dzduvcvkxs60.cloudfront.net/observation_image/1d4cc82f-f3d2…
2 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/3b6391d1-f839…
3 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/9bcf10d7-bd7c…
4 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/a7dea9cf-8d6e…
5 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/539e0870-c931…
6 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/c729ea03-e1f8…
> 
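Since the same user appears once per image, a distinct() call collapses the duplicates if you only need one row per observer and location. A small sketch, using a toy stand-in for exp_output so it runs on its own:

```r
library(dplyr)

# toy stand-in for the exp_output tibble built above
exp_output <- tibble::tibble(
  full_name = c("Kamil Świca", "Kamil Świca", "Kevin Palivec"),
  location  = c("Lublin,Lublin Voivodeship,Poland",
                "Lublin,Lublin Voivodeship,Poland",
                "Jones County,Texas,United States"),
  image     = c("img1", "img2", "img3")
)

# keep one row per (full_name, location); the first image URL is retained
unique_observers <- exp_output %>%
  distinct(full_name, location, .keep_all = TRUE)
```

Drop .keep_all = TRUE if you do not want to carry the image column along.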
