Scrape image URL from website using R

I am trying to get the image URLs from a webpage using 'rvest' in R but have been unsuccessful. Below is the code:

library(rvest)
library(magrittr)

imageURL <- read_html("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue") %>%
    html_nodes(css = "img") %>%
    html_attr("src")

The same code works for "https://en.wikipedia.org/wiki/Lady_Jane_Grey".

I don't know where I am going wrong.

This is a tricky one, as Ista rightly points out. But one alternative to a full JavaScript solution is to parse the JSON that feeds those scripts.

A simple search of the page's HTML source shows that the image URLs are stored as JSON inside the script node that starts with the string "window.__PRELOADED_STATE__ =".

library(tidyverse)
library(rvest)
library(jsonlite)

obj <- read_html("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue")

extracted_json <- obj %>%
    html_nodes(xpath = '//script') %>%
    .[10] %>% # the relevant content was in the 10th script node
    html_text(trim = TRUE) %>%
    # strip the assignment prefix and the trailing semicolon to leave plain JSON
    gsub('^window\\.__PRELOADED_STATE__ = |;$', '', .)

object_json <- fromJSON(extracted_json, simplifyDataFrame = TRUE)
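
Picking the script node by position is fragile, since the index can change whenever the site adds or removes a script. A minimal sketch of selecting the node by its content instead is shown here. The sample_html string is a hypothetical stand-in for the downloaded page, not the real ajio.com source, and the variable names are illustrative.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the downloaded page: two script nodes, one of
# which carries the preloaded state.
sample_html <- '<html><body>
<script>var unrelated = 1;</script>
<script>window.__PRELOADED_STATE__ = {"product":{"productDetails":{"images":[{"url":"https://example.com/a.jpg"},{"url":"https://example.com/b.jpg"}]}}};</script>
</body></html>'

page <- read_html(sample_html)

# Text of every script node, one element per node.
scripts <- page %>% html_nodes("script") %>% html_text(trim = TRUE)

# Keep the node that starts with the marker instead of relying on its position.
marker <- "window.__PRELOADED_STATE__ = "
state_script <- scripts[startsWith(scripts, marker)][1]

# Drop the assignment prefix and the trailing semicolon, then parse the JSON.
state_json <- sub(";$", "", substring(state_script, nchar(marker) + 1))
state <- fromJSON(state_json)

image_urls <- state$product$productDetails$images$url
image_urls
```

On the real page, read_html() would take the URL rather than a string, and the rest should stay the same as long as the marker text is unchanged.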

We print object_json and search for a cluster of .jpg strings...

object_json

And we find one such cluster at the path "$product$productDetails$images", which happens to be a data frame rather than a simple list.

DF <- object_json$product$productDetails$images %>% as_tibble() # as_data_frame() is deprecated
unique(DF$url)
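
The manual search for .jpg strings in the printed object can also be done programmatically. The sketch below flattens a nested list and reports the path of every value containing ".jpg". The object_json built here is a small hypothetical stand-in with the same shape as the path found above, not the real parsed page.

```r
# Hypothetical nested list standing in for the parsed object_json.
object_json <- list(
  product = list(
    productDetails = list(
      name = "cap",
      images = data.frame(
        url = c("https://example.com/a.jpg", "https://example.com/b.jpg"),
        stringsAsFactors = FALSE
      )
    )
  )
)

# unlist() flattens the nesting and joins element names with dots,
# so each name records the path that leads to its value.
flat <- unlist(object_json)

# Keep only the values that contain ".jpg".
jpg_hits <- flat[grepl("\\.jpg", flat)]

names(jpg_hits)   # paths to the matching values
unname(jpg_hits)  # the matching URLs themselves
```

The names of jpg_hits show which branch of the list to index into, which is quicker than reading through the full printout.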

Open https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue in your web browser, right-click and select "view source" or similar. Then search the source for img. You won't find anything corresponding to the image you are interested in. Why? Because that page doesn't contain the image; it contains JavaScript that generates a page containing the image. The rvest package doesn't evaluate that JavaScript; it works directly with the source you see when you click the "view source" button in your browser.
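
A small illustration of that behaviour, using a hypothetical page (static_source is made up for this example) whose only image is created by a script at load time. Because rvest parses the markup without running the script, the img selector matches nothing.

```r
library(rvest)

# Hypothetical page: the img element exists only after the script has run.
static_source <- '<html><body>
<div id="root"></div>
<script>
  var img = document.createElement("img");
  img.src = "https://example.com/cap.jpg";
  document.getElementById("root").appendChild(img);
</script>
</body></html>'

# rvest sees the markup as written, so there is no img node to select.
img_src <- read_html(static_source) %>%
  html_nodes("img") %>%
  html_attr("src")

length(img_src)
```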

Bottom line: that page is going to be very difficult to work with using rvest. Your best bet is probably to use a browser driver instead, e.g., RSelenium.
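
A rough sketch of the browser-driver route, assuming the RSelenium package and a local Firefox installation are available. The helper names fetch_rendered_source and extract_img_src, and the fixed five-second wait, are illustrative assumptions rather than part of either package. The idea is to let a real browser run the JavaScript, then hand the rendered HTML back to rvest.

```r
library(rvest)

# Parse already-rendered HTML and return the src of every img tag.
extract_img_src <- function(rendered_html) {
  read_html(rendered_html) %>%
    html_nodes("img") %>%
    html_attr("src")
}

# Drive a real browser so the page's JavaScript runs before the source is read.
# Assumes RSelenium and Firefox are installed; this function is not called here.
fetch_rendered_source <- function(url, wait_seconds = 5) {
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  client <- driver$client
  # Close the browser and stop the Selenium server even if navigation fails.
  on.exit({
    client$close()
    driver$server$stop()
  }, add = TRUE)
  client$navigate(url)
  Sys.sleep(wait_seconds)  # crude wait for the scripts to finish rendering
  client$getPageSource()[[1]]
}

# Usage (requires a working Selenium setup):
# src <- fetch_rendered_source("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue")
# extract_img_src(src)
```

A fixed sleep is the simplest way to wait for rendering; polling for a specific element would be more reliable on slow connections.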


 