
Web scraping from continuous URLs using R

I am trying to scrape data from a website that lists the ratings of multiple products. Let's say a product category has 800 brands. With 10 brands per page, I would need to scrape 80 pages. For example, here is the data for baby care. There are 24 pages' worth of brands that I need - http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D1%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D

The page number (the 1 after page%3D) is the only thing that changes in this URL as we move from page to page, so I thought it would be straightforward to write a loop in R. But what I find is that when I move to page 2, the page does not reload. Instead, only the results are updated, after about 5 seconds. R does not wait for those 5 seconds, so I ended up with the data from the first page 26 times.

I also tried entering the page 2 URL directly and running my code without a loop. Same story: I got the page 1 results. I am sure I can't be the only one facing this. Any help is appreciated. My code is below.

Thanks a million. And I hope my question was clear enough.

library(XML)      # htmlParse, xmlGetAttr, xmlValue
library(selectr)  # querySelector, querySelectorAll

# one row per page
N<-matrix(NA,26,15)
R<-matrix(NA,26,60)

for(n in 1:26){

# build the URL; paste0 avoids the spaces that paste() would insert around the page number
url <- paste0("http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D",n,"%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D")

raw.data <-readLines(url)

Parse <- htmlParse(raw.data)

# container that holds the result list
A<-querySelector(Parse, "div.results-container")

# product links and rating values inside the container
Name<-querySelectorAll(A,"div.reviews>a")
Ratings<-querySelectorAll(A,"div.value")

N[n,]<-sapply(Name,function(x)xmlGetAttr(x,"href"))
R[n,]<-sapply(Ratings,xmlValue)
}

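Everything after the # in your URL is a fragment. It is handled by JavaScript in the browser and is never sent to the server, which is why R receives page 1 every time. The HTML source reveals that the URLs you want can be simplified to this structure:

http://www.goodguide.com/products?category_id=152775-baby-care&page=2&sort_order=DESC

The content of these URLs is retrieved by R as expected.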

Note that you can also skip readLines and pass the URL straight to htmlParse, where n is the page number:

u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
Parse <- htmlParse(u)
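
Putting the pieces together, the loop from the question could be sketched roughly as follows. This is only a sketch: the helper names page_url, scrape_page and scrape_pages are made up for illustration, the selectors are copied from the question, and it assumes the simplified URLs still return the same markup. Results are kept in a list rather than fixed-size matrices, so a page with fewer entries than expected does not break the assignment.

```r
# Hypothetical helpers (not from the original post); requires the XML and
# selectr packages to be installed when scrape_page() is actually called.

# Build the simplified, server-side URL for results page n (no "#" fragment).
page_url <- function(n) {
  sprintf("http://www.goodguide.com/products?category_id=152775-baby-care&page=%d&sort_order=DESC", n)
}

# Fetch one page and pull out the product links and rating values,
# using the same CSS selectors as the code in the question.
scrape_page <- function(n) {
  doc <- XML::htmlParse(page_url(n))
  box <- selectr::querySelector(doc, "div.results-container")
  if (is.null(box)) {
    # Container not found: return empty results instead of failing.
    return(list(page = n, links = character(0), ratings = character(0)))
  }
  links   <- selectr::querySelectorAll(box, "div.reviews>a")
  ratings <- selectr::querySelectorAll(box, "div.value")
  list(
    page    = n,
    links   = unlist(lapply(links, XML::xmlGetAttr, "href")),
    ratings = unlist(lapply(ratings, XML::xmlValue))
  )
}

# Scrape several pages, pausing between requests (the pause length is an
# arbitrary choice, added only to be gentle on the server).
scrape_pages <- function(pages, pause = 1) {
  lapply(pages, function(n) {
    Sys.sleep(pause)
    scrape_page(n)
  })
}

# Usage (performs network requests, so it is left commented out):
# results <- scrape_pages(1:26)
# results[[2]]$ratings
```

Because the page number now travels in the query string rather than in the fragment, each request should return a different page without any need to wait for JavaScript.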
