
Web scraping with R, XML package - XPath that works in the web browser fails on the HTML downloaded and parsed in R

I am web scraping this website (in Portuguese).

In Google Chrome, the XPath expression //div[@class='result-ofertas']//span[@class='location']/a[1] correctly returns the neighborhood of the apartments for sale. You can try this yourself with Chrome's XPath Helper extension.

OK. So I try to download the website with R, using the XML package, to automate the extraction of the data:

library(XML)
site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
# Parse the page into an internal XML document so XPath queries can be run on it
html.raw <- htmlTreeParse(site, useInternalNodes = TRUE, encoding = "UTF-8")

But when I download the website in R, the page source is no longer the same.

The same XPath expression now returns NULL:

xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)

But if you manually download the website to your computer instead of downloading it with R, the XPath above works just fine.

It seems that R is downloading a different webpage (a "mobile" version; it is downloading this one instead of the correct one), not the one shown in Chrome.
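A quick way to verify this (not in the original post, just a standard XML-package call) is to compare the <title> of the parsed document with what Chrome shows:

# Print the parsed page's <title>; if it differs from the title Chrome
# displays, R was indeed served a different (e.g. mobile) page.
xpathSApply(html.raw, "//title", xmlValue)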

My problem is not how to extract the information from this "different" page that R is downloading. I can deal with that using the XPath expression below:

xpathApply(html.raw, "//p[@class='local']", xmlValue)

But I would really like to understand why and how this is happening.

More specifically:

  1. What is happening here?
  2. Why are the two webpages (Chrome's and R's) different, even though the URL is the same?
  3. Is there a way to force R to download the exact webpage I see in Chrome? (This would be useful, because I usually test XPath expressions with the XPath Helper extension.)

The site is most likely redirecting requests based on the user agent. Try setting the request's user agent in R to match your Chrome user agent (you can see it on the Network tab of the developer tools: just select a request and view its headers).
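As a sketch of that suggestion, using RCurl alongside XML (the user-agent string below is a placeholder; paste in the real one Chrome reports in the Network tab):

library(RCurl)
library(XML)

# Placeholder user-agent string: replace with the one shown in Chrome's
# developer tools (Network tab -> select a request -> Headers).
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"

# 'site' is the URL defined in the question. Fetch it while identifying
# as Chrome, then parse the returned text.
page <- getURL(site, useragent = ua, .encoding = "UTF-8")
html.raw <- htmlParse(page, asText = TRUE, encoding = "UTF-8")

# The browser-tested XPath should now match, if the user agent was the cause.
xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)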

I have solved the problem with the download.file() function from the utils package: I first download the file to disk and then parse it. It takes a long time, though; this is not an optimal solution, and I am still not sure why this is happening. So if anyone else has another solution/answer...
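For reference, that workaround looks roughly like this (the local file name is arbitrary):

# Download the raw HTML to disk first, then parse the local copy.
download.file(site, destfile = "zap.html", quiet = TRUE)
html.raw <- htmlTreeParse("zap.html", useInternalNodes = TRUE, encoding = "UTF-8")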
