'RCurl' [R] package getURL webpage error when scraping API

I am trying to scrape data from an API using the getURL function of the RCurl package in R. My problem is that I can't reproduce in R the response I get when I open the URL in Chrome. When I open the API page (URL below) in Chrome it works fine, but if I request it with getURL in R (or open it in Chrome's incognito mode), I get a '500 Internal Server Error' response instead of the JSON I'm looking for.

URL/API in question: http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082

Here is my (failed) request in [R].

test2 <- fromJSON(getURL("http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082", ssl.verifypeer = FALSE, useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"))

My Research so Far

First, I looked at this prior question on Stack Overflow and added my user agent to the request (this did not solve the problem but may still be necessary): ViralHeat API issues with getURL() command in RCurl package

Next, I looked at this helpful post, which guides my rationale: R Disparity between browser and GET / getURL

My Ideas About the Solution

This is not my area of expertise, but my guess is that the request is missing a cookie that the server requires (which would explain why it also fails in my browser in incognito mode). I compared the requests and responses of the successful and unsuccessful requests:

Successful request: [image]

Unsuccessful request: [image]

Does anyone have any ideas? Should I try the RSelenium package that MrFlick suggested in the second post above?

This is a courteous site. It would like to know where you come from, what currency you use, and so on, to give you a better user experience. It does this by setting a multitude of cookies on the landing page. So we follow suit: we navigate to the landing page first to get the cookies, then we go to the page we want:

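library(RCurl)
myURL <- "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082"
agent <- "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0"

# Create a reusable curl handle that stores cookies, sends a browser-like
# user agent, and follows redirects
curl <- getCurlHandle()
curlSetOpt(cookiejar = "cookies.txt", useragent = agent, followlocation = TRUE, curl = curl)
# Visit the landing page first so the site sets its cookies on the handle
firstPage <- getURL("http://www.bluenile.com", curl = curl)
# Request the API page with the same handle, which now carries the cookies
myPage <- getURL(myURL, curl = curl)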

library(RJSONIO)
> names(fromJSON(myPage))
[1] "diamondDetailsHeader" "diamondDetailsBodies" "pageMetadata"         "expandedUrl"         
[5] "newVersion"           "multiDiamond"  

and the cookies:

> getCurlInfo(curl)$cookielist
 [1] ".bluenile.com\tTRUE\t/\tFALSE\t2412270275\tGUID\tDA5C11F5_E468_46B5_B4E8_D551D4D6EA4D"                                                                    
 [2] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tsplit\tver~3&presetFilters~TEST"                                                                               
 [3] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tsitetrack\tver~2&jse~0"                                                                                        
 [4] ".bluenile.com\tTRUE\t/\tFALSE\t1425230275\tpop\tver~2&china~false&french~false&ie~false&internationalSelect~false&iphoneApp~false&survey~false&uae~false" 
 [5] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tdsearch\tver~6&newUser~true"                                                                                   
 [6] ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK"                                         
 [7] ".bluenile.com\tTRUE\t/\tFALSE\t0\tbnses\tver~1&ace~false&isbml~false&fbcs~false&ss~0&mbpop~false&sswpu~false&deo~false"                                   
 [8] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tbnper\tver~5&NIB~0&DM~-&GUID~DA5C11F5_E468_46B5_B4E8_D551D4D6EA4D&SESS-CT~1&STC~32RPVK&FB_MINI~false&SUB~false"
 [9] "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"                                                             
[10] ".bluenile.com\tTRUE\t/\tFALSE\t1727630278\tmigrationstatus\tver~1&redirected~false"     
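
Each entry in the cookie list is a tab-separated line in the Netscape cookie file layout that libcurl uses: domain, subdomain flag, path, secure flag, expiry time, name, value. As a rough sketch, the list can be split into a data frame for easier inspection. The helper name parse_cookies() is hypothetical and not part of RCurl, the two sample lines are copied from the output above, and the sketch assumes every entry has all seven fields (an empty cookie value would need extra handling).

```r
# Two sample entries copied from the cookie list above
cookie_lines <- c(
  ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK",
  "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"
)

# Hypothetical helper (not an RCurl function): split each tab-separated
# cookie line into its seven fields and return one row per cookie
parse_cookies <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- c("domain", "subdomains", "path", "secure",
                  "expires", "name", "value")
  # Expiry is a Unix timestamp; 0 marks a session cookie
  out$expires <- as.numeric(out$expires)
  out
}

cookies <- parse_cookies(cookie_lines)
cookies[, c("domain", "name", "expires")]
```

With a live handle, the same helper could be applied to getCurlInfo(curl)$cookielist to check which cookies the landing page actually set before requesting the API page.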
