how to download a large binary file with RCurl *after* server authentication
i originally asked this question about performing this task with the httr package, but i don't think it's possible using httr. so i've re-written my code to use RCurl instead, but i'm still tripping up on something, probably related to the writefunction, and i really don't understand why.
you should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. i need a solution that downloads directly to the hard disk.
to start, this code works: the zipped file is appropriately saved to the disk.
library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
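one way to confirm that the file really landed on disk, without pulling it into RAM, is to ask the file system for its size. this is a minimal base-R sketch; the helper name size_on_disk_gb is hypothetical and not part of RCurl.

```r
# hypothetical helper: report a file's size in gigabytes without loading it
size_on_disk_gb <- function(path) {
  # file.info() only queries the file system, so nothing is read into memory
  size_bytes <- file.info(path)$size
  if (is.na(size_bytes)) stop("file not found: ", path)
  size_bytes / 1024^3
}

# small self-contained demonstration with a 1 MB dummy file
demo_file <- tempfile()
writeBin(raw(1024^2), demo_file)
demo_size_gb <- size_on_disk_gb(demo_file)
```

after a real download, size_on_disk_gb(filename) can be called on the path that was passed to CFILE.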
now here's some RCurl code that does not work. as stated in the previous question, reproducing this exactly will require creating an extract on ipums.
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
# curl handle that keeps the cookies across requests
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)
# login form fields
params <-
  list(
    "login[email]" = your.email ,
    "login[password]" = your.password ,
    "login[is_for_login]" = 1
  )
# log in, then visit the download page with the same handle
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
and now that i'm logged in, try the same commands as above, but with the curl object to keep the cookies.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
this line breaks:
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary gibberish here]]
the answer to my previous post referred me to this c-level writefunction answer, but i'm clueless about how to re-create that curl_writer C program (on windows?)..
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
..or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. i just don't understand why passing in that extra curl object that stores the authentication/cookies and tells it not to verify SSL would cause code that otherwise works.. to break?
From this link, create a file named curl_writer.c and save it to C:\\<folder where you save your R files>
#include <stdio.h>

/**
 * Original code just sent some message to stderr
 */
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
    fwrite(buffer, size, nmemb, (FILE *)stream);
    return size * nmemb;
}
Open a command window, go to the folder where you saved curl_writer.c and run the R compiler
c:> cd "C:\\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run your script
C:> R
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)
params <-
  list(
    "login[email]" = your.email ,
    "login[password]" = your.password ,
    "login[is_for_login]" = 1
  )
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
# Load the DLL you created
# "writer" is the name of the function
# "curl_writer" is the name of the dll
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
# Note that the "URL" parameter is upper case; in your code it is lowercase
# I'm not sure if that has something to do with it
# "writer" is the symbol defined above
# extract.path is the file to download (no "url" object is defined in this script)
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=extract.path, writedata=f@ref, writefunction=writer, curl=curl)
close(f)
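Once the script finishes, it may be worth checking that the saved file is really a gzip archive and not, for example, an HTML login page returned because the session was not accepted. The following is a minimal base-R sketch; the helper name looks_like_gzip is hypothetical, and the check relies only on the two-byte gzip signature (1f 8b).

```r
# hypothetical helper: TRUE when the file starts with the gzip magic bytes 1f 8b
looks_like_gzip <- function(path) {
  con <- file(path, "rb")          # binary mode, so the bytes are read as stored
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 2L)
  identical(header, as.raw(c(0x1f, 0x8b)))
}

# demonstration on two small local files instead of a real download
gz_file <- tempfile(fileext = ".csv.gz")
gz_con <- gzfile(gz_file, "w")
writeLines(c("a,b", "1,2"), gz_con)
close(gz_con)

html_file <- tempfile(fileext = ".html")
writeLines("<html><body>please log in</body></html>", html_file)

gz_ok <- looks_like_gzip(gz_file)
html_ok <- looks_like_gzip(html_file)
```

With the script above, the call would be looks_like_gzip(filename), since filename holds the tempfile path passed to CFILE.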
this is now possible with the httr package. thanks hadley!
https://github.com/hadley/httr/issues/44
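for reference, here is a rough sketch of what the httr version might look like. it assumes the httr package is installed and uses its write_disk() option to stream the response body to a file; the function name download_extract and its arguments are hypothetical. it also relies on httr re-using the same handle (and so the same cookies) for requests to the same host, and it skips both the ssl.verifypeer = FALSE setting and the intermediate visit to the download page from the RCurl code, which may or may not be needed for ipums.

```r
# sketch only: log in, then stream the extract straight to disk with httr
# nothing touches the network until download_extract() is actually called
download_extract <- function(email, password, extract_url, dest = tempfile()) {
  # same login form fields as in the RCurl code
  login <- httr::POST(
    "https://usa.ipums.org/usa-action/users/validate_login",
    body = list(
      "login[email]" = email,
      "login[password]" = password,
      "login[is_for_login]" = "1"
    ),
    encode = "form"
  )
  httr::stop_for_status(login)

  # write_disk() sends the response body to a file instead of holding it in RAM
  resp <- httr::GET(extract_url, httr::write_disk(dest, overwrite = TRUE))
  httr::stop_for_status(resp)
  dest
}
```

usage would be something like saved <- download_extract(your.email, your.password, extract.path), after which saved holds the path of the file on disk.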