Download a CSV file in R

I'm trying to download historical stock trading data for my country with R. I tried the download.file() function. A file is downloaded, but it is an empty spreadsheet. If I open the same URL in my browser, the file I get is the one I want.

I would love to do it with quantmod, but that package only covers larger markets.

url <- "https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"
destfile <- "/home/hector/TxHistoricas.xls"
download.file(url, destfile)

Thanks in advance.

You can jury-rig something like this if you don't want to use Selenium:

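library(rvest)    # html_session(), html_nodes(), html_table()
library(httr)     # POST(), timeout(), add_headers(), write_disk()
library(stringr)  # str_match()
library(xml2)     # xml_attr(); recent rvest versions no longer attach xml2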

URL <- "https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"

Get the initial URL:

res <- html_session(URL, timeout(30))

The page embeds a form that it submits with JavaScript to get the data, so collect the form's input fields:

inputs <- html_nodes(res, "input")

The last script element on the page performs a redirect on page load, so we need the location it points to:

scripts <- html_nodes(res, "script")
action <- html_text(scripts[[length(scripts)]])

This is the new URL to submit to:

base_url <- "https://www.ccbolsa.cl/apps/script/detalleaccion"
loc <- str_match(action, '\\.action *= *"(.*)"')[,2]
doc_url <- sprintf("%s/%s", base_url, loc)
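The pattern above captures greedily, so it depends on the target being the only quoted string on its line. A small sketch of the difference, using a made-up script string (the real page's script text and the file name export.asp are assumptions for illustration only):

```r
library(stringr)

# Hypothetical script text; the real page's script may look different.
action <- 'document.frm.action = "export.asp"; document.frm.target = "_self";'

# Greedy capture, as in the answer: runs to the last quote on the line.
greedy <- str_match(action, '\\.action *= *"(.*)"')[, 2]

# Stricter capture: stop at the first closing quote.
strict <- str_match(action, '\\.action *= *"([^"]*)"')[, 2]

print(greedy)
print(strict)
```

If the extracted location ever looks too long, swapping in the stricter pattern is probably the first thing to try.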

Gather up all the query params:

query <- lapply(inputs, xml_attr, "value")
names(query) <- sapply(inputs, xml_attr, "name")
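To see what these two lines produce without hitting the site, here is a minimal sketch run against an inline form. The field names and values are invented for illustration and are not taken from the real page:

```r
library(rvest)
library(xml2)

# Hypothetical form; the real page's inputs may differ.
page <- read_html(paste0(
  '<form>',
  '<input type="hidden" name="Nemo" value="AFPCAPITAL">',
  '<input type="hidden" name="Menu" value="H">',
  '</form>'
))
inputs <- html_nodes(page, "input")

# Same two steps as above: values become the list, names label each value.
query <- lapply(inputs, xml_attr, "value")
names(query) <- sapply(inputs, xml_attr, "name")

str(query)
```

An input without a name attribute would end up with an NA name, so it may be worth inspecting query before posting it.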

Now we have to make a new POST request with the query encoded as "form", passing the original URL as the Referer header (the timeout was necessary for me). This writes the "xls" content to a file:

ret <- POST(doc_url, 
            body=query, 
            encode="form",
            add_headers(Referer=URL),
            write_disk("fil.xls", overwrite=TRUE),
            timeout(30))

It says it's an XLS file:

ret$headers$`content-type`
## [1] "application/vnd.ms-excel"

but it's really an HTML table, so you can just do:

ret <- POST(doc_url, 
            body=query, 
            encode="form",
            add_headers(Referer=URL),
            timeout(30))

doc <- read_html(content(ret, as="text"))
dat <- html_table(html_nodes(doc, "table"), fill=TRUE)

to get what you're looking for (there are two ugly tables in the dat list, and you may want to pass header=TRUE as an additional parameter to html_table).
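As a rough illustration of the header=TRUE suggestion, the sketch below parses a made-up table standing in for the response body. The column names and values are invented, not real data from the site:

```r
library(rvest)

# Made-up table standing in for the response body.
html <- paste0(
  '<table>',
  '<tr><td>Fecha</td><td>Precio</td></tr>',
  '<tr><td>2016-01-04</td><td>1500</td></tr>',
  '</table>'
)
doc <- read_html(html)

# header = TRUE promotes the first row to column names.
tabs <- html_table(html_nodes(doc, "table"), header = TRUE)
dat <- tabs[[1]]

print(names(dat))
print(dat)
```

html_table returns one data frame per table, so with the real response you would pick the element of the list that holds the transactions.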

I am not sure how "dynamic" this solution is, but that's testable/verifiable.
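One check that can be rerun whenever the site changes is whether the saved "xls" file still contains HTML markup. A minimal sketch in base R; looks_like_html is a hypothetical helper, and the temporary file is a stand-in for fil.xls so the example runs offline:

```r
# Hypothetical helper: TRUE when the first lines of a file contain HTML markup.
looks_like_html <- function(path, n = 5) {
  first_lines <- readLines(path, n = n, warn = FALSE)
  any(grepl("<(html|table|!doctype)", first_lines,
            ignore.case = TRUE, useBytes = TRUE))
}

# Stand-in file so the sketch runs without network access.
tmp <- tempfile(fileext = ".xls")
writeLines("<table><tr><td>Fecha</td></tr></table>", tmp)

print(looks_like_html(tmp))
```

If this ever returns FALSE for the downloaded file, the server may have started sending a real spreadsheet, and the html_table step would need to be replaced.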
