
R - curl - download remote file only when changed

For a project I need to regularly download data files from different websites to create an indicator based on those files.

As the update frequency of those files varies a lot, I am looking for an efficient way to detect whether a remote file has been updated.

The answer linked below suggests using the -I option of curl. How does this translate to the curl package?

https://superuser.com/questions/619592/get-modification-time-of-remote-file-over-http-in-bash-script

Alternative solutions seem to parse the headers for either the file size or the modification date, something similar to:

PHP: Remote file size without downloading file

My attempt below (with a small file), however, downloads the full file.

library(curl)


req <- curl_fetch_memory("http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata")
str(req)
object.size(req)
parse_headers(req$headers)

Is it possible either to download just the headers with the curl package or to specify an option that avoids redundant downloads?

You'll have to keep a history of the files' last-modified dates (assuming the web server reports them consistently) and check against it with httr::HEAD() before downloading. In other words, you have some bookkeeping to do: store each last-modified value somewhere, probably in a data frame alongside the URL:

library(httr)

URL <- "http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata"

#' Download a file only if it hasn't changed since \code{last_modified}
#' 
#' @param URL url of file
#' @param fil path to write file
#' @param last_modified \code{POSIXct}. Ideally, the output from the first 
#'        successful run of \code{get_file()}
#' @param overwrite overwrite the file if it exists?
#' @param .verbose output a message if the file was unchanged?
get_file <- function(URL, fil, last_modified=NULL, overwrite=TRUE, .verbose=TRUE) {

  if ((!file.exists(fil)) || is.null(last_modified)) {
    res <- GET(URL, write_disk(fil, overwrite))
    return(httr::parse_http_date(res$headers$`last-modified`))
  } else if (inherits(last_modified, "POSIXct")) {
    res <- HEAD(URL)
    cur_last_mod <- httr::parse_http_date(res$headers$`last-modified`)
    if (cur_last_mod != last_modified) {
      res <- GET(URL, write_disk(fil, overwrite))
      return(httr::parse_http_date(res$headers$`last-modified`))
    }
    if (.verbose) message(sprintf("'%s' unchanged since %s", URL, last_modified))
    return(last_modified)
  } 

}

# first run == you don't know the last-modified date.
# you need to pair this with the URL in some data structure for later use.
last_mod <- get_file(URL, basename(URL))

class(last_mod)
## [1] "POSIXct" "POSIXt"

last_mod
## [1] "2015-11-16 17:34:06 GMT"

last_mod <- get_file(URL, basename(URL), last_mod)
#> 'http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata' unchanged since 2015-11-16 17:34:06
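
The bookkeeping mentioned above (pairing each URL with its last-modified value in a data frame) might look roughly like the sketch below. `new_registry()`, `update_registry()` and the file name `registry.rds` are hypothetical names introduced here for illustration; `fetch` is assumed to be a function with the same first three arguments and return value as `get_file()` above.

```r
# Sketch only: new_registry() and update_registry() are hypothetical helpers,
# not part of httr or curl.

# One row per URL; NA means "never downloaded yet".
new_registry <- function(urls) {
  data.frame(
    url = urls,
    last_modified = rep(as.POSIXct(NA, tz = "GMT"), length(urls)),
    stringsAsFactors = FALSE
  )
}

# Call `fetch` (e.g. get_file) for every row and record what it returns.
update_registry <- function(registry, fetch, dir = ".") {
  for (i in seq_len(nrow(registry))) {
    url  <- registry$url[i]
    prev <- registry$last_modified[i]
    # Pass NULL for unknown dates so that `fetch` does a full download
    prev_arg <- if (is.na(prev)) NULL else prev
    new <- fetch(url, file.path(dir, basename(url)), prev_arg)
    # Keep the old value if `fetch` returned nothing usable
    if (!is.null(new) && length(new) == 1L) {
      registry$last_modified[i] <- new
    }
  }
  registry
}

# Possible usage (needs get_file() from above and network access):
# reg <- new_registry(URL)
# reg <- update_registry(reg, get_file)
# saveRDS(reg, "registry.rds")      # persist between runs
# reg <- readRDS("registry.rds")    # restore on the next run
```

Passing the download function in as an argument keeps the registry logic independent of httr, so it can be exercised with a stub before pointing it at real URLs.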

An alternative to the httr package is the base function base::curlGetHeaders(url), but you still have to parse the last-modified date yourself!
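
Parsing that date by hand could be done along these lines. This is a minimal sketch: `parse_last_modified()` is a hypothetical helper name, and it assumes the server sends the date in the usual HTTP format (for example `Mon, 16 Nov 2015 17:34:06 GMT`).

```r
# Sketch only: parse_last_modified() is a hypothetical helper.
# `header_lines` is a character vector of raw header lines, which is what
# base::curlGetHeaders() returns (each line still ends in "\r\n").
parse_last_modified <- function(header_lines) {
  hits <- grep("^last-modified:", header_lines, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0L) return(NULL)  # server did not report a date

  # After redirects there can be several header blocks; use the last one
  value <- trimws(sub("^[^:]+:", "", hits[length(hits)]))

  # HTTP dates use English day and month names, so parse in the C locale
  old_locale <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old_locale), add = TRUE)
  Sys.setlocale("LC_TIME", "C")

  # HTTP dates are expressed in GMT; the trailing "GMT" is not part of the
  # format and is left unparsed
  as.POSIXct(strptime(value, "%a, %d %b %Y %H:%M:%S", tz = "GMT"))
}

# Possible usage (needs network access):
# hdrs <- curlGetHeaders(URL)
# parse_last_modified(hdrs)
```

The function returns NULL when no Last-Modified header is present, so the caller can fall back to a full download in that case.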

