Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during installation, then prepares and saves it so that it is available when the package is loaded with library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?

EDIT: Some background. The data is three tab-separated files in a ZIP archive, owned by federal statistics and generally freely accessible. I have R code that downloads, extracts and prepares the data; in the end, three data frames are created, which could be saved in .RData format.
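The download, extract, prepare and save steps described above could look roughly like the following. This is only a sketch: the function names (`read_tables`, `save_tables`, `prepare_data`), the `zip_url` argument and the default output file name are assumptions made for illustration, and the real preparation code would do more than a plain `read.delim()`.

```r
# Read extracted tab-separated files into a named list of data frames.
# Each data frame is named after its file (without the extension).
read_tables <- function(files) {
  tables <- lapply(files, utils::read.delim, stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# Save every data frame under its own name in a single .RData file.
save_tables <- function(tables, out_file) {
  save(list = names(tables), envir = list2env(tables), file = out_file)
  invisible(out_file)
}

# Download the ZIP archive, extract it, then prepare and save the data.
# `zip_url` is a placeholder for the real data source.
prepare_data <- function(zip_url, out_file = "DATAFILE.RData") {
  zip_path <- tempfile(fileext = ".zip")
  utils::download.file(zip_url, zip_path, mode = "wb")
  files <- utils::unzip(zip_path, exdir = tempfile("extracted"))
  save_tables(read_tables(files), out_file)
}
```

Splitting the work this way keeps the network access in one place, so the reading and saving steps can be run on files that are already on disk.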

I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.

I wrote this mockup while you were posting your edit. I presume it would work, but I have not tested it. I've commented it so you can see what you would need to change. The idea is to check whether an expected object is available in the current workspace. If it is not, check whether the file containing the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.

myFunction <- function(this, that, dataset = NULL) {

  # We're giving the user a chance to specify the dataset.
  #   Maybe they have already downloaded it and saved it.
  if (is.null(dataset)) {

    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    #   contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (file.exists("DATAFILE.RData")) {
        load("DATAFILE.RData")

        # If neither of those are successful, prompt the user
        #   to download the dataset.
      } else {
        ans = readline(paste0(
          "DATAFILE.RData dataset not found in working directory.\n",
          "OBJECTNAME object not found in workspace.\n",
          "Download and load the dataset now? (y/n) "))
        if (ans != "y")
          return(invisible())

        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL = c("http://some/base/url/")

        # Here, we actually download the data
        temp = getBinaryURL(paste0(baseURL, "DATAFILE.RData"))

        # Here we load the data
        load(rawConnection(temp), envir=.GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"), 
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}

Two packages, hosted at GitHub

Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:

  1. Check to see if the data package is installed
  2. Check to see if the version of the data package you have installed matches the version at GitHub, which we are going to assume is the most up-to-date version.

If either check fails, it asks the user whether they want to update their installation of the package. For demonstration, I've linked to one of my in-progress packages at GitHub. This should give you an idea of what you need to substitute to make it work with your own package once you've hosted it there.

CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    # Extract the Version field and trim all surrounding whitespace
    CurrentVersion <- gsub("^\\s+|\\s+$", "", 
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    # An installed version newer than the one at GitHub also passes,
    #   so Checks is always defined before the switch below.
    if (packageVersion("StataDCTutils") >= CurrentVersion) {
      Checks <- "Passed"
    } else {
      Checks <- "Failed"
    }
  }

  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans = readline(
        "StataDCTutils is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      # Current devtools expects the repository as "user/repo"
      devtools::install_github("mrdwab/StataDCTutils")
    })
# Some cool things you want to do after you are sure the data is there
}

Try it out with CheckVersionFirst().

Note: This works only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to GitHub!
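Since everything hinges on the Version field, it may be worth parsing it with `read.dcf()` rather than a regular expression that assumes a Date line follows. The sketch below is an assumption-laden alternative, not part of the function above: `remote_version` and `needs_update` are hypothetical helper names, and `description_text` stands for the DESCRIPTION contents already fetched as a single string.

```r
# Parse the Version field from the text of a DESCRIPTION file.
remote_version <- function(description_text) {
  lines <- unlist(strsplit(description_text, "\r?\n"))
  con <- textConnection(lines)
  on.exit(close(con))
  fields <- read.dcf(con, fields = "Version")
  package_version(fields[1, "Version"])
}

# TRUE when the installed version is older than the remote one.
# Versions are compared component by component, so 1.9.0 < 1.10.0.
needs_update <- function(installed, description_text) {
  package_version(as.character(installed)) < remote_version(description_text)
}
```

With these helpers, a call such as `needs_update(packageVersion("StataDCTutils"), temp)` could replace the string comparison, where `temp` is the text returned by `getURL()`.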

So, to clarify/recap/expand, the basic idea would be to:

  • Periodically push the updated version of your data package to GitHub, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
  • Integrate this CheckVersionFirst() function as an .onLoad hook in your code package. (Obviously, modify the function to match your account and package name.)
  • Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data.

This method may not be the most efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package that contains that data. It does not need any functions, but I like the concept of a getter and a setter (the latter you might not need in this case).
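One way to read the getter and setter idea is a pair of functions around a private environment in the data package. The names `.store`, `set_data` and `get_data` are hypothetical; this is a sketch of the pattern, not a prescribed layout.

```r
# Private environment holding the datasets; not exported from the package.
.store <- new.env(parent = emptyenv())

# Setter: register or replace a dataset under a name.
set_data <- function(name, value) {
  assign(name, value, envir = .store)
  invisible(value)
}

# Getter: return a dataset, failing clearly if it is unknown.
get_data <- function(name) {
  if (!exists(name, envir = .store, inherits = FALSE)) {
    stop("No dataset named '", name, "'")
  }
  get(name, envir = .store, inherits = FALSE)
}
```

The code package would then call something like `get_data("alpha")` instead of relying on objects in the user's workspace.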

Then, when you make your package, declare the 'data' package as a dependency. This way, whenever someone installs your package, they will also get the latest available version of the data.

On your part, you'll just have to swap out the data in your 'data' package and upload it to the repo you want.

If you don't know how to build a package, check ?package.skeleton and R CMD check, R CMD build.
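As a rough illustration of those tools, the following creates the skeleton of a data-only package in a temporary directory. The package name `mydata` and the `alpha` data frame are placeholders standing in for the real prepared data.

```r
# Hypothetical example data standing in for the prepared data frames.
pkg_env <- new.env()
assign("alpha", data.frame(id = 1:2, value = c("a", "b")), envir = pkg_env)

# Create the package skeleton in a temporary parent directory.
pkg_parent <- tempfile("pkgs")
dir.create(pkg_parent)
utils::package.skeleton(
  name = "mydata",        # placeholder package name
  list = "alpha",         # objects to include in the package
  environment = pkg_env,
  path = pkg_parent
)

# Inspect what was generated.
pkg_dir <- file.path(pkg_parent, "mydata")
list.files(pkg_dir)

# Then, from a shell in pkg_parent, after editing DESCRIPTION and the
# generated help files (the tarball name depends on the Version field):
#   R CMD build mydata
#   R CMD check mydata_<version>.tar.gz
```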
