
How to get the GitHub repo URL for all the packages on CRAN?

I would like to extract the GitHub repo URL for all the packages on CRAN. My first attempt was to read the CRAN page and get the table of all package names, which also contains the URL of each package's description page, since I want to extract the GitHub repo URL from the description page. But I can't get the complete URL. Could you please help me with this? Or is there a better way to get the repo URL for all packages?

As a supplement: what I actually want is to filter out the packages that do have an official GitHub repo, such as xfun or fddm. I found that I can extract the username and repo name from each package's description on CRAN and put them into a GitHub-formatted URL, since most of them follow the same format: https://github.com/{username}/{reponame}. For example, for the package xfun it would be https://github.com/yihui/xfun.

So far I have constructed some of these URLs (three of them are shown below):

[screenshot: three constructed GitHub repo URLs]

And I am wondering how I could get the URL for all of them. I know the glue package can substitute elements (username and reponame) into a URL, and to build the URLs that way I have tried the map() and map_dfr() functions. But it returns an error:

Error in parse_url(url) : length(url) == 1 is not TRUE

Here is my code:

get <- map_dfr(dat, ~{
  username <- dat$user
  reponame <- dat$package

  pkg_url <- GET(glue::glue("https://github.com/{username}/{reponame}"))
})

Could you please help me with this? Thanks a lot :)

I want to suggest a different method for getting where you want.

As discussed in the comments, not all R packages have public GitHub repos.
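First, a quick note on the error itself: httr::GET() calls parse_url(), which requires a single URL, but inside your map_dfr() call you reference the whole columns dat$user and dat$package rather than the per-element values, so glue() returns one URL per row and GET() receives a character vector. A minimal sketch of the per-row version, assuming dat really has columns named user and package as in your snippet:

library(httr)
library(purrr)

## glue() is vectorized over the columns, so this builds every URL in one call ...
urls <- glue::glue("https://github.com/{dat$user}/{dat$package}")

## ... but GET() takes one URL at a time, so iterate over them
responses <- map(urls, GET)
statuses  <- map_int(responses, status_code)  # 200 = page exists, 404 = it doesn't

Be aware that this makes one HTTP request per package, which is slow (and rate-limited) across thousands of packages; that is part of why I suggest working from CRAN's own metadata instead.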

Here is a version of some code from an answer to another question by Dirk Eddelbuettel that retrieves information from CRAN's package database, including the package name and the URL field. If a package has a public GitHub repo, it is very likely that the authors have included that information in the URL field. There may be a few packages where the GitHub repo is guessable (i.e. the GitHub user name is the same as, say, the identifier in the maintainer's e-mail address, and the repo name is the same as the package name), but it seems like a lot of work to do all that guessing (and to access GitHub to check whether each guess was correct) for a relatively low return.

getPackageRDS <- function() {
    ## packages.rds holds the full package metadata for the configured CRAN mirror
    description <- sprintf("%s/web/packages/packages.rds",
                           getOption("repos")["CRAN"])
    ## open a file connection for local (file://) mirrors, a URL connection otherwise
    con <- if (substring(description, 1L, 7L) == "file://") {
        file(description, "rb")
    } else {
        url(description, "rb")
    }
    on.exit(close(con))
    db <- readRDS(gzcon(con))
    rownames(db) <- NULL
    return(db)
}
dd <- as.data.frame(getPackageRDS())
dd2 <- subset(dd, grepl("github.com", URL))
## clean up (multiple URLs, etc.): keep the first github.com entry per package
dd2$URL <- sapply(strsplit(dd2$URL, "[, \n]"),
                  function(x) trimws(grep("github.com", x, value = TRUE)[1]))

As of today (25 May 2021) there are 17665 packages in total on CRAN, of which 6184 have "github.com" in the URL field. Here are the first few results:

        Package                                           URL
5        abbyyR              http://github.com/soodoku/abbyyR
12     ABCoptim           http://github.com/gvegayon/ABCoptim
16     abctools     https://github.com/dennisprangle/abctools
18        abdiv        https://github.com/kylebittinger/abdiv
20        abess           https://github.com/abess-team/abess
23 ABHgenotypeR http://github.com/StefanReuscher/ABHgenotypeR

The URL field may still not be completely clean, but this should get you most of the way there.
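From there, you can recover the {username}/{reponame} pairs you were originally after with a regular expression. A rough sketch (the pattern and the new column names are my own, and may not cover every malformed entry in the URL field):

## pull user and repo out of the cleaned URLs; non-matching entries become NA
m <- regmatches(dd2$URL, regexec("github\\.com/([^/#?]+)/([^/#?]+)", dd2$URL))
dd2$user <- sapply(m, function(x) if (length(x) == 3) x[2] else NA_character_)
dd2$repo <- sapply(m, function(x) if (length(x) == 3) sub("\\.git$", "", x[3]) else NA_character_)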


An alternative approach would be to use the githubinstall package, which works by downloading a data frame that has been generated by crawling GitHub looking for R packages.

library(githubinstall)
dd3 <- gh_list_packages()

At present there are 34491 packages in this list, so obviously it includes a lot of things that are not on CRAN. You could intersect this list of packages with information from available.packages() ...
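For instance, a rough sketch of that intersection (assuming gh_list_packages() returns a package_name column; check names(dd3) if in doubt):

## names of all packages currently on CRAN
cran_pkgs <- rownames(available.packages())

## keep only the crawled GitHub packages that are also on CRAN
dd3_cran <- subset(dd3, package_name %in% cran_pkgs)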
