
How to get the GitHub repo URL for all the packages on CRAN?

I would like to extract the GitHub repo URL for all the packages on CRAN. I first tried reading the CRAN index page to get the table of all package names, which also contains the URL of each package's description page, since I wanted to extract the GitHub repo URL from the description page. But I can't build the complete URLs. Could you please help me with this? Or is there a better way to get the repo URL for all packages?

Here is some supplementary detail: I actually want to filter for the packages that do have an official GitHub repo, such as xfun or fddm. I found that I can extract the username and repo name from a package's description on CRAN and put them into a GitHub-formatted URL, since most of them share the same format, https://github.com/{username}/{reponame}. For example, for the package xfun it would be https://github.com/yihui/xfun.

And now I have got some of them, for three packages:

[screenshot of a data frame dat with columns user and package]

I am wondering how I could get the URL for all of them. I know the glue package can substitute elements into a URL, and to build the URLs by replacing the elements (username and reponame) I have tried the map() and map_dfr() functions. But it returns an error: Error in parse_url(url): length(url) == 1 is not TRUE

Here is my code:

get <- map_dfr(dat, ~{
  username <- dat$user
  reponame <- dat$package
  pkg_url <- GET(glue::glue("https://github.com/{username}/{reponame}"))
})

Could you please help me with this? Thanks a lot! :)
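The error in your map_dfr() call, by the way, comes from using the whole columns dat$user and dat$package inside the function: glue() then returns a character vector of URLs, and httr::GET() insists on a single URL (hence length(url) == 1 is not TRUE). Iterating over the rows pairwise avoids this; here is a minimal sketch, assuming dat has the columns user and package shown above:

library(purrr)
library(httr)

## one request per user/package pair
responses <- map2(dat$user, dat$package, function(username, reponame) {
  GET(glue::glue("https://github.com/{username}/{reponame}"))
})

## HTTP status 200 suggests the repo exists and is publicly visible
dat$repo_exists <- map_lgl(responses, ~ status_code(.x) == 200)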

That said, I want to suggest a different method for getting where you want to go.

As discussed in the comments, not all R packages have public GitHub repos.

Here is a version of some code from Dirk Eddelbuettel's answer to another question that retrieves information from CRAN's package database, including the package name and the URL field. If a package has a public GitHub repo, it is very likely that the authors have included that information in the URL field. There may be a few packages where the GitHub repo is guessable (i.e. the GitHub user name matches, say, the identifier in the maintainer's e-mail address, and the repo name matches the package name), but it seems like a lot of work to do all that guessing (and to query GitHub to check each guess) for a relatively low return.

getPackageRDS <- function() {
     description <- sprintf("%s/web/packages/packages.rds",
                            getOption("repos")["CRAN"])
     con <- if(substring(description, 1L, 7L) == "file://") {
         file(description, "rb")
     } else {
         url(description, "rb")
     }
     on.exit(close(con))
     db <- readRDS(gzcon(con))
     rownames(db) <- NULL
     return(db)
}
dd <- as.data.frame(getPackageRDS())
dd2 <- subset(dd, grepl("github.com", URL))
## clean up (multiple URLs, etc.)
dd2$URL <- sapply(strsplit(dd2$URL,"[, \n]"),
       function(x) trimws(grep("github.com", x, value=TRUE)[1]))

As of today (25 May 2021) there are 17665 packages in total on CRAN, of which 6184 have "github.com" in the URL field. Here are the first few results:

        Package                                           URL
5        abbyyR              http://github.com/soodoku/abbyyR
12     ABCoptim           http://github.com/gvegayon/ABCoptim
16     abctools     https://github.com/dennisprangle/abctools
18        abdiv        https://github.com/kylebittinger/abdiv
20        abess           https://github.com/abess-team/abess
23 ABHgenotypeR http://github.com/StefanReuscher/ABHgenotypeR

The URL field may still not be completely clean, but this should get you most of the way there.
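If what you ultimately want are the username and repo name themselves (as in your glue() template), a regular expression over the cleaned URL column will pull them out. A sketch, continuing from dd2 above:

## extract GitHub user and repo name from the cleaned URLs
m <- regmatches(dd2$URL, regexec("github\\.com/([^/]+)/([^/#]+)", dd2$URL))
dd2$user <- vapply(m, function(x) if (length(x) == 3) x[2] else NA_character_, character(1))
dd2$repo <- vapply(m, function(x) if (length(x) == 3) sub("/+$", "", x[3]) else NA_character_, character(1))
head(dd2[, c("Package", "user", "repo")])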


An alternative approach would be to use the githubinstall package, which works by downloading a data frame generated by crawling GitHub for R packages.

library(githubinstall)
dd3 <- gh_list_packages()

At present there are 34491 packages in this list, so it obviously includes a lot of packages that are not on CRAN. You could intersect this list with the output of available.packages() to keep only the ones that are.
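A sketch of that intersection, assuming gh_list_packages() returns its package names in a package_name column (as its documentation describes):

## keep only the crawled GitHub packages that are also on CRAN
cran_pkgs <- rownames(available.packages())
dd3_cran <- subset(dd3, package_name %in% cran_pkgs)
nrow(dd3_cran)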
