
Get metadata on csv file in a GitHub repo with R? (i.e., file.info but for online files)

Is there a simple, non-API R command or function to get the basic metadata on a csv file that is in a GitHub repository? I especially need: (1) date of last commit and (2) size in bytes, which I'm trying to pull into an RMarkdown document.

Here is an example file.

I don't know of a simple function to do this, but you can write a little web scraping function with rvest to do the job:

library(rvest)

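# Scrape the file name, displayed size, and last-commit time from a GitHub
# "blob" page. This depends on GitHub's HTML (class names and the position
# of the size text), so it may break if the page layout changes.
file_metadata <- function(url) {
  
  page <- read_html(url)
  
  # The file name is the last component of the URL
  file <- tail(strsplit(url, "/")[[1]], 1)
  # Class attribute of the div that contains the file size
  div1 <- "text-mono f6 flex-auto pr-3 flex-order-2 flex-md-order-1"
  
  # The size is the fifth element of that div's text after splitting on
  # newlines and trimming whitespace, e.g. "1.32 KB"
  size <- page %>%
    html_elements(xpath = paste0("//div[@class='", div1, "']")) %>%
    html_text() %>%
    strsplit("\n") %>%
    sapply(trimws) %>%
    getElement(5)
  
  # <relative-time> elements carry the commit timestamp in their
  # `datetime` attribute
  last_commit <- page %>% 
    html_elements("relative-time") %>% 
    html_attr("datetime") %>%
    as.POSIXct()
  
  # Return the results as a one-row data frame
  data.frame(file, size, last_commit)
}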

Testing it on your example file URL, we have:

file_metadata(example_file)
#>                  file    size last_commit
#> 1 EB_data_example.csv 1.32 KB  2022-01-18

Created on 2022-10-04 with reprex v2.0.2
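
Since the question asks for the size in bytes and file_metadata() returns it as displayed text (e.g. "1.32 KB"), here is a rough sketch of a converter. size_to_bytes is a hypothetical helper of my own, not part of rvest or the code above. It assumes the units are binary multiples (1 KB = 1024 bytes), which I have not verified against how GitHub computes the displayed value, and the result is only approximate because the displayed size is already rounded.

```r
# Hypothetical helper (not part of the original answer): convert a single
# size string such as "1.32 KB" into an approximate number of bytes.
# Assumption: units are binary multiples (1 KB = 1024 bytes).
size_to_bytes <- function(size) {
  # Split "1.32 KB" into the numeric part and the unit
  parts <- strsplit(trimws(size), "\\s+")[[1]]
  value <- suppressWarnings(as.numeric(parts[1]))
  unit  <- toupper(parts[2])

  multipliers <- c(BYTES = 1, BYTE = 1, B = 1,
                   KB = 1024, MB = 1024^2, GB = 1024^3)

  # Fail loudly on anything we do not recognise instead of guessing
  if (is.na(value) || !unit %in% names(multipliers)) {
    stop("Unrecognised size string: ", size)
  }

  round(value * multipliers[[unit]])
}

size_to_bytes("1.32 KB")
#> [1] 1352
```

It handles one string at a time, so with the data frame above you could call size_to_bytes(result$size) on the single row returned by file_metadata().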


Example file URL in full:

example_file <- paste0("https://github.com/BrunaLab/LAS6292_DataManagement/",
                       "blob/4b856c2fad350edaded78fba671023b8c544b1dd/",
                       "static/course-materials/class-sessions/03-spreadsheets/examples/",
                       "EB_data_example.csv")
