Looping through URLs with R

Question

I need to download a series of Excel files from URLs that all look as follows:

http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31
http://example.com/orResultsED.cfm?MODE=exED&ED=02&EventId=31
...
http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31

I've got some of the building blocks inside the loop, such as:

for(i in 1:87) {
    url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i, "&EventId=31")
    file <- paste0("Data/myExcel_", i, ".xlsx")
    if (!file.exists(file)) download.file(url, file) 
}

My problems:

I need the sequence number zero-padded to two digits (01, 02, ...). I tried sprintf with no luck.
I also want to import the Excel files, skip the first two rows, and append them one after the other (they all have the same columns).

Update

@akrun's solution works well. But it turns out not all my Excel files have the same number of columns:

map(files, ~read.xlsx(.x,
                      colNames = FALSE,
                      sheet = 1,
                      startRow = 4)) %>%
  bind_rows

Error in bind_rows_(x, .id) : 
  Column `X1` can't be converted from numeric to character

I think this error actually points to the unequal number of columns. I tried adding fill = NA (when testing map_df()), but it didn't help.

Answer 1

We can create the zero-padded number with sprintf:

paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", sprintf("%02d", 1), "&EventId=31")
#[1] "http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31"
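
Because sprintf is vectorised, all 87 URLs can also be built in one call, without a loop. A minimal sketch using the URL pattern from the question; the variable names ids and urls are only illustrative:

```r
# sprintf() is vectorised: passing 1:87 yields 87 zero-padded strings.
ids <- sprintf("%02d", 1:87)

# paste0() recycles the fixed parts of the URL around each id.
urls <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", ids, "&EventId=31")

head(urls, 2)
```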

In the loop:

for(i in 1:87) {
    i1 <- sprintf('%02d', i)
    url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i1, "&EventId=31")
    file <- paste0("Data/myExcel_", i, ".xlsx")
    if (!file.exists(file)) download.file(url, file)
}

Assuming that the files have been downloaded to the working directory:

files <- list.files(full.names = TRUE)
library(openxlsx)
library(purrr)
library(dplyr)
map(files, ~read.xlsx(.x, sheet = 1, startRow = 3))  %>%
      bind_rows

Or, as @hrbrmstr mentioned in the comments, map_df can be used, which returns a single dataset:

map_df(files, ~read.xlsx(.x, sheet = 1, startRow = 3))
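
Note that the download loop saves into Data/, while list.files() with no arguments lists everything in the working directory, including any non-Excel files. A hedged sketch of a narrower listing follows; the helper name list_xlsx is hypothetical, and the Data folder name is taken from the download loop:

```r
# Hypothetical helper: return only the .xlsx files inside `dir`,
# with the directory prepended so the paths can be passed to a reader.
list_xlsx <- function(dir = "Data") {
  list.files(dir, pattern = "\\.xlsx$", full.names = TRUE)
}

files <- list_xlsx("Data")
```

The resulting files vector can then be passed to map() or map_df() as above.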

Update

Based on the comments from the OP, the column class seems to differ between some of the datasets. In that case, bind_rows gives an error. One option is to use rbindlist from data.table:

map(files, ~read.xlsx(.x, sheet = 1, startRow = 3))  %>%
      data.table::rbindlist(fill = TRUE)
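
To see where the classes differ, it can help to compare the column classes across the data frames before binding. The sketch below uses base R only and two small made-up data frames in place of the real spreadsheets (their contents are invented for illustration). bind_as_character is a hypothetical helper, not part of any package: it converts every column to character and adds missing columns as NA before stacking.

```r
# Two toy data frames standing in for sheets read from different files.
dfs <- list(
  data.frame(X1 = c(1, 2), X2 = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(X1 = c("n/a", "3"), X2 = c("c", "d"), X3 = c("x", "y"),
             stringsAsFactors = FALSE)
)

# Class of column X1 in each data frame; a mix of "numeric" and
# "character" is what triggers the conversion error in bind_rows.
sapply(dfs, function(d) class(d[["X1"]]))

# Hypothetical helper: coerce all columns to character, add any missing
# columns as NA, then stack the pieces with rbind().
bind_as_character <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    d[] <- lapply(d, as.character)
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA_character_
    d[all_cols]
  })
  do.call(rbind, padded)
}

combined <- bind_as_character(dfs)
```

After binding, individual columns can be converted back with as.numeric() where that makes sense.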

Answer 2

Downloading and reading in one loop. Hopefully the columns are aligned; if not, use something like plyr::rbind.fill instead of do.call(rbind, list).

do.call(rbind, lapply(1:87, function(n) {
    url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", 
        sprintf("%02d", n), "&EventId=31")
    file <- paste0("Data/myExcel_", n, ".xlsx")
    if (!file.exists(file)) {
        download.file(url, file)
        Sys.sleep(5)  # pause between requests
    }
    # the data frame must be the last expression so that lapply returns it
    readxl::read_excel(file, skip = 2)
}))

Answer 3

You can also use regmatches:

num <- sprintf("%02.0f", 1:87)
urls <- rep("http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31", 87)
regmatches(urls, regexpr("\\d+", urls)) <- num
urls[87]
#[1] "http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31"

To build all the file names:

files <- paste0("Data/myExcel_", num, ".xlsx")

To download the files:

mapply(function(x, y) if (!file.exists(x)) download.file(y, x), files, urls)
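
One caveat with download.file: on Windows, binary files such as .xlsx can end up corrupted unless mode = "wb" is passed, and a single failed request stops the whole run. A hedged sketch that addresses both follows. download_missing and its fetch argument are hypothetical names introduced here so that the downloader can be swapped out, and the pause length is an arbitrary choice.

```r
# Hypothetical helper: download each URL to its file unless the file exists.
# `fetch` defaults to download.file() in binary mode.
download_missing <- function(files, urls, pause = 1,
                             fetch = function(url, dest) {
                               download.file(url, dest, mode = "wb")
                             }) {
  ok <- logical(length(files))
  for (k in seq_along(files)) {
    if (file.exists(files[k])) {
      ok[k] <- TRUE
      next
    }
    # tryCatch() turns a failed download into FALSE instead of an error.
    ok[k] <- tryCatch({
      fetch(urls[k], files[k])
      TRUE
    }, error = function(e) FALSE)
    Sys.sleep(pause)  # short pause between requests
  }
  setNames(ok, files)
}
```

Calling download_missing(files, urls) returns a named logical vector, so names(status)[!status] would list the files that still need to be fetched.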
