

How to download multiple files using loop in R?

I have to download multiple xlsx files containing a country's census data from the internet using R. The files are located at this link. The problems are:

  1. I am unable to write a loop that iterates over the files and downloads them one after another.
  2. The downloaded file gets an obscure code as its name rather than the district name, so how can I change it to the district name dynamically?

I have used the code below:

url <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")

But this downloads one file at a time and does not change the file name.
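For the simple case where the file codes and the matching district names are already known, download.file() can be wrapped in a loop and the destination name built from the district. Below is a minimal sketch; the codes and district names are made-up placeholders, not real census identifiers:

# Minimal sketch assuming the file codes and district names are known in advance.
codes <- c("28532", "28533")              # hypothetical census file codes
districts <- c("DistrictA", "DistrictB")  # hypothetical district names
base <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/"

for (k in seq_along(codes)) {
    url <- paste0(base, "HLPCA-", codes[k], "-2011_H14_census.xlsx")
    dest <- paste0(districts[k], ".xlsx")   # save under the district name
    download.file(url, dest, mode = "wb")
}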

Thanks in advance.

Assuming you want all the data without knowing all of the URLs in advance, your question involves web parsing. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.

Maybe this bit of code is what you're looking for:

library(httr)

base_url = "http://www.censusindia.gov.in/2011census/HLO/"   # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\""))                  # split on links
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)]     # keep links to house listings

names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)                  # extract region names
names = gsub("^\\s+|\\s+$", "", names)                       # trim whitespace
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl)  # extract links

# iterate over regions
for(i in seq_along(links)) {
    url_hh = paste0(base_url, "HL_PCA/", links[i])
    if(http_error(url_hh)) next   # skip unreachable pages (http_error() replaces the deprecated url_success())

    r <- GET(url_hh)
    rc = content(r, "text")
    rcl = unlist(strsplit(rc, "<a href =\\\""))   # split on links
    rcl = rcl[grepl("\\.xlsx", rcl)]              # keep links to xlsx files

    hh_names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)       # extract subregion names
    hh_names = gsub("^\\s+|\\s+$", "", hh_names)         # trim whitespace
    hh_links = gsub("^(.+?\\.xlsx).+$", "\\1", rcl)      # extract links

    # iterate over subregions
    for(j in seq_along(hh_links)) {
        url_xlsx = paste0(base_url, "HL_PCA/", hh_links[j])
        if(http_error(url_xlsx)) next   # skip broken file links

        filename = paste0(names[i], "_", hh_names[j], ".xlsx")
        download.file(url_xlsx, filename, mode = "wb")
    }
}
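If the regular-expression splitting above feels fragile, the same link extraction could be sketched with the rvest package (read_html(), html_nodes(), html_attr()) instead; this is an alternative outline under that assumption, not part of the answer above:

library(rvest)

# Sketch: extract link targets and labels with an HTML parser instead of regex splitting.
page <- read_html("http://www.censusindia.gov.in/2011census/HLO/HL_PCA/Houselisting-housing-HLPCA.html")
hrefs <- html_attr(html_nodes(page, "a"), "href")   # all link targets on the page
texts <- trimws(html_text(html_nodes(page, "a")))   # link labels (region names)

keep <- grepl("Houselisting-housing-.+\\.html", hrefs)   # keep only house-listing pages
hrefs <- hrefs[keep]
texts <- texts[keep]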
