

How to download multiple files using loop in R?

I have to download multiple xlsx files containing a country's census data from the internet using R. The files are located at this link. The problems are:

  1. I am unable to write a loop that iterates over the files and downloads them one after another.
  2. The downloaded file gets an obscure code as its name rather than the district name, so how can I change it to the district name dynamically?

I have used the code below:

url <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")

But this downloads one file at a time and does not change the file name.
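For the simple case where the file codes and the matching district names are already known, download.file() can be wrapped in a loop and the destination name built from the district. Below is a minimal sketch; the codes and district names are made-up placeholders, not real census identifiers:

# Minimal sketch assuming the file codes and district names are known in advance.
codes <- c("28532", "28533")              # hypothetical census file codes
districts <- c("DistrictA", "DistrictB")  # hypothetical district names
base <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/"

for (k in seq_along(codes)) {
    url <- paste0(base, "HLPCA-", codes[k], "-2011_H14_census.xlsx")
    dest <- paste0(districts[k], ".xlsx")   # save under the district name
    download.file(url, dest, mode = "wb")
}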

Thanks in advance.

Assuming you want all the data without knowing all of the URLs in advance, your question involves web parsing. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.

Maybe this bit of code is what you're looking for:

library(httr)

base_url = "http://www.censusindia.gov.in/2011census/HLO/"   # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\""))                  # split on links
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)]     # keep links to house listings

names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)                  # extract region names
names = gsub("^\\s+|\\s+$", "", names)                       # trim whitespace
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl)  # extract links

# iterate over regions
for(i in seq_along(links)) {
    url_hh = paste0(base_url, "HL_PCA/", links[i])
    if(http_error(url_hh)) next   # skip unreachable pages (http_error() replaces the deprecated url_success())

    r <- GET(url_hh)
    rc = content(r, "text")
    rcl = unlist(strsplit(rc, "<a href =\\\""))   # split on links
    rcl = rcl[grepl("\\.xlsx", rcl)]              # keep links to xlsx files

    hh_names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)       # extract subregion names
    hh_names = gsub("^\\s+|\\s+$", "", hh_names)         # trim whitespace
    hh_links = gsub("^(.+?\\.xlsx).+$", "\\1", rcl)      # extract links

    # iterate over subregions
    for(j in seq_along(hh_links)) {
        url_xlsx = paste0(base_url, "HL_PCA/", hh_links[j])
        if(http_error(url_xlsx)) next   # skip broken file links

        filename = paste0(names[i], "_", hh_names[j], ".xlsx")
        download.file(url_xlsx, filename, mode = "wb")
    }
}
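If the regular-expression splitting above feels fragile, the same link extraction could be sketched with the rvest package (read_html(), html_nodes(), html_attr()) instead; this is an alternative outline under that assumption, not part of the answer above:

library(rvest)

# Sketch: extract link targets and labels with an HTML parser instead of regex splitting.
page <- read_html("http://www.censusindia.gov.in/2011census/HLO/HL_PCA/Houselisting-housing-HLPCA.html")
hrefs <- html_attr(html_nodes(page, "a"), "href")   # all link targets on the page
texts <- trimws(html_text(html_nodes(page, "a")))   # link labels (region names)

keep <- grepl("Houselisting-housing-.+\\.html", hrefs)   # keep only house-listing pages
hrefs <- hrefs[keep]
texts <- texts[keep]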
