[英]How to download multiple files using loop in R?
我必须使用R从互联网下载有关一个国家的人口普查数据的多个xlsx文件。文件位于此链接中 。问题是:
我使用了以下提到的代码: url<-"http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx" download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")
但这一次下载一个文件,并且不会更改文件名。
提前致谢。
假设您想要所有数据而又不知道所有URL,那么您的查询将涉及Webparsing。 软件包httr提供了有用的功能,用于检索给定网站的HTML代码,您可以解析该HTML代码以获得链接。
也许这段代码就是您想要的:
library(httr)
base_url = "http://www.censusindia.gov.in/2011census/HLO/" # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\"")) # find links
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)] # find links to houslistings
names = gsub("^.+?>(.+?)</.+$", "\\1",rcl) # get names
names = gsub("^\\s+|\\s+$", "", names) # trim names
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1",rcl) # get links
# iterate over regions
for(i in 1:length(links)) {
url_hh = paste0(base_url, "HL_PCA/", links[i])
if(!url_success(url_hh)) next
r <- GET(url_hh)
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\"")) # find links
rcl = rcl[grepl(".xlsx", rcl)] # find links to houslistings
hh_names = gsub("^.+?>(.+?)</.+$", "\\1",rcl) # get names
hh_names = gsub("^\\s+|\\s+$", "", hh_names) # trim names
hh_links = gsub("^(.+?\\.xlsx).+$", "\\1",rcl) # get links
# iterate over subregions
for(j in 1:length(hh_links)) {
url_xlsx = paste0(base_url, "HL_PCA/",hh_links[j])
if(!url_success(url_xlsx)) next
filename = paste0(names[i], "_", hh_names[j], ".xlsx")
download.file(url_xlsx, filename, mode="wb")
}
}
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.