Automate web scraping with R
I have managed to scrape content for a single URL, but am struggling to automate it for multiple URLs.
Here is how it is done for a single page:
library(XML); library(data.table)

theurl <- paste("http://google.com/", url, "/ul", sep = "")  # url holds a single id string
convertUTF <- htmlParse(theurl, encoding = "UTF-8")
tables <- readHTMLTable(convertUTF)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))  # row count of each table
table <- tables[[which.max(n.rows)]]                     # keep the largest table
TableData <- data.table(table)
Now I have a vector of URLs and want to scrape each for the corresponding table.
Here, I read in data comprising multiple http links:
ur.l <- data.frame(read.csv(file.choose(), header = TRUE, fill = TRUE))
theurl <- matrix(NA, nrow = nrow(ur.l), ncol = 1)
for (i in 1:nrow(ur.l)) {
  url <- as.character(ur.l[i, 2])
}
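Incidentally, the loop above overwrites url on every pass and never stores a result. A minimal sketch of a vectorized alternative, assuming the links sit in column 2 of ur.l (the toy data frame below stands in for the CSV read):

```r
# Toy stand-in for the CSV read above; assumes links are in column 2
ur.l <- data.frame(name = c("a", "b"), link = c("36245119", "46894853"))

# No loop needed: extract the whole column at once
urls <- as.character(ur.l[, 2])
```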
Each of the three additional urls that you provided refers to a page that contains no tables, so it's not a particularly useful example dataset. However, a simple way to handle errors is with tryCatch. Below I've defined a function that reads in tables from url u, calculates the number of rows for each table at that url, then returns the table with the most rows as a data.table.

You can then use sapply to apply this function to each url (or, in your case, each org ID, e.g. 36245119) in a vector.
library(XML); library(data.table)

scrape <- function(u) {
  tryCatch({
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"),
                          encoding = 'utf-8')
    tab <- tabs[[which.max(sapply(tabs, nrow))]]  # table with the most rows
    data.table(tab)
  }, error = function(e) e)  # on failure, return the condition object
}
urls <- c('36245119', '46894853', '46892460', '46888721')
res <- sapply(urls, scrape)
Take a look at ?tryCatch if you want to improve the error handling. Presently the function simply returns the errors themselves.
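One possible refinement (my suggestion, not part of the original answer): have the error handler return NULL rather than the condition object, so failed urls can be filtered out of the results afterwards. A self-contained sketch, where safe_scrape and the toy fetch function are hypothetical names introduced here for illustration:

```r
# Sketch: return NULL on failure so bad urls are easy to drop afterwards.
# safe_scrape is a hypothetical wrapper around any scraping function.
safe_scrape <- function(u, fetch) {
  tryCatch(fetch(u), error = function(e) {
    message("skipping ", u, ": ", conditionMessage(e))
    NULL  # NULL instead of the condition object
  })
}

# Toy fetcher that fails on one input, standing in for the real scraper
fetch <- function(u) if (u == "bad") stop("no tables found") else data.frame(id = u)

res <- lapply(c("36245119", "bad", "46894853"), safe_scrape, fetch = fetch)
ok  <- Filter(Negate(is.null), res)  # keep only the successful results
```

Using lapply rather than sapply here keeps the result a plain list, which avoids surprises when the returned tables have different shapes.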